You are preparing for a senior Azure/DevOps role (8–9 years expected depth). Your real experience is ~6 years — that is enough if you demonstrate senior-level thinking: architecture, security, trade-offs, failure scenarios, and cross-team ownership.
| Junior (❌) | Senior (✅) |
|---|---|
| "Private endpoint uses private IP" | Same + Private Link, Private DNS Zone, disable public access, NSG on PE subnet, gov compliance reason |
| Lists tools | Explains why you chose them and what breaks without them |
| One-line definitions | Definition + HHS example + follow-up readiness |
| "I managed AKS" | "I owned AKS platform — create, secure, upgrade, DR, cost, incidents" |
| Section | Purpose |
|---|---|
| Senior Answer (8–9 Years) | Complete interview-ready response — practice this |
| Important Points | Terms, order, and traps — do not skip in the interview |
| Likely Follow-Up Questions | Cross-questions interviewers ask next |
Customize answers with HHS / AKS / 30+ microservices where relevant. Do not claim tools you have not used.
See Follow-Up Question Bank at the bottom of this doc.
| Item | Your Details |
|---|---|
| Name | X |
| Interview target | Senior Azure DevOps / Platform Engineer (8–9 yr depth) |
| Real experience | ~6 years hands-on — frame as senior-scope ownership |
| Cloud | Azure (primary) + AWS |
| Certs | AZ-900, AWS Solutions Architect |
| Recent project | HHS — healthcare, US state gov (FL, IA, SD), 30+ microservices on AKS |
| Core work | Platform ownership: AKS, Terraform, Azure DevOps, DR, security, FinOps |
| Company | HHS |
| Job title | DevOps / Cloud Engineer (platform owner on gov healthcare) |
- Practice aloud — aim for 60–90 seconds per intro question, 2–3 minutes for architecture questions.
- Use STAR (Situation, Task, Action, Result) for behavioral and incident questions.
- Mock interview progress is tracked below — one question at a time.
| # | Question | Status |
|---|---|---|
| 1 | Introduce yourself | ✅ |
| 2 | Years in Azure & DevOps | ✅ |
| 3 | Recent project architecture | ✅ |
| 4 | Azure services worked on | ✅ |
| 5 | VMs, Storage, Backup | ✅ |
| 6 | DR & Azure Backup | ✅ |
| 7 | AKS clusters | ✅ |
| 8 | Pods, Deployments, Services | ✅ |
| 9 | CI/CD tools | ✅ |
| 10 | Dev to Prod deployment | ✅ |
| 11 | VNets and NSGs | ✅ |
| 12 | Public vs Private Endpoints | ✅ |
| 13 | Monitoring tools | ✅ |
| 14 | Troubleshoot production | ✅ |
| 15 | Databases | ✅ |
| 16 | DBA activities | ✅ |
| 17 | Production incidents | ✅ |
| 18 | Client coordination | ✅ |
| 19 | Azure environment architecture | ✅ |
| 20 | Unreachable VM troubleshooting | ✅ |
| 21 | DR testing in Azure | ✅ |
| 22 | Complete AKS architecture | ✅ |
| 23 | AKS cluster upgrades | ✅ |
| 24 | CrashLoopBackOff troubleshooting | ✅ |
| 25 | HPA and Cluster Autoscaler | ✅ |
| 26 | Slow AKS app troubleshooting | ✅ |
| 27 | VNet/NSG/Route Tables | ✅ |
| 28 | ExpressRoute and VPN | ✅ Passed |
| 29 | Azure DevOps pipeline architecture | ✅ Passed |
| 30 | Blue-Green and Canary | ✅ Passed |
| 31 | Production rollbacks | ✅ Passed |
| 32 | Monitoring tools integrated | ✅ Passed |
| 33 | Production alerts configured | ✅ Passed |
| 34 | MongoDB performance monitoring | ✅ Passed |
| 35 | Redis replication and failover | ✅ Passed |
| 36 | GCP Cloud Run / BigQuery / Airflow | ⏳ (repeat at end) |
| 37 | Multi-cloud monitoring (Azure + GCP) | ✅ Passed |
| 38 | Security hardening in Azure | ✅ Passed |
| 39 | VA/PT findings | ✅ Passed |
| 40 | Secure AKS clusters | ✅ Passed |
| 41 | Cloud cost leakages | ✅ Passed |
| 42 | Cost optimization initiatives | ⏭️ Skipped (same as Q41) |
| 43 | Reserved Instances & Savings Plans | ✅ Passed |
| 44 | Storage & idle resource optimization | ✅ Passed |
| 45 | Database response time troubleshooting | ✅ Passed |
| 46 | Prioritize multiple production issues | ✅ Passed |
| 47 | Critical production issue (STAR) | ⏳ |
Document format: Each question has a Senior Answer, Important Points, and Likely Follow-Ups. No raw mock transcripts — customize with your HHS experience when practicing.
"Good morning. I'm X, a DevOps Engineer with six years of experience building secure and scalable infrastructure on Azure and AWS, with my primary focus on Microsoft Azure.
In my role at HHS, I worked on a healthcare platform for US state government clients — Florida, Iowa, and South Dakota. I owned end-to-end infrastructure: provisioning cloud resources, deploying applications on AKS, and day-to-day Kubernetes cluster operations — including upgrades, monitoring, and troubleshooting.
My responsibilities fall into four areas:
- Platform & Kubernetes: AKS cluster maintenance, node pool updates, ingress, and workload deployments
- CI/CD & Automation: Pipelines for 30+ microservices in Azure DevOps and Jenkins, including automated tasks such as on-demand refreshes of dev/stage Postgres databases from production data
- Operations: Incident response, cluster health monitoring, and fast troubleshooting during production issues
- Security & Compliance: Government healthcare compliance requirements and secure pipeline practices
I hold AZ-900 and AWS Solutions Architect certifications. I'm looking to bring my hands-on AKS and Azure DevOps experience into a role where I can own platform reliability at scale.
Thank you."
"I have six years of experience in DevOps and cloud engineering, working across both Azure and AWS.
Azure: Approximately four years of hands-on experience — including VMs, VNets, NSGs, Application Gateway, AKS, ACR, and Azure DevOps pipelines. At HHS, my recent production work has been primarily on Azure — deploying and operating 30+ microservices on AKS using Azure DevOps for CI/CD.
AWS: I also have solid AWS exposure from earlier projects — EC2, S3, IAM, and pipeline integrations — which helps when working in multi-cloud environments.
My growth path:
- Early years: Git, Jenkins, basic cloud VMs, CI/CD fundamentals
- Mid years: Infrastructure as Code with Terraform, containerization with Docker
- Recent years: Deep Azure focus — AKS cluster operations, Azure DevOps repos and pipelines, RBAC for team access, and government healthcare compliance
So while I'm comfortable on both clouds, my deepest recent production experience is on Azure — especially AKS, Azure DevOps, and day-to-day platform operations. I'm fully comfortable owning infrastructure through deployment and production support."
"My recent project at HHS is a healthcare platform for US state governments — Florida, Iowa, South Dakota — running 30+ microservices on Azure.
1. Infrastructure (IaC): Everything is provisioned with Terraform — VNet, subnets, AKS, ACR, Application Gateway, Front Door, Key Vault, and supporting resources.
2. Networking: We use a VNet with private subnets. AKS runs on private nodes — no direct public exposure. Azure Front Door is the global entry point, then traffic flows to Application Gateway (WAF), which routes to the cluster via the Ingress Controller (AGIC).
3. Traffic flow:
User (Browser) → Azure DNS → Azure Front Door (CDN/WAF/global routing) → Application Gateway (WAF, SSL termination) → Ingress Controller (inside AKS) → Kubernetes Services → Pods (30+ microservices) → PostgreSQL (operational DB)
4. CI/CD (Azure DevOps): On pipeline trigger → code passes DevSecOps checks → Docker image built and pushed to ACR → deployed to AKS via Helm/kubectl. Secrets come from Azure Key Vault using service connections and Managed Identity, never stored in code.
5. Monitoring: Prometheus + Grafana inside the cluster for metrics; logs and operational DB monitoring for health and troubleshooting.
6. DR/Backup: VM snapshots, Azure Backup, and Site Recovery for critical components.
Environments: Dev → Stage → Production with separate namespaces/subscriptions and promotion via Azure DevOps pipelines."
"At HHS, I've worked hands-on across the full Azure stack for a healthcare platform with 30+ microservices:
Compute & containers: AKS (private clusters, node pools, upgrades), Azure VMs (early phase + GitLab), Azure Functions (VM auto-shutdown for cost)
Networking: VNet, subnets, NSGs, route tables, Azure Front Door, Application Gateway (WAF), AGIC, Private Endpoints, Private DNS zones, Azure Firewall, Azure Bastion
DevOps & registry: Azure DevOps (CI/CD), ACR, Argo CD (GitOps), Terraform (IaC)
Security & identity: Key Vault, Microsoft Entra ID, Managed Identity, Workload Identity Federation, Azure RBAC, Azure Policy, Defender for Cloud
Monitoring: Azure Monitor, Log Analytics, Container Insights, Application Insights — plus Prometheus/Grafana/OpenTelemetry on AKS
Storage & backup: Blob Storage, managed disks, Azure Backup, Recovery Services Vault, Azure Site Recovery
Data services: Azure Database for PostgreSQL, MongoDB Atlas, Azure Cache for Redis Premium
I provision and operate most of these via Terraform — not ad-hoc portal clicks."
- Group by category (compute, network, security, data) — shows breadth
- Lead with AKS — your core strength
- Mention Terraform — shows IaC discipline
- Which service do you know deepest? (AKS + networking)
- How do Private Endpoints work with AKS?
- Difference between Front Door and Application Gateway?
"Yes, I have managed all three extensively — especially in our early VM-based phase before we moved to AKS.
Azure VMs:
- Hosted applications directly on VMs and self-hosted GitLab on a dedicated VM
- Ran Docker workloads on VMs before containerizing to AKS
- Built Azure Functions to automatically shut down dev/test VMs during off-peak hours (nights/weekends) for cost savings
- Connected via Azure Bastion — no public RDP/SSH
Backup & Snapshots:
- Configured daily VM backups via Azure Backup — e.g., GitLab VM backed up daily after business hours
- Took disk snapshots before changes and for restore drills
- Stored backups in Recovery Services Vault with defined retention
Storage Accounts:
- Used Azure Blob Storage for application data and file storage
- Stored Terraform remote state in Azure Blob — team-shared, version-controlled infrastructure
- Configured appropriate access with RBAC and private access where needed
So yes — full lifecycle: provision VMs → backup/snapshot → cost optimize → migrate to AKS when ready."
"Yes — I've configured both Azure Backup and Azure Site Recovery (ASR), plus AKS-specific backup with Velero.
Azure Backup (VMs & Disks):
- Point-in-time restore with scheduled backup policies
- Retention: typically 30–90 days based on compliance and RPO requirements
- Frequency: daily by default for VM backup policies; weekly where the change rate is low, and several backups per day (Enhanced policy) for high-change or critical workloads
- Stored in Recovery Services Vault (Microsoft-managed — not a Storage Account we pick)
- Used for VMs (e.g., self-hosted GitLab) and managed disk backups for StatefulSet PVs
Azure Site Recovery (DR):
- Replicates VMs from primary region to secondary region
- On disaster → test failover or committed failover → services continue in DR region
- Combined with Azure Front Door / DNS cutover for traffic routing
AKS Backup (Velero):
- Namespace-level incremental backups — twice weekly
- Backup storage in Azure Blob Storage
- Can restore namespaces to same or another region
- PV/PVC (StatefulSets): disk-level backup via Azure Backup on underlying managed disks
Key Vault: Yes. Key Vault supports backup and restore of individual secrets, keys, and certificates (`az keyvault secret backup`, `az keyvault key backup`, `az keyvault certificate backup`). We also enable soft delete and purge protection for compliance.
Full DR strategy: Terraform recreates infrastructure in DR region → restore Velero (K8s resources) from Blob → restore disks/VMs from Azure Backup → ASR failover for replicated VMs → DNS/Front Door cutover."
"Yes — AKS is my primary platform at HHS. I own full cluster lifecycle for 30+ microservices across dev, stage, and production.
Provisioning: Clusters provisioned via Terraform — private cluster, VNet-integrated, Azure CNI, separate system and user node pools.
Day-to-day operations:
- Cluster and node pool upgrades (Kubernetes version, node image)
- Node pool management — standard pools for apps, Spot pools for CI/batch
- Ingress via AGIC + Application Gateway + Front Door
- Add-ons: Prometheus/Grafana, Velero, Key Vault CSI, OpenTelemetry collector
Security & identity:
- Private cluster — API server not public
- Workload Identity Federation + Managed Identity for ACR pull and Key Vault access
- Network policies, Azure Policy add-on, Pod Security Standards
Deployments: Helm via Azure DevOps pipelines + Argo CD GitOps sync
Backup: Velero namespace backups to Blob; disk-level backup for StatefulSet PVs
Troubleshooting: CrashLoopBackOff, ImagePullBackOff, HPA/Cluster Autoscaler tuning, performance issues — daily operational work."
- Private cluster + Terraform = senior signals
- Separate system vs user node pools
- Identity: Workload Identity not service principal secrets
- How do you upgrade AKS without downtime?
- System vs user node pool — why separate?
- How do pods pull images from ACR without credentials?
Pod: Smallest deployable unit in Kubernetes. One or more containers sharing network and storage. Ephemeral — if it dies, Kubernetes creates a new one.
Deployment: Controller that manages pod replicas, rolling updates, and rollbacks. You declare desired state (e.g., 3 replicas); Deployment keeps it.
Service: Stable network endpoint for pods (which have changing IPs). Routes traffic via label selectors. Types:
- ClusterIP — internal only (most common)
- NodePort — exposes on node port
- LoadBalancer — external access via cloud LB (e.g., Azure Load Balancer)
Namespace: Logical isolation within a cluster — separate dev/staging/prod, RBAC, quotas, and network policies.
One-liner: Pod runs containers → Deployment manages pods → Service gives stable address → Namespace isolates workloads.
Bonus (if asked): StatefulSet for stateful apps like databases — stable pod names (pod-0, pod-1), ordered deployment, and PersistentVolumes for storage."
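To make the label-selector idea concrete, here is a minimal sketch of how a Service picks its backing pods. It is a simplified model, not the real Kubernetes implementation; the pod names and labels are hypothetical.

```python
from typing import Dict, List


def select_pods(selector: Dict[str, str], pods: Dict[str, Dict[str, str]]) -> List[str]:
    """Return names of pods whose labels contain every key/value in the selector.

    Simplified model: equality-based selectors only. (In a real cluster, a
    Service with no selector gets no automatically managed endpoints.)
    """
    return sorted(
        name
        for name, labels in pods.items()
        if all(labels.get(key) == value for key, value in selector.items())
    )


# Hypothetical pods: names and labels are illustrative only.
pods = {
    "claims-api-7d9f-abc": {"app": "claims-api", "env": "prod"},
    "claims-api-7d9f-def": {"app": "claims-api", "env": "prod"},
    "billing-api-5c2a-xyz": {"app": "billing-api", "env": "prod"},
}

endpoints = select_pods({"app": "claims-api"}, pods)
print(endpoints)
```

Because the Service tracks labels rather than pod IPs, a replaced pod with the same labels is picked up without any change to the Service.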
"My primary CI/CD platform is Azure DevOps — pipelines for 30+ microservices at HHS. I also have experience with Jenkins (OWASP ZAP DAST integration) and Argo CD for GitOps.
Azure DevOps (main):
- CI: Build → test → GitLeaks (secrets) → Trivy (vulnerabilities) → SonarQube (quality gate) → Docker build → push to ACR
- CD: Promote same image tag (buildId + git SHA) through Dev → Stage → Prod with manual approval on production
- Deploy to AKS via Helm; service connections use Workload Identity Federation
Argo CD (GitOps):
- Watches Git repo for manifest/Helm value changes → auto-syncs to cluster
- Separates what runs (Git) from how it's built (pipeline)
Terraform (IaC pipeline):
- Separate pipeline: `terraform plan` on PR, `terraform apply` on merge
- Provisions VNet, AKS, Key Vault, App Gateway, etc.
Jenkins: Used for application DAST scanning with OWASP ZAP and reporting to dev team.
Image tagging rule: Never `latest` in production; use immutable tags by git SHA."
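The tagging rule can be sketched as two small helpers: one builds an immutable tag from a build ID and git SHA, the other rejects mutable image references before deployment. The function names, the 7-character short SHA, and the registry name are assumptions for illustration, not the project's actual pipeline code.

```python
import re

_SHA_RE = re.compile(r"^[0-9a-f]{7,40}$")


def build_image_tag(build_id: int, git_sha: str, sha_len: int = 7) -> str:
    """Build an immutable tag of the form '<buildId>-<shortSHA>'."""
    sha = git_sha.strip().lower()
    if not _SHA_RE.match(sha):
        raise ValueError(f"not a git SHA: {git_sha!r}")
    return f"{build_id}-{sha[:sha_len]}"


def assert_deployable(image_ref: str) -> None:
    """Raise if the image reference is untagged or uses the mutable 'latest' tag."""
    _, sep, tag = image_ref.rpartition(":")
    # A '/' after the last ':' means the ':' belonged to a registry port, so no tag.
    if not sep or "/" in tag or tag == "latest":
        raise ValueError(f"mutable or untagged image reference: {image_ref}")


tag = build_image_tag(1234, "A1B2C3D4E5F60718293A4B5C6D7E8F9012345678")
image = f"example.azurecr.io/claims-api:{tag}"  # hypothetical registry and repo
assert_deployable(image)
print(image)
```

Running the check in the CD stage, before the Helm step, is one way to stop a `latest` tag from ever reaching production.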
| Tool | Role |
|---|---|
| Azure DevOps | CI/CD — build, scan, deploy |
| Argo CD | GitOps — cluster sync from Git |
| Terraform | IaC — infrastructure pipeline |
| Jenkins | DAST / legacy pipelines |
- Helm vs Argo CD — when use which?
- What happens if SonarQube gate fails?
- How do you secure pipeline service connections?
"We follow a promotion-based pipeline across Dev → Stage → Production with separate namespaces and environment-specific configuration.
1. Developer flow:
- Feature branch → PR → merge to `develop`
- CI pipeline triggers automatically
2. CI (build once, deploy many):
- Checkout → GitLeaks → Trivy filesystem scan → build → unit tests → SonarQube quality gate
- Docker build → Trivy image scan → push to ACR with tag `buildId-gitSHA`
- Same image is promoted; never rebuild per environment
3. CD — environment promotion:
- Dev: Auto-deploy on CI success — smoke test
- Stage/QA: Integration tests, regression, performance checks
- Production: Manual approval gate → deploy via Helm to AKS
- Argo CD syncs Git-managed manifests where GitOps is enabled
4. Configuration per environment:
- Helm values / ConfigMaps differ per env (replicas, endpoints, feature flags)
- Secrets from Key Vault via CSI driver — never in Git or pipeline YAML
5. Post-deploy verification:
- Health probes pass, Grafana dashboards clean, smoke test
- Rollback ready: `helm rollback` or redeploy last-known-good ACR tag
6. Infrastructure changes:
- Terraform in separate pipeline — plan on PR, apply on merge — never manual portal changes
Gov healthcare: Change windows, client notification for prod releases, audit trail in Azure DevOps."
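The "build once, promote the same tag" rule with a manual production gate can be modelled in a few lines. This is a sketch of the promotion logic only, under the assumption of a strict dev → stage → prod order; the `Release` type and `promote` function are hypothetical and do not call Azure DevOps or Helm.

```python
from dataclasses import dataclass, field
from typing import List

ENVIRONMENTS = ["dev", "stage", "prod"]  # promotion order


@dataclass
class Release:
    image_tag: str                      # built once by CI, never rebuilt
    deployed: List[str] = field(default_factory=list)


def promote(release: Release, target: str, approved: bool = False) -> str:
    """Promote the same image tag to the next environment, enforcing order and approval."""
    position = ENVIRONMENTS.index(target)
    missing = [env for env in ENVIRONMENTS[:position] if env not in release.deployed]
    if missing:
        raise RuntimeError(f"cannot deploy to {target}: not yet verified in {missing}")
    if target == "prod" and not approved:
        raise PermissionError("production requires manual approval")
    if target not in release.deployed:
        release.deployed.append(target)
    return f"deploy {release.image_tag} to {target}"


release = Release(image_tag="1234-a1b2c3d")  # hypothetical tag
steps = [promote(release, "dev"), promote(release, "stage"), promote(release, "prod", approved=True)]
print(steps)
```

The point to make in an interview is that the tag never changes between environments; only the Helm values and the approval state do.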
- Build once, promote same tag — critical senior principle
- Manual approval for production
- Secrets from Key Vault, not pipeline variables
- Terraform separate from app CD
- How do you handle database migrations across environments?
- What if Stage passes but Prod fails?
- How do you rollback production?
VNet (Virtual Network): A private, logically isolated network in Azure. You define an IP address range (CIDR), e.g., `10.0.0.0/16`, and divide it into subnets. Resources like AKS nodes, VMs, and Application Gateway get private IPs from these subnets. VNets can peer with other VNets and connect on-premises via VPN/ExpressRoute.
NSG (Network Security Group): A firewall rule set that controls inbound and outbound traffic. Attached to a subnet or network interface (NIC). Rules define source/destination IP, port, protocol, and allow/deny, and are evaluated in priority order (lowest number first; the first matching rule wins). Example: allow 443 from App Gateway subnet; deny all other inbound.
Together: VNet = network layout; NSG = traffic filtering. In our HHS project, AKS runs in a private subnet with NSGs controlling what traffic can reach the cluster.
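A small model helps when explaining how a VNet is carved into subnets and how NSG rules are evaluated by priority. The sketch below uses Python's standard `ipaddress` module; the subnet roles, rule values, and the simplified `Rule` type (source CIDR and destination port only) are assumptions for illustration, and real NSGs also carry Azure's built-in default rules.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network
from typing import List


@dataclass(frozen=True)
class Rule:
    priority: int   # lower number = evaluated first
    source: str     # source CIDR
    port: int       # destination port; 0 means any
    allow: bool


def evaluate(rules: List[Rule], src_ip: str, dst_port: int) -> bool:
    """Return True if traffic is allowed; the first matching rule by priority wins."""
    for rule in sorted(rules, key=lambda r: r.priority):
        if ip_address(src_ip) in ip_network(rule.source) and rule.port in (0, dst_port):
            return rule.allow
    return False  # nothing matched: treat as denied in this simplified model


# Carve the example VNet range into /24 subnets (roles are hypothetical).
vnet = ip_network("10.0.0.0/16")
subnets = list(vnet.subnets(new_prefix=24))
aks_subnet, appgw_subnet = subnets[0], subnets[1]

inbound_rules = [
    Rule(priority=100, source=str(appgw_subnet), port=443, allow=True),
    Rule(priority=4096, source="0.0.0.0/0", port=0, allow=False),
]

print(appgw_subnet, evaluate(inbound_rules, "10.0.1.10", 443))
```

Here only HTTPS from the Application Gateway subnet reaches the AKS subnet; everything else falls through to the explicit deny.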
"Public Endpoint: The Azure PaaS service is reachable over the public internet via a public hostname/IP — e.g.,
`myaccount.blob.core.windows.net`. Traffic traverses the internet. You rely on firewall rules, SAS tokens, or service-level auth.
Private Endpoint: A private IP from your VNet subnet is assigned to the PaaS service via Azure Private Link. Traffic stays on the Microsoft backbone and is never exposed to the public internet. DNS resolves to the private IP via a Private DNS zone.
When we use Private Endpoints (production at HHS):
- ACR — pods pull images without public registry access
- Key Vault — secrets accessed only from VNet
- Blob Storage, PostgreSQL, Redis Premium — no public exposure
Key differences:
| | Public Endpoint | Private Endpoint |
|---|---|---|
| Network path | Internet | Microsoft backbone → VNet |
| IP | Public Azure IP | Private IP in your subnet |
| DNS | Public FQDN | Private DNS zone required |
| Security | Firewall + auth rules | Network isolation by default |
| Use case | Dev/test, public APIs | Production, compliance (healthcare) |

Service Endpoint vs Private Endpoint: Service Endpoint routes VNet traffic to Azure service but the service still has a public endpoint. Private Endpoint is stronger isolation, with a dedicated private IP per service.
At HHS, all production PaaS uses Private Endpoints — required for government healthcare compliance."
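A quick way to demonstrate the DNS side of Private Endpoints is to check whether a service FQDN resolves to a private address from inside the VNet. The sketch below is one possible check using only the standard library; the function names and sample addresses are hypothetical, and the live lookup helper is shown but not called, since results depend on where it runs.

```python
import socket
from ipaddress import ip_address
from typing import Iterable, List


def resolve(hostname: str, port: int = 443) -> List[str]:
    """Resolve a hostname to its IP addresses using the system resolver."""
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


def resolves_privately(addresses: Iterable[str]) -> bool:
    """True only if at least one address was returned and all of them are private."""
    parsed = [ip_address(a) for a in addresses]
    return bool(parsed) and all(a.is_private for a in parsed)


# Illustrative values: a private IP from a Private Endpoint subnet vs. a public IP.
via_private_endpoint = resolves_privately(["10.0.3.5"])
via_public_endpoint = resolves_privately(["20.60.1.1"])
print(via_private_endpoint, via_public_endpoint)
```

From a pod or jump host inside the VNet, `resolves_privately(resolve("<vault-name>.vault.azure.net"))` returning False would suggest the Private DNS zone is not linked to the VNet or the record is missing.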
- Private Endpoint = Private Link + Private DNS zone
- Production healthcare → Private Endpoints on all PaaS
- Don't confuse Service Endpoint with Private Endpoint
- How does AKS pod resolve Key Vault DNS with Private Endpoint?
- Can you use both public and private on same storage account?
- What is Private Link?
"We use a three-pillar observability stack on Azure — metrics, logs, and traces:
Metrics — Prometheus + Grafana:
- Prometheus scrapes AKS pod/node metrics (CPU, memory, restarts, HTTP rates)
- Grafana dashboards per team/service with alert rules — e.g., pod restart count, latency P95, node disk pressure
- Alerts route to email/Teams; on-call checks dashboard and logs
Logs — Elasticsearch + Loki:
- Loki for Kubernetes pod logs; Elasticsearch for search and log parsing
- Track application events, errors, and audit trails for gov compliance
Traces — OpenTelemetry:
- OTel SDK in microservices exports traces end-to-end
- Identify latency bottlenecks across service dependencies (API → DB → cache)
Azure-native (platform layer):
- Azure Monitor + Log Analytics for platform logs and KQL queries
- Container Insights for AKS node/pod health
- Application Insights for APM where enabled
- Azure Service Health — rule out regional outages first
At HHS: On alert fire → check Grafana → drill Loki/Elasticsearch logs → OTel trace for latency → `kubectl` if K8s issue. MTTR improved by correlating all three pillars.
Innovation (if asked): We piloted a multi-agent LLM integration (Gemini/OpenAI API) that sends alert context + logs to the model and returns RCA hints and remediation steps. It assists engineers but does not replace human approval for prod changes."
| Tool | Your use |
|---|---|
| Prometheus | Metrics collection |
| Grafana | Dashboards + alerts |
| OpenTelemetry | Distributed traces, latency |
| LLM multi-agent | RCA + remediation advice from logs |
"Production issues are alert-driven. Grafana/Azure Monitor alerts fire for pod problems — CrashLoopBackOff, ImagePullBackOff, Evicted, Pending — or application-down alerts.
My 6-step process:
- Triage: check priority (P0/P1/P2), blast radius, how many users/states affected
- Correlate: Azure Service Health, recent deployments, config/Key Vault changes
- Observe: Grafana dashboards → Loki/Log Analytics logs → OpenTelemetry traces
- Isolate: `kubectl describe pod`, `kubectl get events`, `kubectl logs --previous`
- Mitigate first: rollback Helm release, fix image tag, scale nodes; restore service before root cause
- Document: update ticket with RCA, close after verification, post-mortem for P0/P1
Example (ImagePullBackOff on AKS):
- `kubectl describe pod` → check Events section
- Common causes: (1) wrong image tag not in ACR (2) AcrPull role missing on the kubelet Managed Identity (3) auth to private ACR failed (4) PE/DNS issue
- Fix: verify tag in ACR, check `AcrPull` RBAC, verify Workload Identity / service account, re-deploy
Example (app-to-app connectivity failure):
- `kubectl exec` into the calling pod → resolve and call the service DNS name (`nslookup`, `curl`), not the IP, since IPs change; ClusterIP services generally do not answer ICMP ping
- If network OK → check environment variables, ConfigMaps, Secrets from Key Vault CSI
- Check namespace ResourceQuota: if the quota is exceeded, new pods are rejected at admission (the ReplicaSet reports FailedCreate) rather than scheduled
For gov healthcare: Communicate status to client on P0/P1 every 15–30 min until resolved."
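The ImagePullBackOff triage above can be sketched as a first-pass classifier over the Events text from `kubectl describe pod`. The substrings and sample messages below are assumptions about what such events commonly contain, not an exhaustive or official list; treat the output as a hint for where to look first.

```python
from typing import List, Tuple

# (likely cause, substrings to look for) in order of checking.
# The substrings are assumed examples of typical event text.
_HINTS: List[Tuple[str, Tuple[str, ...]]] = [
    ("wrong or missing image tag in ACR", ("manifest unknown", "not found")),
    ("AcrPull role or registry auth problem", ("unauthorized", "401", "403")),
    ("Private Endpoint / DNS problem", ("no such host", "i/o timeout")),
]


def classify_pull_failure(event_message: str) -> str:
    """Map an image-pull event message to the most likely cause category."""
    text = event_message.lower()
    for cause, needles in _HINTS:
        if any(needle in text for needle in needles):
            return cause
    return "unclassified: read the full Events section"


# Hypothetical event messages for illustration.
samples = [
    'Failed to pull image "example.azurecr.io/claims-api:1234-a1b2c3d": manifest unknown',
    "failed to authorize: 401 Unauthorized",
    "dial tcp: lookup example.azurecr.io: no such host",
]
causes = [classify_pull_failure(s) for s in samples]
print(causes)
```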
Companies vary slightly — this is the standard enterprise model interviewers expect:
| Priority | Also called | Meaning | Response time | Who acts | Example (HHS healthcare) |
|---|---|---|---|---|---|
| P0 | Sev-1 / Critical | Complete outage — production down, no workaround | Immediate (< 15 min) | All hands, incident commander, war room | All state portals down, PHI system unreachable |
| P1 | Sev-2 / High | Major degradation — core feature broken, large user impact | < 30–60 min | On-call + platform lead + dev lead | One state can't submit claims; 50% error rate |
| P2 | Sev-3 / Medium | Partial impact — workaround exists, limited users | < 4 hours | On-call engineer | Single microservice down, other states OK |
| P3 | Sev-4 / Low | Minor — cosmetic, dev/test only, no prod impact | Next business day | Assign to backlog | Grafana dashboard broken, non-prod pod crash |
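The priority table can be expressed as a small decision function, which is a useful way to rehearse the "P1 vs P2" follow-up. The inputs, the 50% error-rate threshold (taken from the P1 example row), and the function name are assumptions for illustration; real incident policies vary by organization.

```python
from typing import Dict, Optional

# Upper bound of the response window in minutes, per the table; None = next business day.
RESPONSE_MINUTES: Dict[str, Optional[int]] = {"P0": 15, "P1": 60, "P2": 240, "P3": None}


def classify_incident(
    prod_impact: bool,
    full_outage: bool,
    workaround_exists: bool,
    core_feature_broken: bool = False,
    error_rate: float = 0.0,
) -> str:
    """Return P0-P3 following the table: outage first, then degradation, then partial impact."""
    if not prod_impact:
        return "P3"
    if full_outage and not workaround_exists:
        return "P0"
    if core_feature_broken or error_rate >= 0.5:
        return "P1"
    return "P2"


all_portals_down = classify_incident(prod_impact=True, full_outage=True, workaround_exists=False)
claims_failing = classify_incident(True, False, False, core_feature_broken=True, error_rate=0.5)
one_service_down = classify_incident(True, False, True)
nonprod_crash = classify_incident(False, False, True)
print(all_portals_down, claims_failing, one_service_down, nonprod_crash)
```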
"I've worked with several databases across our projects:
| Database | Use case | Environment pattern |
|---|---|---|
| MongoDB | Primary app datastore (HHS) | Self-hosted in Dev/QA (cost savings); MongoDB Atlas (or managed) in Production |
| PostgreSQL | Operational/relational data | Azure Database for PostgreSQL Flexible Server; used for prod→stage data restore pipelines |
| Azure SQL | Microsoft SQL workloads | Transactional apps, HA with zone redundancy |
| Cosmos DB | NoSQL / globally distributed | Low-latency, multi-region scenarios |

Environment strategy: Lower environments use lighter/cheaper DB (self-hosted MongoDB on AKS/VM); production uses fully managed PaaS with HA. We don't replicate prod infrastructure in Dev/QA for cost optimization.
HA & DR (platform layer — my responsibility):
- Enable zone-redundant HA on Azure PostgreSQL/SQL where required
- Geo-redundant backup and point-in-time restore (PITR) on PaaS databases
- Scheduled logical backups (pg_dump / mongodump) to Blob Storage for extra protection and cross-region recovery
- Private Endpoints for DB connectivity from AKS — no public exposure
I'm not a full-time DBA — I own provisioning, connectivity, backup, HA config, and restore pipelines; DBAs/dev teams own query tuning."
"We had a dedicated DBA team, but as a platform engineer I owned the infrastructure and operational side of databases:
Provisioning & configuration:
- Created and configured PostgreSQL Flexible Server, Cosmos DB, and MySQL via Terraform
- Sized instances based on workload — vCores, storage, HA mode (zone-redundant)
- Enabled Private Endpoints, disabled public access, configured Private DNS Zones
Connectivity & identity:
- Troubleshot connection issues between AKS pods and databases
- Set up Managed Identity and service accounts for apps to authenticate without passwords in code
- Stored connection strings in Key Vault; apps pull via Key Vault CSI driver
Backup & environment sync:
- Built Azure DevOps pipelines to dump non-critical prod data and restore to Dev/Stage — so developers work with realistic data while masking sensitive fields
- Enabled native PITR on PaaS DBs; additional pg_dump/mongodump to Blob for pipeline-driven restores
Performance & cost (with DBA team):
- Monitored vCore/DTU utilization, connection count, slow query logs in Azure Portal
- Right-sized instances based on metrics — scale up/down for cost optimization
- Escalated query tuning and indexing to DBAs — my role stops at platform layer
Boundary: I own infra, access, backup, connectivity, sizing — DBAs own schema, queries, indexes."
"Yes — regularly. At HHS, I was on-call for a healthcare platform serving US state governments. Incidents are alert-driven and prioritized P0/P1/P2/P3.
My incident process:
- Triage: severity, blast radius, users affected
- War room: Teams channel for P0/P1, assign incident commander
- Mitigate first: rollback, scale, fix config; restore service before deep RCA
- Investigate: Grafana → logs → OTel traces → `kubectl describe` / `kubectl logs`
- Communicate: client status updates every 15–30 min on P0/P1
- Post-mortem: RCA document, action items, alert tuning
Real example — ImagePullBackOff (P1):
- Situation: After Terraform apply, multiple pods entered ImagePullBackOff — microservices couldn't pull from ACR
- Action: `kubectl describe pod` → missing AcrPull role on Managed Identity after identity recreation → fixed RBAC in Terraform → redeployed
- Result: Restored in ~10 minutes; added validation for MI permissions post-Terraform changes
Other incidents handled: CrashLoopBackOff (bad config/secret), pod eviction (resource pressure), connectivity failures after Key Vault rotation, slow AKS apps (DB/ingress bottlenecks).
Follow-the-sun: Onshore US + offshore India — handoff notes in ticket, clear ownership per incident."
- Mitigate first, RCA later — senior signal
- Use real STAR example (ImagePullBackOff)
- Gov healthcare → client communication on P0/P1
- Walk me through your ImagePullBackOff incident (STAR)
- P1 vs P2 — how do you decide?
- What goes in a post-mortem?
"Yes — extensively. At HHS, we support US state government clients (Florida, Iowa, South Dakota) with a follow-the-sun model — onshore US team + offshore India team.
Cross-functional coordination:
- Onshore (US): client-facing, business context, production access, stakeholder updates
- Offshore (India): platform engineering, AKS ops, CI/CD, Terraform, incident troubleshooting
- Developers: bug identification, app logs, env var / config fixes
- DBA team: database performance, schema, data masking
- Security: VA/PT remediation, compliance for gov healthcare
How we work:
- Join Teams war rooms for P0/P1 — even if not our component, we contribute until resolved
- Status updates every 15 minutes during critical incidents
- I have led incident bridge calls — coordinate onshore + offshore + dev
- For new infrastructure or cost optimization — we propose, onshore validates with client
- Release windows coordinated across time zones — US team approves prod changes
Example: During the ImagePullBackOff P1, I led troubleshooting on the bridge call. Offshore fixed the Terraform RBAC while onshore kept the client informed until the issue was resolved in about 10 minutes.
Strong rapport built over years — clear communication is as important as technical skills in gov healthcare."
Overview: HHS healthcare platform — 30+ microservices on private AKS, provisioned via Terraform, US state government clients.
Traffic flow:
User (Browser → URL) → Azure DNS → Azure Front Door (global routing to nearest healthy region) → Application Gateway (WAF + SSL) → AGIC (Application Gateway Ingress Controller) → Ingress → Services → Pods → Databases (PostgreSQL / MongoDB)
Networking: Everything in a private VNet with dedicated subnets: AKS nodes, App Gateway, and Private Endpoints in separate subnets. No public exposure on AKS. Azure Bastion for secure engineer access (no public SSH/RDP).
Integrations: ACR + Key Vault via Managed Identity and Workload Identity Federation. Private Endpoints on PaaS services. Azure DevOps CI/CD → ACR → Helm → AKS.
Observability: Prometheus, Grafana, OpenTelemetry, Loki/Elasticsearch.
Hub-spoke: Our current project uses a single VNet architecture (not hub-spoke). I understand hub-spoke conceptually — hub centralizes firewall/VPN/Bastion; spokes connect via VNet peering; on-premises via ExpressRoute/VPN to hub. Ready to implement if required."
"I follow a layered checklist — platform first, then network, then OS:
1. Azure platform (Portal):
- VM power state — running or stopped?
- Resource Health + Azure Service Health — region outage?
- Activity Log — recent NSG change, patch, reboot?
2. DNS (if using hostname):
- `nslookup`: does the FQDN resolve to the correct public IP?
- Verify Azure DNS A-record matches assigned IP
3. Networking:
- NSG rules — inbound 22/3389 or app port allowed?
- ASG (Application Security Group) rules if used
- Azure Firewall rules if traffic routes through firewall
- Effective routes — traffic going where expected?
- Public IP assigned and associated to NIC?
- Network Watcher → Connection Troubleshoot / IP Flow Verify
4. Access VM (when network looks OK):
- Azure Bastion — connect without public IP
- Serial Console + Boot Diagnostics — check boot logs if SSH/RDP fails (disk full, fstab error, kernel panic)
- VMAccess extension — reset SSH key or local admin password
5. Metrics (Azure Monitor):
- CPU, memory, disk space (>90% can cause failures), network I/O
- High CPU or full disk often explains "unreachable" symptoms
6. Recovery:
- Restore from Azure Backup snapshot
- Attach OS disk to rescue VM if corrupted
- Redeploy VM keeping data disk
Root cause examples I've seen: NSG rule change, disk full, failed patch, wrong DNS mapping, expired SSH key."
"DR testing validates our RTO/RPO commitments — we run both tabletop exercises and technical failover tests on a scheduled cadence (quarterly/annually per compliance).
1. Plan & scope:
- Define test scenario — regional outage, datacenter failure, ransomware recovery
- Document expected RTO (how fast we recover) and RPO (how much data we can lose)
- Notify stakeholders — test window, expected impact (ideally non-prod first)
2. Azure Site Recovery (ASR) — VM DR test:
- Run Test Failover (non-destructive) — VMs spin up in DR region on isolated network
- Validate application starts, connectivity, data consistency
- Cleanup test failover — does not affect production replication
- Document actual RTO vs target
3. Azure Backup restore test:
- Restore a VM or managed disk from Recovery Services Vault to isolated resource group
- Verify data integrity and restore time
- Test Key Vault backup/restore for secrets recovery
4. AKS DR test (Velero):
- Restore namespace from Velero backup in Blob to same or DR cluster
- Validate pods, services, ingress come up
- Test StatefulSet PV restore from disk-level Azure Backup
5. Infrastructure DR test (Terraform):
- `terraform apply` in DR region from version-controlled code
- Validates IaC can recreate full environment
6. Traffic cutover test:
- Front Door / DNS failover to DR region endpoint
- Validate end-to-end user flow
7. Post-test:
- Document results, gaps, action items
- Update runbooks with actual timings
- Report to compliance/client for gov healthcare audit trail
At HHS: Combined ASR test failover + Velero namespace restore + Terraform DR apply — full stack validation."
| Test type | Tool | Non-destructive? |
|---|---|---|
| VM failover | ASR Test Failover | Yes — isolated network |
| K8s restore | Velero | Yes — restore to test cluster |
| Disk/VM restore | Azure Backup | Yes — isolated RG |
| Infra recreate | Terraform | Yes — DR region/subscription |
- Difference between ASR test failover and committed failover?
- What is RTO vs RPO?
- How do you DR test AKS without affecting production?
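To make "document actual RTO vs target" concrete, here is a minimal Python sketch of how a test run could be scored. The `DrTestResult` class, its field names, and the sample numbers are all hypothetical; they illustrate the comparison and are not part of any Azure tooling.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class DrTestResult:
    """Measured outcome of one DR test run (illustrative names only)."""
    scenario: str
    target_rto: timedelta   # how fast we committed to recover
    actual_rto: timedelta   # how long the test failover actually took
    target_rpo: timedelta   # how much data loss we committed to tolerate
    actual_rpo: timedelta   # age of the newest recovery point we restored


def evaluate(result: DrTestResult) -> list[str]:
    """Return a list of gaps; an empty list means both targets were met."""
    gaps = []
    if result.actual_rto > result.target_rto:
        gaps.append(f"RTO missed by {result.actual_rto - result.target_rto}")
    if result.actual_rpo > result.target_rpo:
        gaps.append(f"RPO missed by {result.actual_rpo - result.target_rpo}")
    return gaps


if __name__ == "__main__":
    run = DrTestResult(
        scenario="regional outage",
        target_rto=timedelta(hours=4),
        actual_rto=timedelta(hours=5),
        target_rpo=timedelta(minutes=15),
        actual_rpo=timedelta(minutes=10),
    )
    print(evaluate(run))  # one RTO gap, RPO within target
```

Any gap returned here is what goes into the post-test action items and the updated runbook timings.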
1. Control Plane (Microsoft-managed; no charge on the Free tier, while the Standard tier adds a paid uptime SLA):
- API Server: all `kubectl`/Helm commands go here; it is the front end that reads and writes cluster state in etcd
- etcd: distributed key-value store for cluster state
- Scheduler — assigns pods to nodes
- Controller Manager — maintains desired state (replicas, endpoints)
- We never manage these — Azure handles HA, patching, upgrades
2. Node Pools (we manage — VM Scale Sets):
- System node pool: runs `kube-system` pods (CoreDNS, metrics-server, monitoring agents). Keep stable; don't run app workloads here
- User node pools, segregated by workload:
- Normal workloads — standard VM SKUs
- Spot node pool — cheaper, fault-tolerant batch/CI jobs
- GPU node pool — ML/GPU workloads (if needed)
- Each node runs: kubelet, kube-proxy, containerd (container runtime)
3. DaemonSets on every node:
- CNI plugin (Azure CNI) — assigns pod IPs from VNet subnet; VNet integration
- CSI drivers — connect pods to Azure services:
- Disk CSI → Azure Managed Disks (PVs)
- Key Vault CSI → secrets as volumes
- ACR pull via Managed Identity (not CSI but identity layer)
4. Networking & Ingress:
- Private AKS in VNet subnets — no public API or nodes
- AGIC — Application Gateway Ingress Controller routes external traffic to pods
- Network policies — restrict pod-to-pod traffic
5. Identity & integrations:
- Managed Identity on cluster + Workload Identity Federation for pods
- ACR image pull, Key Vault secrets — no credentials in code
6. Add-ons we run: Prometheus/Grafana, Velero, AGIC, Key Vault CSI, OpenTelemetry collector"
Pre-upgrade (planning):
- Review Kubernetes release notes — deprecated APIs, breaking changes
- Check compatibility: Helm charts, Argo CD, ingress (AGIC), monitoring addons
- Run `kubectl convert` / API deprecation checks on manifests
- Confirm application runtimes support target K8s version; coordinate with developers
- Upgrade non-prod first; wait 1–2 weeks before production
Backup before upgrade:
- Velero — namespace-level backup to Blob Storage
- Disk snapshots — PVs / StatefulSets
- Terraform (IaC) ready — can spin up new cluster in same or DR region if catastrophic failure
- Drift detection pipeline — runs frequently; if infra drift detected, reconcile Terraform code before upgrade
Upgrade execution (order matters):
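- Control plane first: `az aks upgrade --kubernetes-version x.y.z --control-plane-only` (without `--control-plane-only` the command also upgrades every node pool); API server may be briefly unavailable
- Node pools one by one with `az aks nodepool upgrade`, never all at once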
- Check PodDisruptionBudgets (PDB) and replica counts before drain
- Use surge settings (`maxSurge: 33%`): surge nodes added → old node cordoned → drained → verify pods healthy
- Only proceed to next node pool when all pods are running
- Add-ons / plugins last — AGIC, Key Vault CSI, Prometheus, Velero — upgrade to compatible versions
Post-upgrade validation:
- `kubectl get nodes`: all Ready on new version
- Smoke tests, Grafana dashboards clean, no CrashLoopBackOff
- Rollback plan: new node pool on old K8s version + migrate workloads if needed
Azure supports last 3 K8s versions — upgrade within support window before version goes end-of-life."
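The "check PDBs and replica counts before drain" step can be reasoned about with simple arithmetic. The sketch below assumes PDBs expressed as `minAvailable` in absolute pod counts; the function names and the sample workloads are hypothetical.

```python
def allowed_disruptions(healthy_pods: int, min_available: int) -> int:
    """Pods that may be evicted right now under a minAvailable PDB."""
    return max(0, healthy_pods - min_available)


def drain_blockers(workloads: dict[str, tuple[int, int]]) -> list[str]:
    """Return workloads whose PDB currently allows zero evictions.

    workloads maps a name to (healthy_pods, min_available).
    """
    return [
        name
        for name, (healthy, min_available) in workloads.items()
        if allowed_disruptions(healthy, min_available) == 0
    ]


if __name__ == "__main__":
    # "worker" runs a single replica with minAvailable: 1, so a drain would hang on it.
    print(drain_blockers({"api": (3, 2), "worker": (1, 1)}))
```

A workload listed here needs more replicas (or a relaxed PDB) before the node pool upgrade, otherwise the drain waits until it times out.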
Alert fires in Grafana — pod restart count or CrashLoopBackOff state.
Step 1 — Describe pod (always first):
- `kubectl describe pod <pod-name> -n <namespace>`
- `kubectl get events -n <namespace> --sort-by='.lastTimestamp'`
- Check Events section: exit code, OOMKilled, probe failure, CreateContainerConfigError
Step 2 — Previous container logs (if describe unclear):
- `kubectl logs <pod-name> -n <namespace> --previous`
- `--previous` is critical: the current container may not have logs yet; the previous container shows why the last crash happened
Step 3 - Identify root cause and fix:
| Cause | What to check | Fix |
|---|---|---|
| App crash | `--previous` logs, stack trace | Fix code/config, rollback Helm release |
| Missing env vars / secrets | describe → CreateContainerConfigError | ConfigMap, Secret, Key Vault CSI mount |
| OOMKilled | describe → OOMKilled, `kubectl top pod` | Increase memory limits, fix memory leak |
| Liveness probe too aggressive | describe → probe failures | Increase initialDelaySeconds, failureThreshold |
| Wrong image / tag | describe → ErrImagePull (different from CrashLoop) | Fix image tag in deployment |
| Init container failure | describe → init container status | kubectl logs <pod> -c <init-container> |
| Permission / securityContext | describe → permission denied | Fix runAsUser, fsGroup, volume mounts |
Step 4 — Mitigate:
`helm rollback` or fix manifest → redeploy → verify pod Running in Grafana
Real example (Q17): ImagePullBackOff was auth issue; different state but same describe/logs workflow."
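The cause table above is essentially a lookup from a signal in the `kubectl describe` output to a first fix. A minimal sketch of that lookup follows; the `SIGNALS` mapping and the `triage` helper are hypothetical, and the matching is a plain substring search rather than a parser for real `kubectl` output.

```python
# Signal text (lowercased) mapped to the first fix to try, taken from the table above.
SIGNALS = {
    "oomkilled": "Increase memory limits or fix the memory leak",
    "createcontainerconfigerror": "Check ConfigMap, Secret, or Key Vault CSI mount",
    "errimagepull": "Fix the image tag in the Deployment",
    "liveness probe failed": "Increase initialDelaySeconds or failureThreshold",
    "permission denied": "Fix runAsUser, fsGroup, or volume mounts",
}


def triage(describe_output: str) -> list[str]:
    """Return suggested fixes for every known signal found in the text."""
    text = describe_output.lower()
    return [fix for signal, fix in SIGNALS.items() if signal in text]


if __name__ == "__main__":
    sample = "Last State: Terminated\n  Reason: OOMKilled\n  Exit Code: 137"
    print(triage(sample))
```

An empty result means the describe output was inconclusive, which is the cue to move on to `--previous` logs.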
HPA (Horizontal Pod Autoscaler):
- Scales number of pods in a Deployment/StatefulSet horizontally
- Metrics Server (or Prometheus adapter) collects CPU/memory metrics
- HPA compares average metric vs target threshold — e.g., CPU > 70%
- Scales up (add pods) or down (remove pods) automatically
- Scope: single workload (one Deployment)
Cluster Autoscaler:
- Scales number of nodes in a node pool
- Triggers when pods are Pending — cannot be scheduled because no node has enough CPU/memory
- Adds nodes to node pool; removes underutilized nodes when pods can be rescheduled elsewhere
- Scope: entire node pool
How they work together:
Traffic spike → HPA adds more pods → Pods go Pending (no node capacity) → Cluster Autoscaler adds nodes → Pods scheduled → service handles load
Bonus: KEDA for event-driven scaling (e.g., Azure Service Bus queue depth, HTTP rate), beyond CPU/memory."
- HPA scales pods; Cluster Autoscaler scales nodes — different layers
- HPA uses Metrics Server (CPU/memory) or Prometheus adapter / KEDA for custom metrics
- Cluster Autoscaler triggers on Pending pods (unschedulable), not node CPU utilization graphs
- Flow: traffic ↑ → HPA adds pods → pods Pending → CA adds nodes
- Configure `minReplicas`/`maxReplicas` (HPA) and `min-count`/`max-count` (node pool + CA)
- KEDA: event-driven scale (Azure Service Bus queue depth, HTTP rate)
| Follow-up | Answer |
|---|---|
| HPA not scaling? | Metrics Server missing, wrong apiVersion, resource requests not set on pods (utilization targets are a percentage of requests) |
| CA not adding nodes? | Pod limits, max node count reached, CA not enabled on pool |
| HPA vs VPA? | HPA = horizontal (more pods); VPA = vertical (bigger pod resources) |
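The "compare average metric vs target" step follows the formula in the Kubernetes HPA documentation: desired = ceil(current × currentMetric / targetMetric), skipped when the ratio is within a tolerance (0.1 by default). A minimal sketch, with hypothetical function and parameter names:

```python
import math


def desired_replicas(
    current: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int,
    max_replicas: int,
    tolerance: float = 0.1,
) -> int:
    """Replica count the HPA would aim for, clamped to min/max."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current  # close enough to target: no scaling
    desired = math.ceil(current * ratio)
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # 4 pods averaging 90% CPU against a 70% target.
    print(desired_replicas(4, 90, 70, min_replicas=2, max_replicas=10))
```

If the result exceeds what the current nodes can hold, the extra pods go Pending, which is the trigger for Cluster Autoscaler.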
1. Triage — define scope:
- All users or one state? All APIs or one endpoint? When did it start? Recent deployment?
2. Observability (your approach — expand):
- Grafana dashboards — latency P95/P99, error rate, pod restart spikes
- `kubectl top pods`: CPU/memory per pod; throttling?
- `kubectl top nodes`: node pressure?
- Namespace + pod events: `kubectl get events -n <ns> --sort-by='.lastTimestamp'`
- Pod logs: `kubectl logs <pod> -f` for errors, timeouts
- OpenTelemetry traces: which dependency is slow? (API → DB → cache)
3. Layer by layer:
| Layer | Check | Fix |
|---|---|---|
| Ingress | App Gateway / Front Door latency, 5xx | WAF rule, backend pool health |
| Pods | CPU/mem high → resource issue | Scale HPA, increase limits, add nodes (Cluster Autoscaler) |
| App | Logs — exceptions, thread pool exhaustion | Rollback deploy, fix code |
| Database | PostgreSQL vCore/connections, slow queries | Scale DB, connection pool, escalate to DBA |
| Cache | Redis hit ratio, evictions | Increase Redis memory |
| Network | Private Endpoint DNS latency | Fix DNS, NSG |
| External API | OTel dependency span slow | Contact vendor, add timeout/circuit breaker |
4. Mitigate first: Scale HPA, add nodes, rollback bad Helm release — restore SLA, RCA after.
5. Your quick path (valid for interview start):
- Grafana alert → `kubectl top pod` → events → logs
- CPU/mem high? → HPA scale / increase limits
- DB slow in traces? → check PostgreSQL metrics
- Recent deploy? → `helm rollback`"
- Start with scope (who, when, which API) and correlate with deploy
- Grafana + `kubectl top` + events + logs + OpenTelemetry traces
- Slowness is often database or ingress, not always pod CPU
- Mitigate first: HPA scale, helm rollback, DB scale — then RCA
| Follow-up | Answer |
|---|---|
| How find slow dependency? | OpenTelemetry trace waterfall — longest span |
| CPU normal but slow? | Check DB connections, external APIs, thread pool |
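"Longest span in the trace waterfall" can be shown in a few lines. This sketch assumes spans have already been exported as plain dictionaries with hypothetical keys (`name`, `parent`, `start_ms`, `end_ms`); it does not use the OpenTelemetry SDK.

```python
def slowest_dependency(spans: list[dict]) -> dict:
    """Return the non-root span with the longest duration."""
    children = [s for s in spans if s["parent"] is not None]
    return max(children, key=lambda s: s["end_ms"] - s["start_ms"])


if __name__ == "__main__":
    trace = [
        {"name": "GET /claims", "parent": None, "start_ms": 0, "end_ms": 950},
        {"name": "redis GET", "parent": "GET /claims", "start_ms": 5, "end_ms": 12},
        {"name": "postgres SELECT", "parent": "GET /claims", "start_ms": 15, "end_ms": 900},
        {"name": "vendor API", "parent": "GET /claims", "start_ms": 905, "end_ms": 940},
    ]
    print(slowest_dependency(trace)["name"])
```

With nested spans the longest child may simply wrap the real bottleneck, so in practice you keep drilling into that span's own children.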
- VNet: Top-level network boundary (e.g., `10.0.0.0/16`)
- Subnet: Segments within VNet (e.g., `10.0.1.0/24` for web, `10.0.2.0/24` for data). Resources get IPs from subnet range.
- NSG: Attached to subnet or NIC; filters traffic (allow/deny) based on 5-tuple rules. Evaluated by priority.
- Route Table (UDR): Attached to subnet — controls where traffic is routed (default system routes vs. custom: force tunnel through Firewall, route to NVA, BGP from ExpressRoute).
Flow example: Pod in AKS → subnet route table sends egress to Azure Firewall → NSG on subnet allows outbound 443 → Firewall applies application rules → internet.
Effective routes + effective security rules in Portal show the actual result of all combined rules.
- VNet = address space; Subnet = segment; resources get IPs from subnet
- NSG = allow/deny traffic (L4); attached to subnet or NIC; rules by priority (5-tuple)
- Route Table (UDR) = where packets go; overrides system routes
- NAT Gateway / Firewall = outbound internet for private resources (SNAT / inspection)
- NSG filters — Route Table routes — do not confuse the two
- Evaluate Effective routes + Effective security rules in Portal
Inbound: Browser → DNS → Front Door → App Gateway → AGIC → Ingress → Service → Pod
Outbound: Pod → Route Table → NSG → Firewall/NAT Gateway → Internet (NOT through AGIC)
| Follow-up | Answer |
|---|---|
| NSG vs Firewall? | NSG = distributed L4; Firewall = centralized L3–L7 + threat intel |
| AKS pod egress? | Azure CNI assigns pod IPs — UDR on pod subnet controls egress path |
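"Rules by priority (5-tuple)" means the lowest priority number that matches decides, and nothing after it is consulted. The sketch below models only that behavior: the `Rule` class is hypothetical, ports are a single value or `*`, and Azure's built-in default rules are reduced to a final deny.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network


@dataclass
class Rule:
    priority: int      # lower number is evaluated first
    action: str        # "Allow" or "Deny"
    protocol: str      # "Tcp", "Udp", or "*"
    source: str        # CIDR or "*"
    source_port: str   # single port or "*"
    dest: str          # CIDR or "*"
    dest_port: str     # single port or "*"


def _ip_ok(pattern: str, ip: str) -> bool:
    return pattern == "*" or ip_address(ip) in ip_network(pattern)


def _port_ok(pattern: str, port: int) -> bool:
    return pattern == "*" or int(pattern) == port


def evaluate(rules, protocol, src_ip, src_port, dst_ip, dst_port) -> str:
    """First matching rule in priority order wins; no match means Deny."""
    for rule in sorted(rules, key=lambda r: r.priority):
        if (
            rule.protocol in ("*", protocol)
            and _ip_ok(rule.source, src_ip)
            and _port_ok(rule.source_port, src_port)
            and _ip_ok(rule.dest, dst_ip)
            and _port_ok(rule.dest_port, dst_port)
        ):
            return rule.action
    return "Deny"


if __name__ == "__main__":
    rules = [
        Rule(200, "Deny", "*", "*", "*", "10.0.2.0/24", "*"),
        Rule(100, "Allow", "Tcp", "10.0.1.0/24", "*", "10.0.2.0/24", "5432"),
    ]
    print(evaluate(rules, "Tcp", "10.0.1.4", 50000, "10.0.2.5", 5432))
```

Note that this only answers "is the packet allowed"; where the packet goes next is the route table's job.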
| Aspect | Site-to-Site VPN | ExpressRoute |
|---|---|---|
| Path | Encrypted over public internet (IPsec/IKE) | Private dedicated connection via connectivity provider |
| Bandwidth | ~1.25 Gbps on mid-range VpnGw SKUs; up to ~10 Gbps aggregate on the largest | 50 Mbps to 100 Gbps (the top speeds require ExpressRoute Direct) |
| Latency | Variable | Consistent, lower |
| Cost | Lower | Higher (provider + port fees) |
| Use case | Branch offices, dev/test, backup | Enterprise production, compliance, high throughput |
"In hybrid setups I often use ExpressRoute as primary and VPN as backup (ExpressRoute + VPN coexistence). BGP propagates on-premises routes into Azure VNet. For hub-spoke, route tables in spokes point to hub NVA/Firewall for inspection."
- VPN (Site-to-Site): Public internet + IPsec/IKE tunnel → VPN Gateway on VNet
- ExpressRoute: Private dedicated circuit via connectivity provider → ExpressRoute Gateway on VNet
- Do not mix gateways — VPN Gateway ≠ ExpressRoute Gateway
- VPN: lower cost, lower bandwidth (~1.25 Gbps on mid-range SKUs, up to ~10 Gbps aggregate on the largest), good for branch/dev/failover
- ExpressRoute: higher cost, 50 Mbps–100 Gbps, consistent latency, compliance/production
- Together: ExpressRoute primary + VPN backup if ExpressRoute fails
- BGP exchanges routes between on-prem and Azure automatically
| Follow-up | Answer |
|---|---|
| ExpressRoute vs VPN for healthcare gov? | ExpressRoute for prod/compliance; VPN as backup |
| What is BGP? | Dynamic route exchange — on-prem networks advertised to Azure and vice versa |
Our Azure DevOps architecture has CI and CD separated.
CI: Trigger on merge to `develop` → checkout → GitLeaks → Trivy FS → build → unit tests → SonarQube quality gate → Docker build → Trivy image scan → push to ACR with tag (buildId + git SHA).
CD: Promote same image tag through Dev (auto) → QA (tests) → Prod (manual approval). Deploy to AKS via Helm. Service connections use Workload Identity Federation, so no secrets in YAML.
Argo CD handles GitOps where Helm values/manifests in Git auto-sync to cluster. Terraform runs in a separate pipeline for infrastructure.
- CI = build, test, scan, publish image | CD = deploy per environment
- Never deploy `latest` to Prod; promote same tested ACR tag
- GitLeaks (secrets) → Trivy (vulns) → SonarQube (quality gate) = DevSecOps chain
- Helm = pipeline deploys | Argo CD = Git changes auto-sync (GitOps)
- Terraform = separate IaC pipeline (plan on PR, apply on merge)
- Workload Identity Federation for service connections — no PATs/secrets in YAML
| Follow-up | Answer |
|---|---|
| SonarQube gate fails? | Pipeline stops — fix code or get approved exception |
| Helm vs Argo CD? | Helm = pipeline pushes; Argo CD watches Git repo and reconciles cluster state |
| How secure ACR push? | Service connection + Managed Identity / Workload Identity Federation |
Blue-Green: Two identical environments — Blue = live, Green = new version. On deploy, switch all traffic instantly from Blue to Green. If issues → switch back to Blue immediately. Fast rollback, but 2x cost during deploy.
Canary: Same cluster, two versions running. Route traffic gradually by percentage (5% → 25% → 50% → 100%) to the new version. Monitor metrics/errors. If problems → stop canary (only small % affected). On AKS: Ingress weights, Flagger, or Argo Rollouts.
Rolling update (AKS default): Replace pods incrementally. Control with `maxSurge`/`maxUnavailable`. A PDB protects minimum availability during node drains. Rollback: `kubectl rollout undo` or `helm rollback`.
- Blue-Green = instant switch (not split traffic)
- Canary = percentage gradual (not two full environments)
- Rolling = default K8s strategy — pod-by-pod replacement
- Blue-Green rollback = switch traffic back | Rolling = `rollout undo` | Canary = set canary weight to 0%
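The canary progression (5% → 25% → 50% → 100%, abort on errors) can be sketched as a loop. This is an illustration of the control flow only: `run_canary` and the `error_rate_at` callback are hypothetical, and tools such as Flagger or Argo Rollouts implement this with their own analysis configuration.

```python
from typing import Callable


def run_canary(
    weights: list[int],
    error_rate_at: Callable[[int], float],
    max_error_rate: float,
) -> int:
    """Return the final canary weight: the last step on success, 0 on abort."""
    for weight in weights:
        # error_rate_at stands in for "shift traffic, wait, read the metric".
        if error_rate_at(weight) > max_error_rate:
            return 0  # abort: all traffic goes back to the stable version
    return weights[-1]


if __name__ == "__main__":
    # Errors appear once the canary takes 25% of traffic, so the rollout aborts.
    print(run_canary([5, 25, 50, 100], lambda w: 0.05 if w >= 25 else 0.001, 0.01))
```

The point to make in an interview is the blast radius: the abort above happens while at most 25% of users are on the new version.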
Application rollback (AKS):
- `helm rollback <release> <revision> -n <namespace>`
- or `kubectl rollout undo deployment/<name> -n <namespace>`
Pipeline rollback: Redeploy last known-good build from ACR (immutable tags by git SHA, never `latest` in prod).
Database: Backward-compatible migrations only; use expand-contract pattern; rollback app before rolling back schema.
Infrastructure: check out the previous known-good configuration from version control and run `terraform apply`; never manual portal changes.
Process: Rollback is pre-documented in runbook; triggered when error rate > threshold for 5 min post-deploy; incident channel notified; post-mortem required."
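The trigger "error rate > threshold for 5 min post-deploy" is a sustained-breach check, not a single-sample check. A minimal sketch, assuming one error-rate sample per minute; the function name and sample values are hypothetical.

```python
def should_rollback(error_rates: list[float], threshold: float, window: int = 5) -> bool:
    """True if the error rate stayed above threshold for `window` consecutive samples."""
    consecutive = 0
    for rate in error_rates:
        consecutive = consecutive + 1 if rate > threshold else 0
        if consecutive >= window:
            return True
    return False


if __name__ == "__main__":
    post_deploy = [0.01, 0.06, 0.07, 0.08, 0.09, 0.06]  # one sample per minute
    print(should_rollback(post_deploy, threshold=0.05))
```

Requiring consecutive samples keeps a single noisy minute from triggering a rollback, which matches the "every alert is actionable" principle.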
"We run a three-pillar observability stack on AKS — metrics, logs, and traces — integrated through Grafana as the single pane of glass.
Metrics — Prometheus + Grafana:
- Prometheus scrapes AKS node and pod metrics — CPU, memory, restarts, HTTP error rates
- Grafana dashboards per service with alert rules for pod failures — CrashLoopBackOff, ImagePullBackOff, high restart count, node disk pressure
- Alerts route to Teams/email for on-call
Logs — Loki + Promtail:
- Promtail ships pod logs from every node to Loki
- Grafana queries Loki for log drill-down when an alert fires
Traces — OpenTelemetry:
- OTel SDK in microservices exports distributed traces
- Identify latency bottlenecks across dependencies (API → DB → cache)
Azure-native (platform layer):
- Azure Monitor + Log Analytics for platform and audit logs
- Container Insights for AKS cluster health
- Application Insights for APM where enabled
- Azure Service Health — rule out regional outages first
Integration workflow at HHS: Alert fires in Grafana → check dashboard → drill Loki logs → OTel trace for latency → `kubectl` if Kubernetes issue. Correlating all three pillars reduced MTTR.
CI/CD: Azure DevOps deployment markers correlated with metric spikes post-release."
| Pillar | Tool | Integration |
|---|---|---|
| Metrics | Prometheus → Grafana | Scrape pods/nodes; alert rules → Teams |
| Logs | Promtail → Loki → Grafana | Log drill-down from alert |
| Traces | OpenTelemetry → Grafana/backend | End-to-end latency |
| Platform | Azure Monitor, Container Insights | AKS/node health, KQL queries |
- How do Prometheus and Grafana connect? (ServiceMonitor / scrape config)
- What happens when an alert fires? (dashboard → logs → traces → kubectl)
- Difference between Container Insights and Prometheus?
- How do you avoid alert fatigue?
"In production we configure category-based alerts — each with a linked runbook. We deliberately reduced alert count so every alert is actionable — no noise.
| Category | Examples | Severity |
|---|---|---|
| Availability | CrashLoopBackOff, OOMKilled, ImagePullBackOff, Pending, HTTP 5xx rate | P1 / Sev 1 |
| Performance | P95 latency > 2s, CPU > 85%, memory > 90% | P2 / Sev 2 |
| Capacity | Disk > 80%, node pool near max, DB connection pool exhaustion | P2 |
| Security | Key Vault access anomalies, failed login spikes, Defender for Cloud alerts | P1–P2 |
| Backup/DR | Backup job failed, ASR replication lag | P2 |
| Certificate | TLS cert expiry < 30 days | P2 |
| Cost | Budget threshold 80%/100% | P3 |
Alerts route to Teams/on-call. On fire → follow runbook → Grafana → logs → mitigate. We tune thresholds post-incident to keep false positives low."
- Every alert = runbook — no alert without an action
- Availability = P1 — pod failures and 5xx directly impact users
- Backup job failed ≠ CI/CD pipeline failure (say the right one in interview)
- Tune regularly — alert fatigue kills on-call effectiveness
- How do you reduce alert fatigue?
- What's P1 vs P2 in your org?
- Example runbook for CrashLoopBackOff?
"Production uses MongoDB Atlas. We rely on Atlas-native monitoring — Performance Advisor, slow query logs, and Real-Time Performance Panel — plus alerts on connections, replication lag, and opcounters.
Key metrics we watch:
- CPU, memory, disk usage
- Active connections and connection pool exhaustion from apps
- Replication lag (replica set health)
- Slow queries and missing indexes — look for COLLSCAN (collection scans, not table scans — MongoDB uses collections)
- Query execution times and index usage
Integration: MongoDB exporter → Prometheus → Grafana dashboards for unified observability alongside AKS metrics.
Lower environments: Same exporter pipeline; we also use `mongostat` / `explain()` on slow queries for troubleshooting.
Red flags: COLLSCAN on large collections, rising connection count, replication lag > threshold, working set exceeding RAM."
| Term | Correct |
|---|---|
| MongoDB data unit | Collections (not tables) |
| Missing index symptom | COLLSCAN in query plan |
| Your stack | Atlas + MongoDB exporter → Prometheus → Grafana |
- What is COLLSCAN and how do you fix it?
- How do you find slow queries in Atlas?
- Connection pool exhaustion — what causes it?
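For the COLLSCAN follow-up, it helps to show where the stage appears in an explain result. The sketch assumes the classic explain shape, where `queryPlanner.winningPlan` is a tree of `stage` nodes linked by `inputStage` / `inputStages`; newer server versions can nest the plan differently, and the sample documents are invented for illustration.

```python
def find_stages(plan: dict) -> list[str]:
    """Collect every stage name in a (classic-format) explain plan tree."""
    stages = [plan["stage"]] if "stage" in plan else []
    if "inputStage" in plan:
        stages += find_stages(plan["inputStage"])
    for child in plan.get("inputStages", []):
        stages += find_stages(child)
    return stages


def has_collscan(explain_output: dict) -> bool:
    """True if the winning plan scans the whole collection."""
    winning = explain_output.get("queryPlanner", {}).get("winningPlan", {})
    return "COLLSCAN" in find_stages(winning)


if __name__ == "__main__":
    unindexed = {"queryPlanner": {"winningPlan": {"stage": "COLLSCAN"}}}
    indexed = {
        "queryPlanner": {
            "winningPlan": {"stage": "FETCH", "inputStage": {"stage": "IXSCAN"}}
        }
    }
    print(has_collscan(unindexed), has_collscan(indexed))
```

The fix for a COLLSCAN on a hot query is an index that covers the filter fields, after which the plan should show IXSCAN instead.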
"Redis replication creates an asynchronous copy of the primary for high availability and read scaling. The primary handles writes; replicas sync via
`PSYNC`.
Lower environments (self-managed Redis):
- Master + replica topology with Redis Sentinel
- Sentinel monitors the primary; on failure it elects and promotes a replica, then apps reconnect to the new primary
Production — Azure Cache for Redis Premium:
- Primary-replica replication with automatic failover (Azure-managed on Premium)
- Zone redundancy for HA across availability zones
- Private Endpoint — AKS pods connect over the VNet; Private DNS zone resolves the Redis hostname inside the cluster network
Monitoring: `used_memory`, `evicted_keys`, `connected_clients`, replication lag
App side: Connection multiplexer with retry/reconnect logic so failover is transparent to the application."
| Environment | Setup | Failover |
|---|---|---|
| Lower env | Self-hosted + Sentinel | Sentinel promotes replica |
| Production | Azure Redis Premium + zone redundancy | Azure automatic failover |
| Network | Private Endpoint + Private DNS zone | Pods in AKS VNet only |
- Sync vs async replication — what can you lose on failover?
- Difference between Sentinel and Redis Cluster?
- How does Private Endpoint DNS work for Redis from AKS?
"My primary cloud is Azure, so most of my production work is AKS, Azure DevOps, and Azure data services.
Cloud Run and BigQuery: I don't have deep hands-on production experience with these. I understand the concepts — Cloud Run is serverless containers (similar to Azure Container Apps), and BigQuery is a serverless analytics warehouse (similar to Azure Synapse / serverless SQL patterns).
Airflow — I have practical experience:
- Used to author, schedule, and monitor workflows
- Workflows are defined as DAGs (Directed Acyclic Graphs) in Python code
- Each DAG has tasks with dependencies — upstream must succeed before downstream runs
- Supports retries, scheduling (cron), and a UI to track run status
- Pipelines are version-controlled in Git — scalable and repeatable for ETL/batch jobs
The orchestration concepts transfer: Airflow DAG patterns map closely to Azure Data Factory pipelines and activity dependencies."
| GCP service | If no hands-on | Azure equivalent |
|---|---|---|
| Cloud Run | Serverless containers | Container Apps |
| BigQuery | Analytics warehouse | Synapse / analytics SQL |
| Airflow | DAG-based orchestration | Data Factory pipeline patterns |
Interview rule: Be honest on gaps; show you know equivalents and own Airflow clearly.
- What is a DAG in Airflow?
- How do task dependencies work?
- Airflow vs Azure Data Factory?
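To answer "what is a DAG" without needing an Airflow install, the dependency idea can be shown with Python's standard-library `graphlib`. This is not Airflow's API; the task names are hypothetical and the mapping reads "task → set of upstream tasks".

```python
from graphlib import CycleError, TopologicalSorter


def run_order(dag: dict[str, set[str]]) -> list[str]:
    """Return one valid execution order: every task after all its upstreams."""
    return list(TopologicalSorter(dag).static_order())


if __name__ == "__main__":
    etl = {
        "transform": {"extract"},   # transform waits for extract
        "load": {"transform"},
        "report": {"load"},
        "notify": {"load"},
    }
    print(run_order(etl))

    # "Acyclic" matters: a cycle has no valid order, so the sorter rejects it.
    try:
        run_order({"a": {"b"}, "b": {"a"}})
    except CycleError:
        print("cycle detected")
```

Airflow adds scheduling, retries, and a UI on top of this ordering, and Azure Data Factory expresses the same idea through activity dependencies.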
"My production depth is Azure, but the multi-cloud pattern we follow — or would follow — is one observability stack, not two.
Unified OSS stack on both clouds:
- Deploy the same Prometheus + Grafana + Loki (and ELK where needed) on AKS and GKE
- Single pane of glass — Grafana dashboards with mixed data sources from both clouds
Standardized instrumentation:
- OpenTelemetry in all microservices — self-hosted collector or export to Application Insights on Azure; same OTel pattern on GCP for trace correlation across services
Platform layer per cloud:
- Azure: Azure Monitor, Container Insights, Cost Management; resources tagged (`environment`, `team`, `cost-center`) for segregation
- GCP: Cloud Monitoring + Logging forwarded to the same central sink (or scraped into Prometheus)
Operations: One on-call rotation, one runbook, one escalation policy (Teams/PagerDuty) — avoid duplicating alert logic per cloud. Cloud-specific dashboards only where the platform differs.
Cost: Azure Cost Management + GCP Billing export — unified FinOps view via tags."
| Principle | How |
|---|---|
| Same stack both clouds | Prometheus/Grafana/Loki on AKS + GKE |
| Trace correlation | OpenTelemetry everywhere |
| No duplicate alerts | One runbook, one on-call |
| Segregation | Consistent tags on Azure and matching labels on GCP |
- How do you correlate traces across Azure and GCP services?
- Grafana mixed data sources — how configured?
- How avoid alert fatigue across two clouds?
"We apply layered security across Azure — identity, network, compute, data, governance, secrets, and CI/CD.
Identity: Enforce MFA, Conditional Access, PIM for admin roles — no standing privileged access, no shared accounts.
Network: WAF on App Gateway/Front Door, Azure Firewall for outbound control, NSGs on subnets, Private Endpoints for PaaS services, DDoS Protection on public-facing resources. No public IPs on production workloads where possible.
Compute / AKS: Private AKS cluster, ACR with trusted images only, Defender for Cloud image scanning, Pod Security Standards, Workload Identity / Managed Identity — no static credentials in pods.
Secrets: Key Vault for secrets, certs, and keys — accessed via Managed Identity, never in code or repos.
Data: Encryption at rest (CMK via Key Vault), TLS 1.2+ enforced, soft delete on storage and Key Vault.
Governance: Azure Policy — deny public storage, require tags, enforce HTTPS, block non-compliant resources at deploy time.
CI/CD (shift-left): GitLeaks, Trivy image scanning, SonarQube in Azure DevOps pipeline before anything reaches prod.
Monitoring: Defender for Cloud secure score, alerts on misconfigurations and anomalies.
Everything is codified in Terraform so security baseline is repeatable and drift is caught early."
| Layer | Your HHS examples |
|---|---|
| Identity | MFA, Conditional Access, PIM |
| Network | WAF, Firewall, NSG, Private Endpoints |
| Compute | Private AKS, ACR, Defender scanning |
| Secrets | Key Vault + Managed Identity |
| Governance | Azure Policy |
| CI/CD | GitLeaks, Trivy, SonarQube |
- What is PIM and why use it?
- Private Endpoint vs Service Endpoint?
- How does Workload Identity work with Key Vault?
"Yes. I participated in vulnerability scanning and penetration testing activities and owned infra-side remediation from scan reports.
Application / image scanning:
- Trivy in CI pipeline flagged CVEs in container images → we updated base images and rebuilt pipelines
- OWASP ZAP integrated in Jenkins pipeline for DAST on applications → shared findings with dev team for code fixes
Azure infrastructure findings and remediation:
- Soft delete not enabled on critical storage/Key Vault → enabled soft delete + purge protection
- NSG ports open to 0.0.0.0/0 → restricted to known IP ranges / Bastion, Azure Policy to deny wide-open rules
- Public endpoints where Private Endpoints were required → migrated PaaS to Private Endpoint + Private DNS
- TLS 1.0/1.1 on App Gateway → enforced TLS 1.2 minimum SSL policy, rescanned clean
- AKS secrets in plain Kubernetes Secrets (base64, not encrypted) → migrated to Key Vault CSI driver + Workload Identity
- Overprivileged users / service principals → scoped Azure RBAC to least privilege, custom roles instead of Contributor
Process: Triage by severity/CVSS → assign owner → remediate → rescan/retest → close ticket. Critical findings within agreed SLA."
| Term | Azure interview |
|---|---|
| IAM | Say Azure RBAC (not IAM — that's AWS) |
| K8s Secrets | Base64 ≠ encryption — use Key Vault CSI |
| ZAP | OWASP ZAP — DAST tool |
| Your tools | Trivy (images), ZAP (apps), config review (infra) |
- Difference between VA scan and PT?
- How do you prioritize findings?
- What is Key Vault CSI vs Kubernetes Secrets?
"Cluster access:
- Private AKS cluster — API server inside the VNet, not publicly accessible
- Microsoft Entra ID integration with Azure RBAC for Kubernetes — authorized access only, no shared kubeconfig
Network:
- Network Policies — deny all by default, allow only required pod-to-pod traffic
- WAF on Azure Front Door for ingress traffic to workloads behind the cluster
Identity & secrets:
- Workload Identity Federation + Managed Identity — pods access Key Vault via CSI driver, no static credentials or plain Kubernetes Secrets
Images & supply chain:
- ACR as trusted registry only; Defender for Cloud image scanning + Trivy in CI pipeline
Governance:
- Azure Policy add-on for AKS — enforce no privileged containers, required labels, approved SKUs
- Pod Security Admission (restricted/baseline)
Operations:
- Regular Kubernetes version upgrades and CVE patching
- Audit logs to Log Analytics"
| Your term | Interview polish |
|---|---|
| Entra ID | Correct — say Microsoft Entra ID + Azure RBAC for K8s |
| Federated identity | Workload Identity Federation |
| Policy engine | Azure Policy add-on for AKS |
| WAF + Front Door | Valid — ingress layer for apps on AKS |
- Private cluster — how do you run kubectl/CI/CD?
- Workload Identity vs Managed Identity?
- Network Policy example — deny all ingress?
"I use Azure Cost Management, Azure Monitor, and Azure Advisor to identify where money is being wasted.
How I find leakages:
- Cost Analysis by resource and tag — spot spikes and unattached resources
- Advisor recommendations — right-sizing VMs and databases, idle resources
- Hunt for orphaned disks, snapshots, NICs, and public IPs still billing
- Review non-prod running 24/7 — VMs and AKS node pools with no auto-scale-down
- Check storage tiers — Premium disks or Hot blob where Cool/Archive fits
Remediation / prevention:
- Right-size VMs and DB SKUs per Advisor
- Remove orphan resources and extra snapshots
- Cluster Autoscaler on AKS — scale nodes to demand, avoid idle capacity
- VM auto start/stop schedules for off-hours (dev/test)
- Blob lifecycle policies — move logs/backups to Cool/Archive
- Reserved Instances / Savings Plans for predictable steady workloads
- Tags via Azure Policy for cost allocation and showback
Cost leakage is usually orphaned resources and wrong SKU/tier — monthly FinOps review with Advisor + Cost Management."
| Identify (find waste) | Prevent (fix waste) |
|---|---|
| Cost Analysis, Advisor | Right-sizing, cleanup |
| Orphan disks/snapshots/IPs | Delete + Policy guardrails |
| Idle non-prod 24/7 | Auto-shutdown, Cluster Autoscaler |
| Wrong storage tier | Lifecycle policies |
- Difference between RI and Savings Plans? (Q43)
- How does Cluster Autoscaler save money?
- What tags do you enforce for FinOps?
- Reserved Instances / Savings Plans for stable AKS node pools and SQL — ~35% savings
- Spot node pools for CI/CD and batch workloads — ~70% compute savings on those pools
- Auto-shutdown schedules for dev/test VMs and scaled-down non-prod AKS at night
- Storage lifecycle policies — moved logs/backups to Cool/Archive tier — ~40% storage reduction
- Rightsizing via Azure Advisor recommendations — downsized 15 VMs
- Tagging enforcement via Policy — enabled showback/chargeback per team
- Consolidated Log Analytics workspaces to reduce ingestion duplication
| Option | Commitment | Flexibility | Discount | Best For |
|---|---|---|---|---|
| Reserved Instances (RI) | 1 or 3 year, specific VM family + region | Low — exchange possible for different SKU/region | Up to ~72% | Stable, predictable workloads (fixed SQL tier) |
| Savings Plans | 1 or 3 year, $/hour spend commitment | High — change VM type, region, applies across compute | Up to ~65% | Mixed/evolving workloads (AKS node pools) |
"Both are Azure cost-saving commitment options for 1–3 years.
Reserved Instances lock to a VM family and region — up to 72% discount. You can exchange RIs if needs change, but flexibility is limited.
Savings Plans commit to an hourly dollar spend — up to 65% discount — and flex across VM types and regions. Better when SKUs change, like AKS node pool upgrades.
I analyze 30–90 day utilization in Cost Management before purchasing. Savings Plans are my default for AKS; RIs for steady SQL/Cosmos tiers. Combined with Spot pools for CI/batch, we reduce overall compute cost significantly."
- Savings Plan = $/hour commit, flexible VM family
- RI = specific SKU, higher discount, less flexible
- AKS → prefer Savings Plans (node SKUs evolve)
- Always check 30–90 day usage before committing
- Can you use RI and Savings Plan together?
- What happens if you under-utilize the commitment?
- Spot vs Savings Plan for AKS?
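The under-utilization follow-up is easiest to explain with numbers. The sketch below is a deliberately simplified model of an hourly spend commitment: one flat discount rate, usage covered up to the commitment, overflow billed at pay-as-you-go. The function name, the 50% discount, and the dollar amounts are hypothetical and do not reflect actual Azure pricing.

```python
def hourly_bill(payg_usage: float, commitment: float, discount: float) -> float:
    """Hourly cost under a spend commitment (simplified single-rate model).

    payg_usage: what the hour's usage would cost at pay-as-you-go prices
    commitment: committed $/hour, always billed whether used or not
    discount:   fractional discount applied to usage covered by the commitment
    """
    discounted_usage = payg_usage * (1 - discount)
    if discounted_usage <= commitment:
        return commitment  # under-utilized: the unused commitment is lost
    # Usage beyond the commitment is billed at the pay-as-you-go rate.
    overflow_payg = (discounted_usage - commitment) / (1 - discount)
    return commitment + overflow_payg


if __name__ == "__main__":
    for usage in (4.0, 10.0, 20.0):
        print(usage, hourly_bill(usage, commitment=5.0, discount=0.5))
```

In this model, an hour with only $4 of pay-as-you-go usage still costs the full $5 commitment, which is why the 30–90 day utilization review comes before any purchase.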
"Storage optimization:
- Blob lifecycle policies — move logs/backups Hot → Cool → Archive automatically
- Delete orphaned disks, snapshots, and old unwanted archives
- Right redundancy tier — LRS for non-critical, ZRS for HA
Idle resource optimization:
- Use Azure Monitor and Advisor to find unused VMs, disks, public IPs, NICs
- Deallocate/remove idle VMs and associated disks
- Auto-shutdown schedules for dev/test VMs off-hours
- Scale down non-prod AKS node pools when not in use
- Monthly runbook to clean orphaned storage and unattached resources"
- Monitor + Advisor = find waste
- Lifecycle policies = prevent storage creep
- Orphan cleanup = disks, snapshots, NICs, IPs
"I follow a structured top-down approach — confirm scope, check infra, then drill into the database.
1. Confirm & correlate:
- When did it start? Recent deploy, config change, or Key Vault rotation?
- Check Grafana/App Insights — latency spike on DB dependency span in OpenTelemetry traces
- Rule out AKS/network — pod CPU, connection errors, Private Endpoint issues
2. Database metrics (platform layer):
- Azure SQL: DTU/vCore utilization, connection count, deadlocks, wait stats
- Cosmos DB: RU/s consumption and throttled (HTTP 429) requests
- MongoDB Atlas: Performance Advisor, slow query logs, opcounters, replication lag, connections
- Postgres: active connections, lock waits, `pg_stat_statements`
3. Query analysis:
- Identify slow queries — missing indexes, COLLSCAN, full table scans
- Run `explain()` / Query Performance Insight
- Check if app released a bad query or N+1 pattern after deploy
4. Resource & capacity:
- CPU/memory/IO saturation on DB tier — need scale-up?
- Connection pool exhaustion from AKS pods — too many replicas or leak?
- Redis cache miss rate — is load hitting DB directly?
5. Mitigate & escalate:
- Short-term: scale DB tier, kill long-running queries, enable read replica
- Engage DBA team for schema/index fixes
- Document RCA — index added, query optimized, pool size tuned
At HHS: Platform owns infra metrics and connectivity; DBA owns schema/query tuning — I bridge both with traces and connection troubleshooting."
| Layer | Check |
|---|---|
| App/trace | OTel — is DB span the bottleneck? |
| Connectivity | Private Endpoint, NSG, connection string |
| DB metrics | DTU/vCore, connections, replication lag |
| Queries | Slow query log, missing indexes, COLLSCAN |
| Cache | Redis hit rate — bypass cache? |
- How do you find a slow query in MongoDB Atlas?
- Connection pool exhaustion — symptoms?
- Scale-up vs scale-out for Azure SQL?
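The connection pool exhaustion check in step 4 comes down to multiplication. A minimal sketch, with hypothetical names and numbers, of the worst-case connection demand from AKS pods against the database connection limit:

```python
def connection_headroom(
    replicas: int,
    pool_size_per_pod: int,
    db_max_connections: int,
    reserved_connections: int = 0,
) -> int:
    """Return spare DB connections if every pod fills its pool (negative = exhaustion risk).

    reserved_connections: connections kept back for admin, monitoring, or other apps
    (an assumed allowance, not a platform default).
    """
    worst_case_demand = replicas * pool_size_per_pod
    return db_max_connections - reserved_connections - worst_case_demand
```

For example, 30 replicas with a pool of 20 each need up to 600 connections. Against a limit of 500 with 20 reserved, that is 120 short, which is why an HPA scale-out can surface as database connection errors.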
"I use impact × urgency matrix:
| Priority | Criteria | Example |
|---|---|---|
| P1 | Full outage, revenue/security impact | Payment down, data breach |
| P2 | Major degradation, workaround exists | Slow checkout, single region |
| P3 | Minor impact, few users | Non-critical report failing |
| P4 | Cosmetic / planned | UI alignment issue |

Rules:
- P1 always wins — all hands, incident commander assigned
- If two P1s: business impact (revenue > internal tools)
- Communicate reprioritization to stakeholders
- Don't context-switch — assign owners per incident
- Post-incident: review if alerts should have fired earlier"
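The ordering rules above (priority first, then business impact between incidents of equal priority) can be sketched as a sort key. The `business_impact` score is a hypothetical field for illustration; in practice it would come from your incident tooling or the incident commander's judgment.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    name: str
    priority: int         # 1 = P1 ... 4 = P4
    business_impact: int  # hypothetical score; higher = more revenue/security impact


def triage_order(incidents: list[Incident]) -> list[Incident]:
    """Sort by priority, then by descending business impact within the same priority."""
    return sorted(incidents, key=lambda i: (i.priority, -i.business_impact))
```

Two P1s therefore sort with the revenue-impacting one first, and each still gets its own owner.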
Situation: At HHS, after a Terraform apply that recreated cluster identity resources, multiple microservices entered ImagePullBackOff — pods could not pull container images from ACR. This was a P1 — production workloads for state government clients were failing to start. Grafana alert fired on pod restart count.
Task: As on-call platform engineer, I needed to restore image pull capability and get pods running — fast.
Action:
- Joined Teams war room, confirmed no Azure regional outage (Service Health clean)
- `kubectl describe pod` → Events showed 401/403 unauthorized pulling from ACR
- Correlated with recent Terraform apply — Managed Identity was recreated but AcrPull role assignment was missing
- Fixed RBAC in Terraform — re-applied `AcrPull` on the kubelet identity → ACR
- Triggered pod restart / redeploy → verified images pulling and pods Running
- Notified client team of status and resolution
Result:
- Service restored in ~10 minutes
- Added Terraform validation step to verify MI role assignments post-apply
- Added Grafana alert for ImagePullBackOff with linked runbook
- Documented RCA in ticket; shared learnings in post-mortem
Lesson: Infrastructure changes that touch identity must include RBAC validation — ACR pull depends on the `AcrPull` role on the cluster's Managed Identity."
- Use your real story — ImagePullBackOff, not a generic outage
- STAR format: Situation → Task → Action → Result
- Root cause: Terraform + missing AcrPull — shows depth
- How did you diagnose it was ACR auth and not a bad image tag?
- What is the AcrPull role?
- How do you prevent this in the future?
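The post-apply validation step from the Result can be sketched as a check over role assignment records. This assumes records shaped like `az role assignment list -o json` output (`principalId`, `roleDefinitionName`, `scope`). The function name and the sample IDs in the usage note are hypothetical. It checks only the explicit `AcrPull` role, not broader roles that also allow pulls.

```python
def has_acr_pull(assignments: list[dict], principal_id: str, acr_resource_id: str) -> bool:
    """Return True if the principal holds AcrPull on the registry or on a parent scope."""
    acr = acr_resource_id.lower()
    for a in assignments:
        if a.get("principalId") != principal_id:
            continue
        if a.get("roleDefinitionName") != "AcrPull":
            continue
        scope = a.get("scope", "").lower()
        if not scope:
            continue
        # Role assignments are inherited, so a resource group or subscription
        # scope that is a path prefix of the registry ID also grants the role.
        if acr == scope or acr.startswith(scope.rstrip("/") + "/"):
            return True
    return False
```

A pipeline step could feed this the role assignments for the kubelet identity and fail the run when it returns False, before any pod hits ImagePullBackOff.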
kubectl get pods -A -o wide
kubectl describe pod <name> -n <ns>
kubectl logs <pod> -n <ns> -f
kubectl logs <pod> -n <ns> --previous
kubectl top pods -A
kubectl top nodes
kubectl rollout status deployment/<name>
kubectl rollout undo deployment/<name>
kubectl exec -it <pod> -n <ns> -- /bin/sh
kubectl get events -n <ns> --sort-by='.lastTimestamp'
// Failed requests last hour
requests | where timestamp > ago(1h) and success == false | summarize count() by resultCode, name
// Exceptions spike
exceptions | where timestamp > ago(1h) | summarize count() by type, outerMessage
// Pod restarts
KubePodInventory | where PodRestartCount > 0 | project Computer, Name, PodRestartCount
az aks get-credentials --resource-group <rg> --name <cluster>
az vm list-usage --location eastus
az backup recoverypoint list --resource-group <rg> --vault-name <vault> --container-name <vm> --item-name <vm>
az network watcher test-connectivity --resource-group <rg> --source-resource <vm> --dest-address <ip> --dest-port <port>
az aks upgrade --resource-group <rg> --name <cluster> --kubernetes-version 1.28.5
- What does the Azure landing zone / platform team structure look like?
- What is the current split between IaaS, PaaS, and AKS workloads?
- How mature are CI/CD and GitOps practices today?
- What does the on-call rotation and incident management process look like?
- Are there multi-cloud requirements (Azure + GCP/AWS)?
- What are the top 3 platform challenges you're hiring this role to solve?
- How does the team approach FinOps and cost accountability?
- What does success look like in the first 90 days?
- Customize all `[PLACEHOLDER]` and sample company references
- Prepare 2–3 STAR stories (incident, cost optimization, migration)
- Be ready to whiteboard hub-spoke + AKS architecture
- Review Azure Service Health for any recent outages (shows awareness)
- Have salary expectations and notice period ready
- Test video/audio if virtual interview
- Keep this doc open on second monitor for quick reference (don't read verbatim!)
Questions interviewers ask after your main answer. Prepare 2–3 sentences each.
| Follow-up | Senior answer hint |
|---|---|
| Why Azure over AWS for this project? | Gov compliance, client requirement, AKS + Azure DevOps integration, existing Entra ID |
| What was your biggest architecture challenge? | Private AKS + PE for all PaaS + multi-state latency via Front Door |
| How many environments? | Dev, QA, Prod — separate namespaces/subscriptions, promotion via pipeline |
| Follow-up | Senior answer hint |
|---|---|
| Private vs public AKS cluster? | Private API server, private nodes, Bastion for admin, no public kube API |
| How do pods pull from ACR privately? | PE on ACR + Managed Identity AcrPull + Private DNS |
| Node pool failed to provision? | Quota, subnet full, NSG blocking, SKU unavailable — check provisioningState in az aks nodepool list and the Activity Log |
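| AKS upgrade failed mid-way? | Check surge capacity and cordon/drain blockers (PDBs, quota, subnet IPs), fix, then re-run the upgrade; AKS has no in-place downgrade, so fall back by adding a new node pool and migrating workloads |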
| Follow-up | Senior answer hint |
|---|---|
| Deployment vs StatefulSet? | Deployment = stateless; StatefulSet = stable identity + PV for DB |
| How do you roll back a bad deploy? | helm rollback to the last good release, which redeploys its immutable ACR tag; never deploy latest in prod |
| SonarQube quality gate failed — what now? | Fix code or get exception approval; never bypass gate in gov prod |
| GitLeaks found a secret? | Revoke key in Azure, rotate in Key Vault, rewrite git history or invalidate commit |
| Follow-up | Senior answer hint |
|---|---|
| NSG vs Azure Firewall? | NSG = L3/L4 subnet/NIC rules; Firewall = centralized L3–L7 app rules, IDPS |
| PE vs Service Endpoint? | PE = private IP in VNet; Service Endpoint = VNet routing, service still has public IP |
| DNS not resolving for PE? | Private DNS Zone not linked to VNet, or AKS using public DNS forwarder |
| Follow-up | Senior answer hint |
|---|---|
| Where are VM backups stored? | Recovery Services Vault — not Blob (Velero uses Blob) |
| RTO vs RPO? | RPO = max data loss window; RTO = max downtime; define per tier |
| Tested restore ever? | Yes — GitLab VM restore drill, Velero namespace restore to QA |
| # | Question | Why it matters |
|---|---|---|
| E1 | Explain Workload Identity Federation step-by-step | You mentioned OIDC — they will probe |
| E2 | How does AGIC route traffic? | AGIC watches Ingress and programs App Gateway; traffic flows App Gateway → Pod IP directly (AGIC is not in the data path) |
| E3 | Terraform state locking — how? | azurerm backend on Blob Storage; locking via blob lease |
| E4 | How do you secure pipeline service connections? | Workload Identity Federation, no secrets in YAML |
| E5 | What Azure Policy do you apply on AKS? | No privileged containers, required labels, allowed images |
| E6 | How do you handle secrets in Git? | GitLeaks in CI, Key Vault CSI in cluster, never commit |
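| E7 | Difference between LRS, GRS, ZRS? | LRS = 3 copies in one datacenter; ZRS = copies across 3 zones in one region; GRS = LRS + async copy to the paired region; applies to Storage and Backup vault |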
| E8 | How do you do zero-downtime deploy on AKS? | Rolling update + readiness probes + Deployment maxSurge/maxUnavailable (set via Helm values) |
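For E8, it helps to be able to state what `maxSurge` and `maxUnavailable` mean in pod counts. A minimal sketch of the Kubernetes rounding rules for a Deployment rolling update (percentages round up for `maxSurge` and down for `maxUnavailable`); the function name is hypothetical and the sketch does not validate that both values are non-zero together.

```python
def rolling_update_bounds(replicas: int, max_surge="25%", max_unavailable="25%") -> tuple[int, int]:
    """Return (minimum available pods, maximum total pods) during a rolling update."""

    def resolve(value, round_up: bool) -> int:
        if isinstance(value, str) and value.endswith("%"):
            scaled = replicas * int(value[:-1])
            # Integer arithmetic avoids float rounding: ceil for surge, floor for unavailable.
            return -(-scaled // 100) if round_up else scaled // 100
        return int(value)

    surge = resolve(max_surge, round_up=True)
    unavailable = resolve(max_unavailable, round_up=False)
    return replicas - unavailable, replicas + surge
```

With 10 replicas and the default 25%/25%, at least 8 pods stay available and at most 13 exist at once. Setting `max_unavailable` to 0 keeps full capacity throughout, at the cost of needing surge headroom on the node pool.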
Practice Senior Answer sections aloud. Use Important Points as a checklist before each mock question.