Skip to content

Latest commit

 

History

History
1875 lines (1491 loc) · 86.3 KB

File metadata and controls

1875 lines (1491 loc) · 86.3 KB

Azure & DevOps Interview Preparation Guide


🎯 SENIOR INTERVIEW MODE (Read Before Every Session)

You are preparing for a senior Azure/DevOps role (8–9 years expected depth). Your real experience is ~6 years — that is enough if you demonstrate senior-level thinking: architecture, security, trade-offs, failure scenarios, and cross-team ownership.

How Senior Answers Differ from Junior

Junior (❌) Senior (✅)
"Private endpoint uses private IP" Same + Private Link, Private DNS Zone, disable public access, NSG on PE subnet, gov compliance reason
Lists tools Explains why you chose them and what breaks without them
One-line definitions Definition + HHS example + follow-up readiness
"I managed AKS" "I owned AKS platform — create, secure, upgrade, DR, cost, incidents"

Document Format (per question)

Section Purpose
Senior Answer (8–9 Years) Complete interview-ready response — practice this
Important Points Terms, order, and traps — do not skip in the interview
Likely Follow-Up Questions Cross-questions interviewers ask next

Customize answers with HHS / AKS / 30+ microservices where relevant. Do not claim tools you have not used.

See Follow-Up Question Bank at the bottom of this doc.


Your Profile (Mock Interview — Live Updates)

Item Your Details
Name X
Interview target Senior Azure DevOps / Platform Engineer (8–9 yr depth)
Real experience ~6 years hands-on — frame as senior-scope ownership
Cloud Azure (primary) + AWS
Certs AZ-900, AWS Solutions Architect
Recent project HHS — healthcare, US state gov (FL, IA, SD), 30+ microservices on AKS
Core work Platform ownership: AKS, Terraform, Azure DevOps, DR, security, FinOps
Company HHS
Job title DevOps / Cloud Engineer (platform owner on gov healthcare)

How to Use This Document

  1. Practice aloud — aim for 60–90 seconds per intro question, 2–3 minutes for architecture questions.
  2. Use STAR (Situation, Task, Action, Result) for behavioral and incident questions.
  3. Mock interview progress is tracked below — one question at a time.

Mock Interview Progress

# Question Status
1 Introduce yourself
2 Years in Azure & DevOps
3 Recent project architecture
4 Azure services worked on
5 VMs, Storage, Backup
6 DR & Azure Backup
7 AKS clusters
8 Pods, Deployments, Services
9 CI/CD tools
10 Dev to Prod deployment
11 VNets and NSGs
12 Public vs Private Endpoints
13 Monitoring tools
14 Troubleshoot production
15 Databases
16 DBA activities
17 Production incidents
18 Client coordination
19 Azure environment architecture
20 Unreachable VM troubleshooting
21 DR testing in Azure
22 Complete AKS architecture
23 AKS cluster upgrades
24 CrashLoopBackOff troubleshooting
25 HPA and Cluster Autoscaler
26 Slow AKS app troubleshooting
27 VNet/NSG/Route Tables
28 ExpressRoute and VPN ✅ Passed
29 Azure DevOps pipeline architecture ✅ Passed
30 Blue-Green and Canary ✅ Passed
31 Production rollbacks ✅ Passed
32 Monitoring tools integrated ✅ Passed
33 Production alerts configured ✅ Passed
34 MongoDB performance monitoring ✅ Passed
35 Redis replication and failover ✅ Passed
36 GCP Cloud Run / BigQuery / Airflow ⏳ (repeat at end)
37 Multi-cloud monitoring (Azure + GCP) ✅ Passed
38 Security hardening in Azure ✅ Passed
39 VA/PT findings ✅ Passed
40 Secure AKS clusters ✅ Passed
41 Cloud cost leakages ✅ Passed
42 Cost optimization initiatives ⏭️ Skipped (same as Q41)
43 Reserved Instances & Savings Plans ✅ Passed
44 Storage & idle resource optimization ✅ Passed
45 Database response time troubleshooting ✅ Passed
46 Prioritize multiple production issues ✅ Passed
47 Critical production issue (STAR)

SECTION 1: INTRODUCTION & BACKGROUND

Document format: Each question has a Senior Answer, Important Points, and Likely Follow-Ups. No raw mock transcripts — customize with your HHS experience when practicing.


1. Please introduce yourself and explain your current role and responsibilities.

Senior Answer (8–9 Years)

"Good morning. I'm X, a DevOps Engineer with six years of experience building secure and scalable infrastructure on Azure and AWS, with my primary focus on Microsoft Azure.

In my role at HHS, I worked on a healthcare platform for US state government clientsFlorida, Iowa, and South Dakota. I owned end-to-end infrastructure: provisioning cloud resources, deploying applications on AKS, and day-to-day Kubernetes cluster operations — including upgrades, monitoring, and troubleshooting.

My responsibilities fall into four areas:

  • Platform & Kubernetes: AKS cluster maintenance, node pool updates, ingress, and workload deployments
  • CI/CD & Automation: Pipelines for 30+ microservices in Azure DevOps and Jenkins — including automated tasks like restoring non-production Postgres data from production to dev/stage environments on demand
  • Operations: Incident response, cluster health monitoring, and fast troubleshooting during production issues
  • Security & Compliance: Government healthcare compliance requirements and secure pipeline practices

I hold AZ-900 and AWS Solutions Architect certifications. I'm looking to bring my hands-on AKS and Azure DevOps experience into a role where I can own platform reliability at scale.

Thank you."


2. How many years of experience do you have in Azure and DevOps?

Senior Answer (8–9 Years)

"I have six years of experience in DevOps and cloud engineering, working across both Azure and AWS.

Azure: Approximately four years of hands-on experience — including VMs, VNets, NSGs, Application Gateway, AKS, ACR, and Azure DevOps pipelines. At HHS, my recent production work has been primarily on Azure — deploying and operating 30+ microservices on AKS using Azure DevOps for CI/CD.

AWS: I also have solid AWS exposure from earlier projects — EC2, S3, IAM, and pipeline integrations — which helps when working in multi-cloud environments.

My growth path:

  • Early years: Git, Jenkins, basic cloud VMs, CI/CD fundamentals
  • Mid years: Infrastructure as Code with Terraform, containerization with Docker
  • Recent years: Deep Azure focus — AKS cluster operations, Azure DevOps repos and pipelines, RBAC for team access, and government healthcare compliance

So while I'm comfortable on both clouds, my deepest recent production experience is on Azure — especially AKS, Azure DevOps, and day-to-day platform operations. I'm fully comfortable owning infrastructure through deployment and production support."

3. Describe your recent project architecture.

Senior Answer (8–9 Years)

"My recent project at HHS is a healthcare platform for US state governments — Florida, Iowa, South Dakota — running 30+ microservices on Azure.

1. Infrastructure (IaC): Everything is provisioned with Terraform — VNet, subnets, AKS, ACR, Application Gateway, Front Door, Key Vault, and supporting resources.

2. Networking: We use a VNet with private subnets. AKS runs on private nodes — no direct public exposure. Azure Front Door is the global entry point, then traffic flows to Application Gateway (WAF), which routes to the cluster via the Ingress Controller (AGIC).

3. Traffic flow:

User (Browser)
  → Azure DNS
  → Azure Front Door (CDN/WAF/global routing)
  → Application Gateway (WAF, SSL termination)
  → Ingress Controller (inside AKS)
  → Kubernetes Services
  → Pods (30+ microservices)
  → PostgreSQL (operational DB)

4. CI/CD (Azure DevOps): On pipeline trigger → code passes DevSecOps checks → Docker image built and pushed to ACR → deployed to AKS via Helm/kubectl. Secrets come from Azure Key Vault using service connections and Managed Identity — never stored in code.

5. Monitoring: Prometheus + Grafana inside the cluster for metrics; logs and operational DB monitoring for health and troubleshooting.

6. DR/Backup: VM snapshots, Azure Backup, and Site Recovery for critical components.

Environments: Dev → Stage → Production with separate namespaces/subscriptions and promotion via Azure DevOps pipelines."


4. Which Azure services have you worked on extensively?

Senior Answer (8–9 Years)

"At HHS, I've worked hands-on across the full Azure stack for a healthcare platform with 30+ microservices:

Compute & containers: AKS (private clusters, node pools, upgrades), Azure VMs (early phase + GitLab), Azure Functions (VM auto-shutdown for cost)

Networking: VNet, subnets, NSGs, route tables, Azure Front Door, Application Gateway (WAF), AGIC, Private Endpoints, Private DNS zones, Azure Firewall, Azure Bastion

DevOps & registry: Azure DevOps (CI/CD), ACR, Argo CD (GitOps), Terraform (IaC)

Security & identity: Key Vault, Microsoft Entra ID, Managed Identity, Workload Identity Federation, Azure RBAC, Azure Policy, Defender for Cloud

Monitoring: Azure Monitor, Log Analytics, Container Insights, Application Insights — plus Prometheus/Grafana/OpenTelemetry on AKS

Storage & backup: Blob Storage, managed disks, Azure Backup, Recovery Services Vault, Azure Site Recovery

Data services: Azure Database for PostgreSQL, MongoDB Atlas, Azure Cache for Redis Premium

I provision and operate most of these via Terraform — not ad-hoc portal clicks."

Important Points

  • Group by category (compute, network, security, data) — shows breadth
  • Lead with AKS — your core strength
  • Mention Terraform — shows IaC discipline

Likely Follow-Up Questions

  • Which service do you know deepest? (AKS + networking)
  • How do Private Endpoints work with AKS?
  • Difference between Front Door and Application Gateway?

5. Have you managed Azure VMs, Storage Accounts, and Backup services?

Senior Answer (8–9 Years)

"Yes, I have managed all three extensively — especially in our early VM-based phase before we moved to AKS.

Azure VMs:

  • Hosted applications directly on VMs and self-hosted GitLab on a dedicated VM
  • Ran Docker workloads on VMs before containerizing to AKS
  • Built Azure Functions to automatically shut down dev/test VMs during off-peak hours (nights/weekends) for cost savings
  • Connected via Azure Bastion — no public RDP/SSH

Backup & Snapshots:

  • Configured daily VM backups via Azure Backup — e.g., GitLab VM backed up daily after business hours
  • Took disk snapshots before changes and for restore drills
  • Stored backups in Recovery Services Vault with defined retention

Storage Accounts:

  • Used Azure Blob Storage for application data and file storage
  • Stored Terraform remote state in Azure Blob — team-shared, version-controlled infrastructure
  • Configured appropriate access with RBAC and private access where needed

So yes — full lifecycle: provision VMs → backup/snapshot → cost optimize → migrate to AKS when ready."


6. Have you configured Disaster Recovery or Azure Backup?

Senior Answer (8–9 Years)

"Yes — I've configured both Azure Backup and Azure Site Recovery (ASR), plus AKS-specific backup with Velero.

Azure Backup (VMs & Disks):

  • Point-in-time restore with scheduled backup policies
  • Retention: typically 30–90 days based on compliance and RPO requirements
  • Frequency: weekly by default; daily or hourly for high-change or critical workloads
  • Stored in Recovery Services Vault (Microsoft-managed — not a Storage Account we pick)
  • Used for VMs (e.g., self-hosted GitLab) and managed disk backups for StatefulSet PVs

Azure Site Recovery (DR):

  • Replicates VMs from primary region to secondary region
  • On disaster → test failover or committed failover → services continue in DR region
  • Combined with Azure Front Door / DNS cutover for traffic routing

AKS Backup (Velero):

  • Namespace-level incremental backups — twice weekly
  • Backup storage in Azure Blob Storage
  • Can restore namespaces to same or another region
  • PV/PVC (StatefulSets): disk-level backup via Azure Backup on underlying managed disks

Key Vault: Yes — Key Vault supports backup and restore of secrets, keys, and certificates (az keyvault backup). We also enable soft delete and purge protection for compliance.

Full DR strategy: Terraform recreates infrastructure in DR region → restore Velero (K8s resources) from Blob → restore disks/VMs from Azure Backup → ASR failover for replicated VMs → DNS/Front Door cutover."


7. Have you created or managed AKS clusters?

Senior Answer (8–9 Years)

"Yes — AKS is my primary platform at HHS. I own full cluster lifecycle for 30+ microservices across dev, stage, and production.

Provisioning: Clusters provisioned via Terraform — private cluster, VNet-integrated, Azure CNI, separate system and user node pools.

Day-to-day operations:

  • Cluster and node pool upgrades (Kubernetes version, node image)
  • Node pool management — standard pools for apps, Spot pools for CI/batch
  • Ingress via AGIC + Application Gateway + Front Door
  • Add-ons: Prometheus/Grafana, Velero, Key Vault CSI, OpenTelemetry collector

Security & identity:

  • Private cluster — API server not public
  • Workload Identity Federation + Managed Identity for ACR pull and Key Vault access
  • Network policies, Azure Policy add-on, Pod Security Standards

Deployments: Helm via Azure DevOps pipelines + Argo CD GitOps sync

Backup: Velero namespace backups to Blob; disk-level backup for StatefulSet PVs

Troubleshooting: CrashLoopBackOff, ImagePullBackOff, HPA/Cluster Autoscaler tuning, performance issues — daily operational work."

Important Points

  • Private cluster + Terraform = senior signals
  • Separate system vs user node pools
  • Identity: Workload Identity not service principal secrets

Likely Follow-Up Questions

  • How do you upgrade AKS without downtime?
  • System vs user node pool — why separate?
  • How do pods pull images from ACR without credentials?

8. What are Pods, Deployments, Services, and Namespaces?

Senior Answer (8–9 Years)

  • Pod: Smallest deployable unit in Kubernetes. One or more containers sharing network and storage. Ephemeral — if it dies, Kubernetes creates a new one.

  • Deployment: Controller that manages pod replicas, rolling updates, and rollbacks. You declare desired state (e.g., 3 replicas); Deployment keeps it.

  • Service: Stable network endpoint for pods (which have changing IPs). Routes traffic via label selectors. Types:

    • ClusterIP — internal only (most common)
    • NodePort — exposes on node port
    • LoadBalancer — external access via cloud LB (e.g., Azure Load Balancer)
  • Namespace: Logical isolation within a cluster — separate dev/staging/prod, RBAC, quotas, and network policies.

One-liner: Pod runs containers → Deployment manages pods → Service gives stable address → Namespace isolates workloads.

Bonus (if asked): StatefulSet for stateful apps like databases — stable pod names (pod-0, pod-1), ordered deployment, and PersistentVolumes for storage."


9. Which CI/CD tools have you used?

Senior Answer (8–9 Years)

"My primary CI/CD platform is Azure DevOps — pipelines for 30+ microservices at HHS. I also have experience with Jenkins (OWASP ZAP DAST integration) and Argo CD for GitOps.

Azure DevOps (main):

  • CI: Build → test → GitLeaks (secrets) → Trivy (vulnerabilities) → SonarQube (quality gate) → Docker build → push to ACR
  • CD: Promote same image tag (buildId + git SHA) through Dev → Stage → Prod with manual approval on production
  • Deploy to AKS via Helm; service connections use Workload Identity Federation

Argo CD (GitOps):

  • Watches Git repo for manifest/Helm value changes → auto-syncs to cluster
  • Separates what runs (Git) from how it's built (pipeline)

Terraform (IaC pipeline):

  • Separate pipeline — terraform plan on PR, terraform apply on merge
  • Provisions VNet, AKS, Key Vault, App Gateway, etc.

Jenkins: Used for application DAST scanning with OWASP ZAP and reporting to dev team.

Image tagging rule: Never latest in production — immutable tags by git SHA."

Important Points

Tool Role
Azure DevOps CI/CD — build, scan, deploy
Argo CD GitOps — cluster sync from Git
Terraform IaC — infrastructure pipeline
Jenkins DAST / legacy pipelines

Likely Follow-Up Questions

  • Helm vs Argo CD — when use which?
  • What happens if SonarQube gate fails?
  • How do you secure pipeline service connections?

10. Explain your deployment process from Dev to Production.

Senior Answer (8–9 Years)

"We follow a promotion-based pipeline across Dev → Stage → Production with separate namespaces and environment-specific configuration.

1. Developer flow:

  • Feature branch → PR → merge to develop
  • CI pipeline triggers automatically

2. CI (build once, deploy many):

  • Checkout → GitLeaksTrivy filesystem scan → build → unit tests → SonarQube quality gate
  • Docker build → Trivy image scan → push to ACR with tag buildId-gitSHA
  • Same image is promoted — never rebuild per environment

3. CD — environment promotion:

  • Dev: Auto-deploy on CI success — smoke test
  • Stage/QA: Integration tests, regression, performance checks
  • Production: Manual approval gate → deploy via Helm to AKS
  • Argo CD syncs Git-managed manifests where GitOps is enabled

4. Configuration per environment:

  • Helm values / ConfigMaps differ per env (replicas, endpoints, feature flags)
  • Secrets from Key Vault via CSI driver — never in Git or pipeline YAML

5. Post-deploy verification:

  • Health probes pass, Grafana dashboards clean, smoke test
  • Rollback ready: helm rollback or redeploy last-known-good ACR tag

6. Infrastructure changes:

  • Terraform in separate pipeline — plan on PR, apply on merge — never manual portal changes

Gov healthcare: Change windows, client notification for prod releases, audit trail in Azure DevOps."

Important Points

  • Build once, promote same tag — critical senior principle
  • Manual approval for production
  • Secrets from Key Vault, not pipeline variables
  • Terraform separate from app CD

Likely Follow-Up Questions

  • How do you handle database migrations across environments?
  • What if Stage passes but Prod fails?
  • How do you rollback production?

11. What are VNets and NSGs?

Senior Answer (8–9 Years)

VNet (Virtual Network): A private, logically isolated network in Azure. You define an IP address range (CIDR) — e.g., 10.0.0.0/16 — and divide it into subnets. Resources like AKS nodes, VMs, and Application Gateway get private IPs from these subnets. VNets can peer with other VNets and connect on-premises via VPN/ExpressRoute.

NSG (Network Security Group): A firewall rule set that controls inbound and outbound traffic. Attached to a subnet or network interface (NIC). Rules define source/destination IP, port, protocol, and allow/deny — evaluated by priority. Example: allow 443 from App Gateway subnet; deny all other inbound.

Together: VNet = network layout; NSG = traffic filtering. In our HHS project, AKS runs in a private subnet with NSGs controlling what traffic can reach the cluster.


12. What is the difference between Public and Private Endpoints?

Senior Answer (8–9 Years)

"Public Endpoint: The Azure PaaS service is reachable over the public internet via a public hostname/IP — e.g., myaccount.blob.core.windows.net. Traffic traverses the internet. You rely on firewall rules, SAS tokens, or service-level auth.

Private Endpoint: A private IP from your VNet subnet is assigned to the PaaS service via Azure Private Link. Traffic stays on the Microsoft backbone — never exposed to the public internet. DNS resolves to the private IP via a Private DNS zone.

When we use Private Endpoints (production at HHS):

  • ACR — pods pull images without public registry access
  • Key Vault — secrets accessed only from VNet
  • Blob Storage, PostgreSQL, Redis Premium — no public exposure

Key differences:

Public Endpoint Private Endpoint
Network path Internet Microsoft backbone → VNet
IP Public Azure IP Private IP in your subnet
DNS Public FQDN Private DNS zone required
Security Firewall + auth rules Network isolation by default
Use case Dev/test, public APIs Production, compliance (healthcare)

Service Endpoint vs Private Endpoint: Service Endpoint routes VNet traffic to Azure service but the service still has a public endpoint. Private Endpoint is stronger isolation — dedicated private IP per service.

At HHS, all production PaaS uses Private Endpoints — required for government healthcare compliance."

Important Points

  • Private Endpoint = Private Link + Private DNS zone
  • Production healthcare → Private Endpoints on all PaaS
  • Don't confuse Service Endpoint with Private Endpoint

Likely Follow-Up Questions

  • How does AKS pod resolve Key Vault DNS with Private Endpoint?
  • Can you use both public and private on same storage account?
  • What is Private Link?

13. Which monitoring tools have you used?

Senior Answer (8–9 Years)

"We use a three-pillar observability stack on Azure — metrics, logs, and traces:

Metrics — Prometheus + Grafana:

  • Prometheus scrapes AKS pod/node metrics (CPU, memory, restarts, HTTP rates)
  • Grafana dashboards per team/service with alert rules — e.g., pod restart count, latency P95, node disk pressure
  • Alerts route to email/Teams; on-call checks dashboard and logs

Logs — Elasticsearch + Loki:

  • Loki for Kubernetes pod logs; Elasticsearch for search and log parsing
  • Track application events, errors, and audit trails for gov compliance

Traces — OpenTelemetry:

  • OTel SDK in microservices exports traces end-to-end
  • Identify latency bottlenecks across service dependencies (API → DB → cache)

Azure-native (platform layer):

  • Azure Monitor + Log Analytics for platform logs and KQL queries
  • Container Insights for AKS node/pod health
  • Application Insights for APM where enabled
  • Azure Service Health — rule out regional outages first

At HHS: On alert fire → check Grafana → drill Loki/Elasticsearch logs → OTel trace for latency → kubectl if K8s issue. MTTR improved by correlating all three pillars.

Innovation (if asked): We piloted a multi-agent LLM integration (Gemini/OpenAI API) that sends alert context + logs to the model and returns RCA hints and remediation steps — assists engineers but does not replace human approval for prod changes."

Mathew's Answer — What You Said

Tool Your use
Prometheus Metrics collection
Grafana Dashboards + alerts
OpenTelemetry Distributed traces, latency
LLM multi-agent RCA + remediation advice from logs

14. How do you troubleshoot production issues?

Senior Answer (8–9 Years)

"Production issues are alert-driven. Grafana/Azure Monitor alerts fire for pod problems — CrashLoopBackOff, ImagePullBackOff, Evicted, Pending — or application-down alerts.

My 6-step process:

  1. Triage — Check priority (P0/P1/P2), blast radius, how many users/states affected
  2. Correlate — Azure Service Health, recent deployments, config/Key Vault changes
  3. Observe — Grafana dashboards → Loki/Log Analytics logs → OpenTelemetry traces
  4. Isolatekubectl describe pod, kubectl get events, kubectl logs --previous
  5. Mitigate first — rollback Helm release, fix image tag, scale nodes — restore service before root cause
  6. Document — Update ticket with RCA, close after verification, post-mortem for P0/P1

Example — ImagePullBackOff on AKS:

  • kubectl describe pod → check Events section
  • Common causes: (1) wrong image tag not in ACR (2) AcrPull role missing on Managed Identity (3) auth to private ACR failed (4) PE/DNS issue
  • Fix: verify tag in ACR, check AcrPull RBAC, verify Workload Identity / service account, re-deploy

Example — App-to-app connectivity failure:

  • kubectl exec → ping service DNS name (not IP — IPs change)
  • If network OK → check environment variables, ConfigMaps, Secrets from Key Vault CSI
  • Check namespace ResourceQuota — pod may be Pending if quota exceeded

For gov healthcare: Communicate status to client on P0/P1 every 15–30 min until resolved."


🎫 Incident Priority Guide (P0 / P1 / P2 / P3) — Learn This

Companies vary slightly — this is the standard enterprise model interviewers expect:

Priority Also called Meaning Response time Who acts Example (HHS healthcare)
P0 Sev-1 / Critical Complete outage — production down, no workaround Immediate (< 15 min) All hands, incident commander, war room All state portals down, PHI system unreachable
P1 Sev-2 / High Major degradation — core feature broken, large user impact < 30–60 min On-call + platform lead + dev lead One state can't submit claims; 50% error rate
P2 Sev-3 / Medium Partial impact — workaround exists, limited users < 4 hours On-call engineer Single microservice down, other states OK
P3 Sev-4 / Low Minor — cosmetic, dev/test only, no prod impact Next business day Assign to backlog Grafana dashboard broken, non-prod pod crash

15. Which databases have you worked with?

Senior Answer (8–9 Years)

"I've worked with several databases across our projects:

Database Use case Environment pattern
MongoDB Primary app datastore (HHS) Self-hosted in Dev/QA (cost savings); MongoDB Atlas (or managed) in Production
PostgreSQL Operational/relational data Azure Database for PostgreSQL Flexible Server — used for prod→stage data restore pipelines
Azure SQL Microsoft SQL workloads Transactional apps, HA with zone redundancy
Cosmos DB NoSQL / globally distributed Low-latency, multi-region scenarios

Environment strategy: Lower environments use lighter/cheaper DB (self-hosted MongoDB on AKS/VM) — production uses fully managed PaaS with HA. We don't replicate prod infrastructure in Dev/QA for cost optimization.

HA & DR (platform layer — my responsibility):

  • Enable zone-redundant HA on Azure PostgreSQL/SQL where required
  • Geo-redundant backup and point-in-time restore (PITR) on PaaS databases
  • Scheduled logical backups (pg_dump / mongodump) to Blob Storage for extra protection and cross-region recovery
  • Private Endpoints for DB connectivity from AKS — no public exposure

I'm not a full-time DBA — I own provisioning, connectivity, backup, HA config, and restore pipelines; DBAs/dev teams own query tuning."


16. What activities do you perform in database administration?

Senior Answer (8–9 Years)

"We had a dedicated DBA team, but as a platform engineer I owned the infrastructure and operational side of databases:

Provisioning & configuration:

  • Created and configured PostgreSQL Flexible Server, Cosmos DB, and MySQL via Terraform
  • Sized instances based on workload — vCores, storage, HA mode (zone-redundant)
  • Enabled Private Endpoints, disabled public access, configured Private DNS Zones

Connectivity & identity:

  • Troubleshot connection issues between AKS pods and databases
  • Set up Managed Identity and service accounts for apps to authenticate without passwords in code
  • Stored connection strings in Key Vault; apps pull via Key Vault CSI driver

Backup & environment sync:

  • Built Azure DevOps pipelines to dump non-critical prod data and restore to Dev/Stage — so developers work with realistic data while masking sensitive fields
  • Enabled native PITR on PaaS DBs; additional pg_dump/mongodump to Blob for pipeline-driven restores

Performance & cost (with DBA team):

  • Monitored vCore/DTU utilization, connection count, slow query logs in Azure Portal
  • Right-sized instances based on metrics — scale up/down for cost optimization
  • Escalated query tuning and indexing to DBAs — my role stops at platform layer

Boundary: I own infra, access, backup, connectivity, sizing — DBAs own schema, queries, indexes."


17. Have you handled production incidents?

Senior Answer (8–9 Years)

"Yes — regularly. At HHS, I was on-call for a healthcare platform serving US state governments. Incidents are alert-driven and prioritized P0/P1/P2/P3.

My incident process:

  1. Triage — severity, blast radius, users affected
  2. War room — Teams channel for P0/P1, assign incident commander
  3. Mitigate first — rollback, scale, fix config — restore service before deep RCA
  4. Investigate — Grafana → logs → OTel traces → kubectl describe/logs
  5. Communicate — client status updates every 15–30 min on P0/P1
  6. Post-mortem — RCA document, action items, alert tuning

Real example — ImagePullBackOff (P1):

  • Situation: After Terraform apply, multiple pods entered ImagePullBackOff — microservices couldn't pull from ACR
  • Action: kubectl describe pod → missing AcrPull role on Managed Identity after identity recreation → fixed RBAC in Terraform → redeployed
  • Result: Restored in ~10 minutes; added validation for MI permissions post-Terraform changes

Other incidents handled: CrashLoopBackOff (bad config/secret), pod eviction (resource pressure), connectivity failures after Key Vault rotation, slow AKS apps (DB/ingress bottlenecks).

Follow-the-sun: Onshore US + offshore India — handoff notes in ticket, clear ownership per incident."

Important Points

  • Mitigate first, RCA later — senior signal
  • Use real STAR example (ImagePullBackOff)
  • Gov healthcare → client communication on P0/P1

Likely Follow-Up Questions

  • Walk me through your ImagePullBackOff incident (STAR)
  • P1 vs P2 — how do you decide?
  • What goes in a post-mortem?

18. Have you coordinated with clients and cross-functional teams?

Senior Answer (8–9 Years)

"Yes — extensively. At HHS, we support US state government clients (Florida, Iowa, South Dakota) with a follow-the-sun modelonshore US team + offshore India team.

Cross-functional coordination:

  • Onshore (US): client-facing, business context, production access, stakeholder updates
  • Offshore (India): platform engineering, AKS ops, CI/CD, Terraform, incident troubleshooting
  • Developers: bug identification, app logs, env var / config fixes
  • DBA team: database performance, schema, data masking
  • Security: VA/PT remediation, compliance for gov healthcare

How we work:

  • Join Teams war rooms for P0/P1 — even if not our component, we contribute until resolved
  • Status updates every 15 minutes during critical incidents
  • I have led incident bridge calls — coordinate onshore + offshore + dev
  • For new infrastructure or cost optimization — we propose, onshore validates with client
  • Release windows coordinated across time zones — US team approves prod changes

Example: During ImagePullBackOff P1, I led troubleshooting on the bridge call — offshore fixed Terraform RBAC while onshore updated the client every 15 min until resolved in 10 minutes.

Strong rapport built over years — clear communication is as important as technical skills in gov healthcare."

SECTION 2: DEEP-DIVE TECHNICAL QUESTIONS


19. Explain the architecture of your current Azure environment.

Senior Answer (8–9 Years)

Overview: HHS healthcare platform — 30+ microservices on private AKS, provisioned via Terraform, US state government clients.

Traffic flow:

User (Browser → URL)
  → Azure DNS
  → Azure Front Door (global routing to nearest healthy region)
  → Application Gateway (WAF + SSL)
  → AGIC (Application Gateway Ingress Controller)
  → Ingress → Services → Pods
  → Databases (PostgreSQL / MongoDB)

Networking: Everything in a private VNet with dedicated subnets — AKS nodes, App Gateway, Private Endpoints in separate subnets. No public exposure on AKS. Azure Bastion for secure engineer access (no public SSH/RDP).

Integrations: ACR + Key Vault via Managed Identity and Workload Identity Federation. Private Endpoints on PaaS services. Azure DevOps CI/CD → ACR → Helm → AKS.

Observability: Prometheus, Grafana, OpenTelemetry, Loki/Elasticsearch.

Hub-spoke: Our current project uses a single VNet architecture (not hub-spoke). I understand hub-spoke conceptually — hub centralizes firewall/VPN/Bastion; spokes connect via VNet peering; on-premises via ExpressRoute/VPN to hub. Ready to implement if required."


20. How do you troubleshoot a VM that is suddenly unreachable?

Senior Answer (8–9 Years)

"I follow a layered checklist — platform first, then network, then OS:

1. Azure platform (Portal):

  • VM power state — running or stopped?
  • Resource Health + Azure Service Health — region outage?
  • Activity Log — recent NSG change, patch, reboot?

2. DNS (if using hostname):

  • nslookup — does FQDN resolve to correct public IP?
  • Verify Azure DNS A-record matches assigned IP

3. Networking:

  • NSG rules — inbound 22/3389 or app port allowed?
  • ASG (Application Security Group) rules if used
  • Azure Firewall rules if traffic routes through firewall
  • Effective routes — traffic going where expected?
  • Public IP assigned and associated to NIC?
  • Network Watcher → Connection Troubleshoot / IP Flow Verify

4. Access VM (when network looks OK):

  • Azure Bastion — connect without public IP
  • Serial Console + Boot Diagnostics — check boot logs if SSH/RDP fails (disk full, fstab error, kernel panic)
  • VMAccess extension — reset SSH key or local admin password

5. Metrics (Azure Monitor):

  • CPU, memory, disk space (>90% can cause failures), network I/O
  • High CPU or full disk often explains "unreachable" symptoms

6. Recovery:

  • Restore from Azure Backup snapshot
  • Attach OS disk to rescue VM if corrupted
  • Redeploy VM keeping data disk

Root cause examples I've seen: NSG rule change, disk full, failed patch, wrong DNS mapping, expired SSH key."


21. How would you perform DR testing in Azure?

Senior Answer (8–9 Years)

"DR testing validates our RTO/RPO commitments — we run both tabletop exercises and technical failover tests on a scheduled cadence (quarterly/annually per compliance).

1. Plan & scope:

  • Define test scenario — regional outage, datacenter failure, ransomware recovery
  • Document expected RTO (how fast we recover) and RPO (how much data we can lose)
  • Notify stakeholders — test window, expected impact (ideally non-prod first)

2. Azure Site Recovery (ASR) — VM DR test:

  • Run Test Failover (non-destructive) — VMs spin up in DR region on isolated network
  • Validate application starts, connectivity, data consistency
  • Cleanup test failover — does not affect production replication
  • Document actual RTO vs target

3. Azure Backup restore test:

  • Restore a VM or managed disk from Recovery Services Vault to isolated resource group
  • Verify data integrity and restore time
  • Test Key Vault backup/restore for secrets recovery

4. AKS DR test (Velero):

  • Restore namespace from Velero backup in Blob to same or DR cluster
  • Validate pods, services, ingress come up
  • Test StatefulSet PV restore from disk-level Azure Backup

5. Infrastructure DR test (Terraform):

  • terraform apply in DR region from version-controlled code
  • Validates IaC can recreate full environment

6. Traffic cutover test:

  • Front Door / DNS failover to DR region endpoint
  • Validate end-to-end user flow

7. Post-test:

  • Document results, gaps, action items
  • Update runbooks with actual timings
  • Report to compliance/client for gov healthcare audit trail

At HHS: Combined ASR test failover + Velero namespace restore + Terraform DR apply — full stack validation."

Important Points

Test type Tool Non-destructive?
VM failover ASR Test Failover Yes — isolated network
K8s restore Velero Yes — restore to test cluster
Disk/VM restore Azure Backup Yes — isolated RG
Infra recreate Terraform Yes — DR region/subscription

Likely Follow-Up Questions

  • Difference between ASR test failover and committed failover?
  • What is RTO vs RPO?
  • How do you DR test AKS without affecting production?

22. Explain the complete AKS architecture.

Senior Answer (8–9 Years)

1. Control Plane (Microsoft-managed — free in standard tier):

  • API Server — all kubectl/Helm commands go here; stores cluster state
  • etcd — distributed key-value store for cluster state
  • Scheduler — assigns pods to nodes
  • Controller Manager — maintains desired state (replicas, endpoints)
  • We never manage these — Azure handles HA, patching, upgrades

2. Node Pools (we manage — VM Scale Sets):

  • System node pool — runs kube-system pods (CoreDNS, metrics-server, monitoring agents). Keep stable; don't run app workloads here
  • User node pools — segregated by workload:
    • Normal workloads — standard VM SKUs
    • Spot node pool — cheaper, fault-tolerant batch/CI jobs
    • GPU node pool — ML/GPU workloads (if needed)
  • Each node runs: kubelet, kube-proxy, containerd (container runtime)

3. DaemonSets on every node:

  • CNI plugin (Azure CNI) — assigns pod IPs from VNet subnet; VNet integration
  • CSI drivers — connect pods to Azure services:
    • Disk CSI → Azure Managed Disks (PVs)
    • Key Vault CSI → secrets as volumes
    • ACR pull via Managed Identity (not CSI but identity layer)

4. Networking & Ingress:

  • Private AKS in VNet subnets — no public API or nodes
  • AGIC — Application Gateway Ingress Controller routes external traffic to pods
  • Network policies — restrict pod-to-pod traffic

5. Identity & integrations:

  • Managed Identity on cluster + Workload Identity Federation for pods
  • ACR image pull, Key Vault secrets — no credentials in code

6. Add-ons we run: Prometheus/Grafana, Velero, AGIC, Key Vault CSI, OpenTelemetry collector"


23. How do you perform AKS cluster upgrades?

Senior Answer (8–9 Years)

Pre-upgrade (planning):

  1. Review Kubernetes release notes — deprecated APIs, breaking changes
  2. Check compatibility: Helm charts, Argo CD, ingress (AGIC), monitoring addons
  3. Run kubectl convert / API deprecation checks on manifests
  4. Confirm application runtimes support target K8s version — coordinate with developers
  5. Upgrade non-prod first; wait 1–2 weeks before production

Backup before upgrade:

  • Velero — namespace-level backup to Blob Storage
  • Disk snapshots — PVs / StatefulSets
  • Terraform (IaC) ready — can spin up new cluster in same or DR region if catastrophic failure
  • Drift detection pipeline — runs frequently; if infra drift detected, reconcile Terraform code before upgrade

Upgrade execution (order matters):

  1. Control plane firstaz aks upgrade --kubernetes-version x.y.z — API server briefly unavailable
  2. Node pools one by one — never all at once
    • Check PodDisruptionBudgets (PDB) and replica counts before drain
    • Use surge settings (maxSurge: 33%) — cordon → drain → new nodes → verify pods healthy
    • Only proceed to next node pool when all pods are running
  3. Add-ons / plugins last — AGIC, Key Vault CSI, Prometheus, Velero — upgrade to compatible versions

Post-upgrade validation:

  • kubectl get nodes — all Ready on new version
  • Smoke tests, Grafana dashboards clean, no CrashLoopBackOff
  • Rollback plan: new node pool on old K8s version + migrate workloads if needed

Azure supports last 3 K8s versions — upgrade within support window before version goes end-of-life."


24. How do you troubleshoot pods in CrashLoopBackOff state?

Senior Answer (8–9 Years)

Alert fires in Grafana — pod restart count or CrashLoopBackOff state.

Step 1 — Describe pod (always first):

kubectl describe pod <pod-name> -n <namespace>
kubectl get events -n <namespace> --sort-by='.lastTimestamp'

Check Events section — exit code, OOMKilled, probe failure, CreateContainerConfigError

Step 2 — Previous container logs (if describe unclear):

kubectl logs <pod-name> -n <namespace> --previous

--previous is critical — current container may not have logs yet; previous shows why last crash happened

Step 3 — Identify root cause and fix:

Cause What to check Fix
App crash --previous logs, stack trace Fix code/config, rollback Helm release
Missing env vars / secrets describe → CreateContainerConfigError ConfigMap, Secret, Key Vault CSI mount
OOMKilled describe → OOMKilled, kubectl top pod Increase memory limits, fix memory leak
Liveness probe too aggressive describe → probe failures Increase initialDelaySeconds, failureThreshold
Wrong image / tag describe → ErrImagePull (different from CrashLoop) Fix image tag in deployment
Init container failure describe → init container status kubectl logs <pod> -c <init-container>
Permission / securityContext describe → permission denied Fix runAsUser, fsGroup, volume mounts

Step 4 — Mitigate: helm rollback or fix manifest → redeploy → verify pod Running in Grafana

Real example (Q17): ImagePullBackOff was auth issue — different state but same describe/logs workflow."


25. Explain HPA and Cluster Autoscaler.

Senior Answer (8–9 Years)

HPA (Horizontal Pod Autoscaler):

  • Scales number of pods in a Deployment/StatefulSet horizontally
  • Metrics Server (or Prometheus adapter) collects CPU/memory metrics
  • HPA compares average metric vs target threshold — e.g., CPU > 70%
  • Scales up (add pods) or down (remove pods) automatically
  • Scope: single workload (one Deployment)

Cluster Autoscaler:

  • Scales number of nodes in a node pool
  • Triggers when pods are Pending — cannot be scheduled because no node has enough CPU/memory
  • Adds nodes to node pool; removes underutilized nodes when pods can be rescheduled elsewhere
  • Scope: entire node pool

How they work together:

Traffic spike → HPA adds more pods
    → Pods go Pending (no node capacity)
    → Cluster Autoscaler adds nodes
    → Pods scheduled → service handles load

Bonus: KEDA for event-driven scaling — e.g., Azure Service Bus queue depth, HTTP rate — beyond CPU/memory."

Important Points

  • HPA scales pods; Cluster Autoscaler scales nodes — different layers
  • HPA uses Metrics Server (CPU/memory) or Prometheus adapter / KEDA for custom metrics
  • Cluster Autoscaler triggers on Pending pods (unschedulable), not node CPU utilization graphs
  • Flow: traffic ↑ → HPA adds pods → pods Pending → CA adds nodes
  • Configure minReplicas/maxReplicas (HPA) and min-count/max-count (node pool + CA)
  • KEDA — event-driven scale (Azure Service Bus queue depth, HTTP rate)

Likely Follow-Up Questions

Follow-up Answer
HPA not scaling? Metrics Server missing, wrong apiVersion, requests/limits not set on pods
CA not adding nodes? Pod limits, max node count reached, CA not enabled on pool
HPA vs VPA? HPA = horizontal (more pods); VPA = vertical (bigger pod resources)

26. A production application is running slow in AKS. How will you troubleshoot it?

Senior Answer (8–9 Years)

1. Triage — define scope:

  • All users or one state? All APIs or one endpoint? When did it start? Recent deployment?

2. Observability (your approach — expand):

  • Grafana dashboards — latency P95/P99, error rate, pod restart spikes
  • kubectl top pods — CPU/memory per pod; throttling?
  • kubectl top nodes — node pressure?
  • Namespace + pod events: kubectl get events -n <ns> --sort-by='.lastTimestamp'
  • Pod logs: kubectl logs <pod> -f — errors, timeouts
  • OpenTelemetry traces — which dependency is slow? (API → DB → cache)

3. Layer by layer:

Layer Check Fix
Ingress App Gateway / Front Door latency, 5xx WAF rule, backend pool health
Pods CPU/mem high → resource issue Scale HPA, increase limits, add nodes (Cluster Autoscaler)
App Logs — exceptions, thread pool exhaustion Rollback deploy, fix code
Database PostgreSQL vCore/connections, slow queries Scale DB, connection pool, escalate to DBA
Cache Redis hit ratio, evictions Increase Redis memory
Network Private Endpoint DNS latency Fix DNS, NSG
External API OTel dependency span slow Contact vendor, add timeout/circuit breaker

4. Mitigate first: Scale HPA, add nodes, rollback bad Helm release — restore SLA, RCA after.

5. Your quick path (valid for interview start):

Grafana alert → kubectl top pod → events → logs
  → CPU/mem high? → HPA scale / increase limits
  → DB slow in traces? → check PostgreSQL metrics
  → Recent deploy? → helm rollback
```"

Important Points

  • Start with scope (who, when, which API) and correlate with deploy
  • Grafana + kubectl top + events + logs + OpenTelemetry traces
  • Slowness is often database or ingress — not always pod CPU
  • Mitigate first: HPA scale, helm rollback, DB scale — then RCA

Likely Follow-Up Questions

Follow-up Answer
How find slow dependency? OpenTelemetry trace waterfall — longest span
CPU normal but slow? Check DB connections, external APIs, thread pool

27. Explain how VNets, Subnets, NSGs, and Route Tables work together.

Senior Answer (8–9 Years)

  • VNet: Top-level network boundary (e.g., 10.0.0.0/16)
  • Subnet: Segments within VNet (e.g., 10.0.1.0/24 for web, 10.0.2.0/24 for data). Resources get IPs from subnet range.
  • NSG: Attached to subnet or NIC — filters traffic (allow/deny) based on 5-tuple rules. Evaluated by priority.
  • Route Table (UDR): Attached to subnet — controls where traffic is routed (default system routes vs. custom: force tunnel through Firewall, route to NVA, BGP from ExpressRoute).

Flow example: Pod in AKS → subnet route table sends egress to Azure Firewall → NSG on subnet allows outbound 443 → Firewall applies application rules → internet.

Effective routes + effective security rules in Portal show the actual result of all combined rules.


Important Points

  • VNet = address space; Subnet = segment; resources get IPs from subnet
  • NSG = allow/deny traffic (L4); attached to subnet or NIC; rules by priority (5-tuple)
  • Route Table (UDR) = where packets go; overrides system routes
  • NAT Gateway / Firewall = outbound internet for private resources (SNAT / inspection)
  • NSG filtersRoute Table routes — do not confuse the two
  • Evaluate Effective routes + Effective security rules in Portal

Inbound: Browser → DNS → Front Door → App Gateway → AGIC → Ingress → Service → Pod

Outbound: Pod → Route Table → NSG → Firewall/NAT Gateway → Internet (NOT through AGIC)

Likely Follow-Up Questions

Follow-up Answer
NSG vs Firewall? NSG = distributed L4; Firewall = centralized L3–L7 + threat intel
AKS pod egress? Azure CNI assigns pod IPs — UDR on pod subnet controls egress path

28. Explain ExpressRoute and VPN connectivity.

Senior Answer (8–9 Years)

Aspect Site-to-Site VPN ExpressRoute
Path Encrypted over public internet (IPsec/IKE) Private dedicated connection via connectivity provider
Bandwidth Up to ~1.25 Gbps (VpnGw scale) 50 Mbps to 100 Gbps
Latency Variable Consistent, lower
Cost Lower Higher (provider + port fees)
Use case Branch offices, dev/test, backup Enterprise production, compliance, high throughput

"In hybrid setups I often use ExpressRoute as primary and VPN as backup (ExpressRoute + VPN coexistence). BGP propagates on-premises routes into Azure VNet. For hub-spoke, route tables in spokes point to hub NVA/Firewall for inspection."

Important Points

  • VPN (Site-to-Site): Public internet + IPsec/IKE tunnel → VPN Gateway on VNet
  • ExpressRoute: Private dedicated circuit via connectivity providerExpressRoute Gateway on VNet
  • Do not mix gateways — VPN Gateway ≠ ExpressRoute Gateway
  • VPN: lower cost, lower bandwidth (~1.25 Gbps max), good for branch/dev/failover
  • ExpressRoute: higher cost, 50 Mbps–100 Gbps, consistent latency, compliance/production
  • Together: ExpressRoute primary + VPN backup if ExpressRoute fails
  • BGP exchanges routes between on-prem and Azure automatically

Likely Follow-Up Questions

Follow-up Answer
ExpressRoute vs VPN for healthcare gov? ExpressRoute for prod/compliance; VPN as backup
What is BGP? Dynamic route exchange — on-prem networks advertised to Azure and vice versa

29. Explain your Azure DevOps pipeline architecture.

Senior Answer (8–9 Years)

Our Azure DevOps architecture has CI and CD separated.

CI: Trigger on merge to develop → checkout → GitLeaksTrivy FS → build → unit tests → SonarQube quality gate → Docker build → Trivy image scan → push to ACR with tag (buildId + git SHA).

CD: Promote same image tag through Dev (auto) → QA (tests) → Prod (manual approval). Deploy to AKS via Helm. Service connections use Workload Identity Federation — no secrets in YAML.

Argo CD handles GitOps where Helm values/manifests in Git auto-sync to cluster. Terraform runs in a separate pipeline for infrastructure.

Important Points

  • CI = build, test, scan, publish image | CD = deploy per environment
  • Never deploy latest to Prod — promote same tested ACR tag
  • GitLeaks (secrets) → Trivy (vulns) → SonarQube (quality gate) = DevSecOps chain
  • Helm = pipeline deploys | Argo CD = Git changes auto-sync (GitOps)
  • Terraform = separate IaC pipeline (plan on PR, apply on merge)
  • Workload Identity Federation for service connections — no PATs/secrets in YAML

Likely Follow-Up Questions

Follow-up Answer
SonarQube gate fails? Pipeline stops — fix code or get approved exception
Helm vs Argo CD? Helm = pipeline pushes; Argo CD watches Git repo and reconciles cluster state
How secure ACR push? Service connection + Managed Identity / Workload Identity Federation

30. Explain Blue-Green and Canary deployments.

Senior Answer (8–9 Years)

Blue-Green: Two identical environmentsBlue = live, Green = new version. On deploy, switch all traffic instantly from Blue to Green. If issues → switch back to Blue immediately. Fast rollback, but 2x cost during deploy.

Canary: Same cluster, two versions running. Route traffic gradually by percentage (5% → 25% → 50% → 100%) to the new version. Monitor metrics/errors. If problems → stop canary (only small % affected). On AKS: Ingress weights, Flagger, or Argo Rollouts.

Rolling update (AKS default): Replace pods incrementally. Control with maxSurge / maxUnavailable. PDB ensures minimum availability. Rollback: kubectl rollout undo or helm rollback.

Important Points

  • Blue-Green = instant switch (not split traffic)
  • Canary = percentage gradual (not two full environments)
  • Rolling = default K8s strategy — pod-by-pod replacement
  • Blue-Green rollback = switch traffic back | Rolling = rollout undo | Canary = set canary weight to 0%

31. How do you perform production rollbacks?

Senior Answer (8–9 Years)

Application rollback (AKS):

helm rollback <release> <revision> -n <namespace>
# or
kubectl rollout undo deployment/<name> -n <namespace>

Pipeline rollback: Redeploy last known-good build from ACR (immutable tags by git SHA, never latest in prod).

Database: Backward-compatible migrations only; use expand-contract pattern; rollback app before rolling back schema.

Infrastructure: terraform apply previous state from version control; never manual portal changes.

Process: Rollback is pre-documented in runbook; triggered when error rate > threshold for 5 min post-deploy; incident channel notified; post-mortem required."


32. Which monitoring tools have you integrated?

Senior Answer (8–9 Years)

"We run a three-pillar observability stack on AKS — metrics, logs, and traces — integrated through Grafana as the single pane of glass.

Metrics — Prometheus + Grafana:

  • Prometheus scrapes AKS node and pod metrics — CPU, memory, restarts, HTTP error rates
  • Grafana dashboards per service with alert rules for pod failures — CrashLoopBackOff, ImagePullBackOff, high restart count, node disk pressure
  • Alerts route to Teams/email for on-call

Logs — Loki + Promtail:

  • Promtail ships pod logs from every node to Loki
  • Grafana queries Loki for log drill-down when an alert fires

Traces — OpenTelemetry:

  • OTel SDK in microservices exports distributed traces
  • Identify latency bottlenecks across dependencies (API → DB → cache)

Azure-native (platform layer):

  • Azure Monitor + Log Analytics for platform and audit logs
  • Container Insights for AKS cluster health
  • Application Insights for APM where enabled
  • Azure Service Health — rule out regional outages first

Integration workflow at HHS: Alert fires in Grafana → check dashboard → drill Loki logs → OTel trace for latency → kubectl if Kubernetes issue. Correlating all three pillars reduced MTTR.

CI/CD: Azure DevOps deployment markers correlated with metric spikes post-release."

Important Points

Pillar Tool Integration
Metrics Prometheus → Grafana Scrape pods/nodes; alert rules → Teams
Logs Promtail → Loki → Grafana Log drill-down from alert
Traces OpenTelemetry → Grafana/backend End-to-end latency
Platform Azure Monitor, Container Insights AKS/node health, KQL queries

Likely Follow-Up Questions

  • How do Prometheus and Grafana connect? (ServiceMonitor / scrape config)
  • What happens when an alert fires? (dashboard → logs → traces → kubectl)
  • Difference between Container Insights and Prometheus?
  • How do you avoid alert fatigue?

33. What alerts do you usually configure in production environments?

Senior Answer (8–9 Years)

"In production we configure category-based alerts — each with a linked runbook. We deliberately reduced alert count so every alert is actionable — no noise.

Category Examples Severity
Availability CrashLoopBackOff, OOMKilled, ImagePullBackOff, Pending, HTTP 5xx rate P1 / Sev 1
Performance P95 latency > 2s, CPU > 85%, memory > 90% P2 / Sev 2
Capacity Disk > 80%, node pool near max, DB connection pool exhaustion P2
Security Key Vault access anomalies, failed login spikes, Defender for Cloud alerts P1–P2
Backup/DR Backup job failed, ASR replication lag P2
Certificate TLS cert expiry < 30 days P2
Cost Budget threshold 80%/100% P3

Alerts route to Teams/on-call. On fire → follow runbook → Grafana → logs → mitigate. We tune thresholds post-incident to keep false positives low."

Important Points

  • Every alert = runbook — no alert without an action
  • Availability = P1 — pod failures and 5xx directly impact users
  • Backup job failed ≠ CI/CD pipeline failure (say the right one in interview)
  • Tune regularly — alert fatigue kills on-call effectiveness

Likely Follow-Up Questions

  • How do you reduce alert fatigue?
  • What's P1 vs P2 in your org?
  • Example runbook for CrashLoopBackOff?

34. How do you monitor MongoDB performance?

Senior Answer (8–9 Years)

"Production uses MongoDB Atlas. We rely on Atlas-native monitoring — Performance Advisor, slow query logs, and Real-Time Performance Panel — plus alerts on connections, replication lag, and opcounters.

Key metrics we watch:

  • CPU, memory, disk usage
  • Active connections and connection pool exhaustion from apps
  • Replication lag (replica set health)
  • Slow queries and missing indexes — look for COLLSCAN (collection scans, not table scans — MongoDB uses collections)
  • Query execution times and index usage

Integration: MongoDB exporterPrometheusGrafana dashboards for unified observability alongside AKS metrics.

Lower environments: Same exporter pipeline; we also use mongostat / explain() on slow queries for troubleshooting.

Red flags: COLLSCAN on large collections, rising connection count, replication lag > threshold, working set exceeding RAM."

Important Points

Term Correct
MongoDB data unit Collections (not tables)
Missing index symptom COLLSCAN in query plan
Your stack Atlas + MongoDB exporter → Prometheus → Grafana

Likely Follow-Up Questions

  • What is COLLSCAN and how do you fix it?
  • How do you find slow queries in Atlas?
  • Connection pool exhaustion — what causes it?

35. Explain Redis replication and failover.

Senior Answer (8–9 Years)

"Redis replication creates an asynchronous copy of the primary for high availability and read scaling. The primary handles writes; replicas sync via PSYNC.

Lower environments — self-managed Redis:

  • Master + replica topology with Redis Sentinel
  • Sentinel monitors the primary; on failure it elects and promotes a replica, then apps reconnect to the new primary

Production — Azure Cache for Redis Premium:

  • Primary-replica replication with automatic failover (Azure-managed on Premium)
  • Zone redundancy for HA across availability zones
  • Private Endpoint — AKS pods connect over the VNet; Private DNS zone resolves the Redis hostname inside the cluster network

Monitoring: used_memory, evicted_keys, connected_clients, replication lag

App side: Connection multiplexer with retry/reconnect logic so failover is transparent to the application."

Important Points

Environment Setup Failover
Lower env Self-hosted + Sentinel Sentinel promotes replica
Production Azure Redis Premium + zone redundancy Azure automatic failover
Network Private Endpoint + Private DNS zone Pods in AKS VNet only

Likely Follow-Up Questions

  • Sync vs async replication — what can you lose on failover?
  • Difference between Sentinel and Redis Cluster?
  • How does Private Endpoint DNS work for Redis from AKS?

36. Explain your experience with Cloud Run, BigQuery, or Airflow.

Senior Answer (8–9 Years)

"My primary cloud is Azure, so most of my production work is AKS, Azure DevOps, and Azure data services.

Cloud Run and BigQuery: I don't have deep hands-on production experience with these. I understand the concepts — Cloud Run is serverless containers (similar to Azure Container Apps), and BigQuery is a serverless analytics warehouse (similar to Azure Synapse / serverless SQL patterns).

Airflow — I have practical experience:

  • Used to author, schedule, and monitor workflows
  • Workflows are defined as DAGs (Directed Acyclic Graphs) in Python code
  • Each DAG has tasks with dependencies — upstream must succeed before downstream runs
  • Supports retries, scheduling (cron), and a UI to track run status
  • Pipelines are version-controlled in Git — scalable and repeatable for ETL/batch jobs

The orchestration concepts transfer: Airflow DAG patterns map closely to Azure Data Factory pipelines and activity dependencies."

Important Points

GCP service If no hands-on Azure equivalent
Cloud Run Serverless containers Container Apps
BigQuery Analytics warehouse Synapse / analytics SQL
Airflow DAG-based orchestration Data Factory pipeline patterns

Interview rule: Be honest on gaps; show you know equivalents and own Airflow clearly.

Likely Follow-Up Questions

  • What is a DAG in Airflow?
  • How do task dependencies work?
  • Airflow vs Azure Data Factory?

37. How do you manage monitoring across Azure and GCP?

Senior Answer (8–9 Years)

"My production depth is Azure, but the multi-cloud pattern we follow — or would follow — is one observability stack, not two.

Unified OSS stack on both clouds:

  • Deploy the same Prometheus + Grafana + Loki (and ELK where needed) on AKS and GKE
  • Single pane of glass — Grafana dashboards with mixed data sources from both clouds

Standardized instrumentation:

  • OpenTelemetry in all microservices — self-hosted collector or export to Application Insights on Azure; same OTel pattern on GCP for trace correlation across services

Platform layer per cloud:

  • Azure: Azure Monitor, Container Insights, Cost Management — resources tagged (environment, team, cost-center) for segregation
  • GCP: Cloud Monitoring + Logging forwarded to the same central sink (or scraped into Prometheus)

Operations: One on-call rotation, one runbook, one escalation policy (Teams/PagerDuty) — avoid duplicating alert logic per cloud. Cloud-specific dashboards only where the platform differs.

Cost: Azure Cost Management + GCP Billing export — unified FinOps view via tags."

Important Points

Principle How
Same stack both clouds Prometheus/Grafana/Loki on AKS + GKE
Trace correlation OpenTelemetry everywhere
No duplicate alerts One runbook, one on-call
Segregation Consistent tags on Azure AND GCP

Likely Follow-Up Questions

  • How do you correlate traces across Azure and GCP services?
  • Grafana mixed data sources — how configured?
  • How avoid alert fatigue across two clouds?

38. How do you perform security hardening in Azure?

Senior Answer (8–9 Years)

"We apply layered security across Azure — identity, network, compute, data, governance, secrets, and CI/CD.

Identity: Enforce MFA, Conditional Access, PIM for admin roles — no standing privileged access, no shared accounts.

Network: WAF on App Gateway/Front Door, Azure Firewall for outbound control, NSGs on subnets, Private Endpoints for PaaS services, DDoS Protection on public-facing resources. No public IPs on production workloads where possible.

Compute / AKS: Private AKS cluster, ACR with trusted images only, Defender for Cloud image scanning, Pod Security Standards, Workload Identity / Managed Identity — no static credentials in pods.

Secrets: Key Vault for secrets, certs, and keys — accessed via Managed Identity, never in code or repos.

Data: Encryption at rest (CMK via Key Vault), TLS 1.2+ enforced, soft delete on storage and Key Vault.

Governance: Azure Policy — deny public storage, require tags, enforce HTTPS, block non-compliant resources at deploy time.

CI/CD (shift-left): GitLeaks, Trivy image scanning, SonarQube in Azure DevOps pipeline before anything reaches prod.

Monitoring: Defender for Cloud secure score, alerts on misconfigurations and anomalies.

Everything is codified in Terraform so security baseline is repeatable and drift is caught early."

Important Points

Layer Your HHS examples
Identity MFA, Conditional Access, PIM
Network WAF, Firewall, NSG, Private Endpoints
Compute Private AKS, ACR, Defender scanning
Secrets Key Vault + Managed Identity
Governance Azure Policy
CI/CD GitLeaks, Trivy, SonarQube

Likely Follow-Up Questions

  • What is PIM and why use it?
  • Private Endpoint vs Service Endpoint?
  • How does Workload Identity work with Key Vault?

39. Have you handled VA/PT findings? Give examples.

Senior Answer (8–9 Years)

"Yes. I participated in vulnerability scanning and penetration testing activities and owned infra-side remediation from scan reports.

Application / image scanning:

  • Trivy in CI pipeline flagged CVEs in container images → we updated base images and rebuilt pipelines
  • OWASP ZAP integrated in Jenkins pipeline for DAST on applications → shared findings with dev team for code fixes

Azure infrastructure findings and remediation:

  1. Soft delete not enabled on critical storage/Key Vault → enabled soft delete + purge protection
  2. NSG ports open to 0.0.0.0/0 → restricted to known IP ranges / Bastion, Azure Policy to deny wide-open rules
  3. Public endpoints where Private Endpoints were required → migrated PaaS to Private Endpoint + Private DNS
  4. TLS 1.0/1.1 on App Gateway → enforced TLS 1.2 minimum SSL policy, rescanned clean
  5. AKS secrets in plain Kubernetes Secrets (base64, not encrypted) → migrated to Key Vault CSI driver + Workload Identity
  6. Overprivileged users / service principals → scoped Azure RBAC to least privilege, custom roles instead of Contributor

Process: Triage by severity/CVSS → assign owner → remediate → rescan/retest → close ticket. Critical findings within agreed SLA."

Important Points

Term Azure interview
IAM Say Azure RBAC (not IAM — that's AWS)
K8s Secrets Base64 ≠ encryption — use Key Vault CSI
ZAP OWASP ZAP — DAST tool
Your tools Trivy (images), ZAP (apps), config review (infra)

Likely Follow-Up Questions

  • Difference between VA scan and PT?
  • How do you prioritize findings?
  • What is Key Vault CSI vs Kubernetes Secrets?

40. How do you secure AKS clusters?

Senior Answer (8–9 Years)

"Cluster access:

  • Private AKS cluster — API server inside the VNet, not publicly accessible
  • Microsoft Entra ID integration with Azure RBAC for Kubernetes — authorized access only, no shared kubeconfig

Network:

  • Network Policies — deny all by default, allow only required pod-to-pod traffic
  • WAF on Azure Front Door for ingress traffic to workloads behind the cluster

Identity & secrets:

  • Workload Identity Federation + Managed Identity — pods access Key Vault via CSI driver, no static credentials or plain Kubernetes Secrets

Images & supply chain:

  • ACR as trusted registry only; Defender for Cloud image scanning + Trivy in CI pipeline

Governance:

  • Azure Policy add-on for AKS — enforce no privileged containers, required labels, approved SKUs
  • Pod Security Admission (restricted/baseline)

Operations:

  • Regular Kubernetes version upgrades and CVE patching
  • Audit logs to Log Analytics"

Important Points

Your term Interview polish
Entra ID Correct — say Microsoft Entra ID + Azure RBAC for K8s
Federated identity Workload Identity Federation
Policy engine Azure Policy add-on for AKS
WAF + Front Door Valid — ingress layer for apps on AKS

Likely Follow-Up Questions

  • Private cluster — how do you run kubectl/CI/CD?
  • Workload Identity vs Managed Identity?
  • Network Policy example — deny all ingress?

41. How do you identify cloud cost leakages?

Senior Answer (8–9 Years)

"I use Azure Cost Management, Azure Monitor, and Azure Advisor to identify where money is being wasted.

How I find leakages:

  • Cost Analysis by resource and tag — spot spikes and unattached resources
  • Advisor recommendations — right-sizing VMs and databases, idle resources
  • Hunt for orphaned disks, snapshots, NICs, and public IPs still billing
  • Review non-prod running 24/7 — VMs and AKS node pools with no auto-scale-down
  • Check storage tiers — Premium disks or Hot blob where Cool/Archive fits

Remediation / prevention:

  • Right-size VMs and DB SKUs per Advisor
  • Remove orphan resources and extra snapshots
  • Cluster Autoscaler on AKS — scale nodes to demand, avoid idle capacity
  • VM auto start/stop schedules for off-hours (dev/test)
  • Blob lifecycle policies — move logs/backups to Cool/Archive
  • Reserved Instances / Savings Plans for predictable steady workloads
  • Tags via Azure Policy for cost allocation and showback

Cost leakage is usually orphaned resources and wrong SKU/tier — monthly FinOps review with Advisor + Cost Management."

Important Points

Identify (find waste) Prevent (fix waste)
Cost Analysis, Advisor Right-sizing, cleanup
Orphan disks/snapshots/IPs Delete + Policy guardrails
Idle non-prod 24/7 Auto-shutdown, Cluster Autoscaler
Wrong storage tier Lifecycle policies

Likely Follow-Up Questions

  • Difference between RI and Savings Plans? (Q43)
  • How does Cluster Autoscaler save money?
  • What tags do you enforce for FinOps?

42. What cost optimization initiatives have you implemented?

Senior Answer (8–9 Years)

  1. Reserved Instances / Savings Plans for stable AKS node pools and SQL — ~35% savings
  2. Spot node pools for CI/CD and batch workloads — ~70% compute savings on those pools
  3. Auto-shutdown schedules for dev/test VMs and scaled-down non-prod AKS at night
  4. Storage lifecycle policies — moved logs/backups to Cool/Archive tier — ~40% storage reduction
  5. Rightsizing via Azure Advisor recommendations — downsized 15 VMs
  6. Tagging enforcement via Policy — enabled showback/chargeback per team
  7. Consolidated Log Analytics workspaces to reduce ingestion duplication

43. Explain Reserved Instances and Savings Plans.

Senior Answer (8–9 Years)

Option Commitment Flexibility Discount Best For
Reserved Instances (RI) 1 or 3 year, specific VM family + region Low — exchange possible for different SKU/region Up to ~72% Stable, predictable workloads (fixed SQL tier)
Savings Plans 1 or 3 year, $/hour spend commitment High — change VM type, region, applies across compute Up to ~65% Mixed/evolving workloads (AKS node pools)

"Both are Azure cost-saving commitment options for 1–3 years.

Reserved Instances lock to a VM family and region — up to 72% discount. You can exchange RIs if needs change, but flexibility is limited.

Savings Plans commit to an hourly dollar spend — up to 65% discount — and flex across VM types and regions. Better when SKUs change, like AKS node pool upgrades.

I analyze 30–90 day utilization in Cost Management before purchasing. Savings Plans are my default for AKS; RIs for steady SQL/Cosmos tiers. Combined with Spot pools for CI/batch, we reduce overall compute cost significantly."

Important Points

  • Savings Plan = $/hour commit, flexible VM family
  • RI = specific SKU, higher discount, less flexible
  • AKS → prefer Savings Plans (node SKUs evolve)
  • Always check 30–90 day usage before committing

Likely Follow-Up Questions

  • Can you use RI and Savings Plan together?
  • What happens if you under-utilize the commitment?
  • Spot vs Savings Plan for AKS?

44. How do you optimize storage and idle resources?

Senior Answer (8–9 Years)

"Storage optimization:

  • Blob lifecycle policies — move logs/backups Hot → Cool → Archive automatically
  • Delete orphaned disks, snapshots, and old unwanted archives
  • Right redundancy tier — LRS for non-critical, ZRS for HA

Idle resource optimization:

  • Use Azure Monitor and Advisor to find unused VMs, disks, public IPs, NICs
  • Deallocate/remove idle VMs and associated disks
  • Auto-shutdown schedules for dev/test VMs off-hours
  • Scale down non-prod AKS node pools when not in use
  • Monthly runbook to clean orphaned storage and unattached resources"

Important Points

  • Monitor + Advisor = find waste
  • Lifecycle policies = prevent storage creep
  • Orphan cleanup = disks, snapshots, NICs, IPs

45. Database response time has increased significantly. What will be your approach?

Senior Answer (8–9 Years)

"I follow a structured top-down approach — confirm scope, check infra, then drill into the database.

1. Confirm & correlate:

  • When did it start? Recent deploy, config change, or Key Vault rotation?
  • Check Grafana/App Insights — latency spike on DB dependency span in OpenTelemetry traces
  • Rule out AKS/network — pod CPU, connection errors, Private Endpoint issues

2. Database metrics (platform layer):

  • Azure SQL / Cosmos: DTU/vCore utilization, connection count, deadlocks, wait stats
  • MongoDB Atlas: Performance Advisor, slow query logs, opcounters, replication lag, connections
  • Postgres: active connections, lock waits, pg_stat_statements

3. Query analysis:

  • Identify slow queries — missing indexes, COLLSCAN, full table scans
  • Run explain() / Query Performance Insight
  • Check if app released a bad query or N+1 pattern after deploy

4. Resource & capacity:

  • CPU/memory/IO saturation on DB tier — need scale-up?
  • Connection pool exhaustion from AKS pods — too many replicas or leak?
  • Redis cache miss rate — is load hitting DB directly?

5. Mitigate & escalate:

  • Short-term: scale DB tier, kill long-running queries, enable read replica
  • Engage DBA team for schema/index fixes
  • Document RCA — index added, query optimized, pool size tuned

At HHS: Platform owns infra metrics and connectivity; DBA owns schema/query tuning — I bridge both with traces and connection troubleshooting."

Important Points

Layer Check
App/trace OTel — is DB span the bottleneck?
Connectivity Private Endpoint, NSG, connection string
DB metrics DTU/vCore, connections, replication lag
Queries Slow query log, missing indexes, COLLSCAN
Cache Redis hit rate — bypass cache?

Likely Follow-Up Questions

  • How find slow query in MongoDB Atlas?
  • Connection pool exhaustion — symptoms?
  • Scale-up vs scale-out for Azure SQL?

46. How do you prioritize multiple production issues?

Senior Answer (8–9 Years)

"I use impact × urgency matrix:

Priority Criteria Example
P1 Full outage, revenue/security impact Payment down, data breach
P2 Major degradation, workaround exists Slow checkout, single region
P3 Minor impact, few users Non-critical report failing
P4 Cosmetic / planned UI alignment issue

Rules:

  • P1 always wins — all hands, incident commander assigned
  • If two P1s: business impact (revenue > internal tools)
  • Communicate reprioritization to stakeholders
  • Don't context-switch — assign owners per incident
  • Post-incident: review if alerts should have fired earlier"

47. Describe a critical production issue that you resolved.

Senior Answer (8–9 Years)

Situation: At HHS, after a Terraform apply that recreated cluster identity resources, multiple microservices entered ImagePullBackOff — pods could not pull container images from ACR. This was a P1 — production workloads for state government clients were failing to start. Grafana alert fired on pod restart count.

Task: As on-call platform engineer, I needed to restore image pull capability and get pods running — fast.

Action:

  • Joined Teams war room, confirmed no Azure regional outage (Service Health clean)
  • kubectl describe pod → Events showed 401/403 unauthorized pulling from ACR
  • Correlated with recent Terraform applyManaged Identity was recreated but AcrPull role assignment was missing
  • Fixed RBAC in Terraform — re-applied AcrPull on the kubelet identity → ACR
  • Triggered pod restart / redeploy → verified images pulling and pods Running
  • Notified client team of status and resolution

Result:

  • Service restored in ~10 minutes
  • Added Terraform validation step to verify MI role assignments post-apply
  • Added Grafana alert for ImagePullBackOff with linked runbook
  • Documented RCA in ticket; shared learnings in post-mortem

Lesson: Infrastructure changes that touch identity must include RBAC validation — ACR pull depends on AcrPull role on the cluster's Managed Identity."

Important Points

  • Use your real story — ImagePullBackOff, not a generic outage
  • STAR format: Situation → Task → Action → Result
  • Root cause: Terraform + missing AcrPull — shows depth

Likely Follow-Up Questions

  • How did you diagnose it was ACR auth and not a bad image tag?
  • What is AcrPull role?
  • How prevent this in future?

SECTION 3: QUICK-FIRE CHEAT SHEET


Kubernetes Commands to Know

kubectl get pods -A -o wide
kubectl describe pod <name> -n <ns>
kubectl logs <pod> -n <ns> --previous -f
kubectl top pods/nodes -A
kubectl rollout status deployment/<name>
kubectl rollout undo deployment/<name>
kubectl exec -it <pod> -n <ns> -- /bin/sh
kubectl get events -n <ns> --sort-by='.lastTimestamp'

Useful KQL Snippets

// Failed requests last hour
requests | where timestamp > ago(1h) and success == false | summarize count() by resultCode, name

// Exceptions spike
exceptions | where timestamp > ago(1h) | summarize count() by type, outerMessage

// Pod restarts
KubePodInventory | where PodRestartCount > 0 | project Computer, Name, PodRestartCount

Azure CLI Essentials

az aks get-credentials --resource-group <rg> --name <cluster>
az vm list-usage --location eastus
az backup recoverypoint list --vault-name <vault> --container-name <vm>
az network watcher test-connectivity --source-resource <vm> --dest-address <ip>
az aks upgrade --resource-group <rg> --name <cluster> --kubernetes-version 1.28.5

SECTION 4: QUESTIONS TO ASK THE INTERVIEWER

  1. What does the Azure landing zone / platform team structure look like?
  2. What is the current split between IaaS, PaaS, and AKS workloads?
  3. How mature are CI/CD and GitOps practices today?
  4. What does the on-call rotation and incident management process look like?
  5. Are there multi-cloud requirements (Azure + GCP/AWS)?
  6. What are the top 3 platform challenges you're hiring this role to solve?
  7. How does the team approach FinOps and cost accountability?
  8. What does success look like in the first 90 days?

SECTION 5: PRE-INTERVIEW CHECKLIST

  • Customize all [PLACEHOLDER] and sample company references
  • Prepare 2–3 STAR stories (incident, cost optimization, migration)
  • Be ready to whiteboard hub-spoke + AKS architecture
  • Review Azure Service Health for any recent outages (shows awareness)
  • Have salary expectations and notice period ready
  • Test video/audio if virtual interview
  • Keep this doc open on second monitor for quick reference (don't read verbatim!)

SECTION 6: FOLLOW-UP QUESTION BANK (Senior Cross-Questions)

Questions interviewers ask after your main answer. Prepare 2–3 sentences each.

Q1–3: Intro / Architecture

Follow-up Senior answer hint
Why Azure over AWS for this project? Gov compliance, client requirement, AKS + Azure DevOps integration, existing Entra ID
What was your biggest architecture challenge? Private AKS + PE for all PaaS + multi-state latency via Front Door
How many environments? Dev, QA, Prod — separate namespaces/subscriptions, promotion via pipeline

Q4–7: Azure Services / AKS

Follow-up Senior answer hint
Private vs public AKS cluster? Private API server, private nodes, Bastion for admin, no public kube API
How do pods pull from ACR privately? PE on ACR + Managed Identity AcrPull + Private DNS
Node pool failed to provision? Quota, subnet full, NSG blocking, SKU unavailable — check az aks nodepool list events
AKS upgrade failed mid-way? Surge nodes, cordon/drain, rollback node pool to previous K8s version

Q8–10: Kubernetes / CI/CD

Follow-up Senior answer hint
Deployment vs StatefulSet? Deployment = stateless; StatefulSet = stable identity + PV for DB
How do you rollback a bad deploy? helm rollback, same ACR tag, never deploy latest in prod
SonarQube quality gate failed — what now? Fix code or get exception approval; never bypass gate in gov prod
GitLeaks found a secret? Revoke key in Azure, rotate in Key Vault, rewrite git history or invalidate commit

Q11–12: Networking

Follow-up Senior answer hint
NSG vs Azure Firewall? NSG = L3/L4 subnet/NIC rules; Firewall = centralized L3–L7 app rules, IDPS
PE vs Service Endpoint? PE = private IP in VNet; Service Endpoint = VNet routing, service still has public IP
DNS not resolving for PE? Private DNS Zone not linked to VNet, or AKS using public DNS forwarder

Q5–6: Backup / DR

Follow-up Senior answer hint
Where are VM backups stored? Recovery Services Vault — not Blob (Velero uses Blob)
RTO vs RPO? RPO = max data loss window; RTO = max downtime; define per tier
Tested restore ever? Yes — GitLab VM restore drill, Velero namespace restore to QA

Extra Senior Questions (Added — May Be Asked)

# Question Why it matters
E1 Explain Workload Identity Federation step-by-step You mentioned OIDC — they will probe
E2 How does AGIC route traffic? App Gateway → AGIC → Ingress → Service → Pod
E3 Terraform state locking — how? Blob backend + native state lock
E4 How do you secure pipeline service connections? Workload Identity Federation, no secrets in YAML
E5 What Azure Policy do you apply on AKS? No privileged containers, required labels, allowed images
E6 How do you handle secrets in Git? GitLeaks in CI, Key Vault CSI in cluster, never commit
E7 Difference between LRS, GRS, ZRS? Redundancy tiers for Storage and Backup vault
E8 How do you do zero-downtime deploy on AKS? Rolling update + readiness probes + Helm maxUnavailable

Practice Senior Answer sections aloud. Use Important Points as a checklist before each mock question.