Azure & DevOps Interview Preparation Guide

🎯 SENIOR INTERVIEW MODE (Read Before Every Session)

You are preparing for a senior Azure/DevOps role (8–9 years expected depth). Your real experience is ~6 years — that is enough if you demonstrate senior-level thinking: architecture, security, trade-offs, failure scenarios, and cross-team ownership.

How Senior Answers Differ from Junior

Junior (❌)	Senior (✅)
"Private endpoint uses private IP"	Same + Private Link, Private DNS Zone, disable public access, NSG on PE subnet, gov compliance reason
Lists tools	Explains why you chose them and what breaks without them
One-line definitions	Definition + HHS example + follow-up readiness
"I managed AKS"	"I owned AKS platform — create, secure, upgrade, DR, cost, incidents"

Document Format (per question)

Section	Purpose
Senior Answer (8–9 Years)	Complete interview-ready response — practice this
Important Points	Terms, order, and traps — do not skip in the interview
Likely Follow-Up Questions	Cross-questions interviewers ask next

Customize answers with HHS / AKS / 30+ microservices where relevant. Do not claim tools you have not used.

See Follow-Up Question Bank at the bottom of this doc.

Your Profile (Mock Interview — Live Updates)

Item	Your Details
Name	X
Interview target	Senior Azure DevOps / Platform Engineer (8–9 yr depth)
Real experience	~6 years hands-on — frame as senior-scope ownership
Cloud	Azure (primary) + AWS
Certs	AZ-900, AWS Solutions Architect
Recent project	HHS — healthcare, US state gov (FL, IA, SD), 30+ microservices on AKS
Core work	Platform ownership: AKS, Terraform, Azure DevOps, DR, security, FinOps
Company	HHS
Job title	DevOps / Cloud Engineer (platform owner on gov healthcare)

How to Use This Document

Practice aloud — aim for 60–90 seconds per intro question, 2–3 minutes for architecture questions.
Use STAR (Situation, Task, Action, Result) for behavioral and incident questions.
Mock interview progress is tracked below — one question at a time.

Mock Interview Progress

#	Question	Status
1	Introduce yourself	✅
2	Years in Azure & DevOps	✅
3	Recent project architecture	✅
4	Azure services worked on	✅
5	VMs, Storage, Backup	✅
6	DR & Azure Backup	✅
7	AKS clusters	✅
8	Pods, Deployments, Services	✅
9	CI/CD tools	✅
10	Dev to Prod deployment	✅
11	VNets and NSGs	✅
12	Public vs Private Endpoints	✅
13	Monitoring tools	✅
14	Troubleshoot production	✅
15	Databases	✅
16	DBA activities	✅
17	Production incidents	✅
18	Client coordination	✅
19	Azure environment architecture	✅
20	Unreachable VM troubleshooting	✅
21	DR testing in Azure	✅
22	Complete AKS architecture	✅
23	AKS cluster upgrades	✅
24	CrashLoopBackOff troubleshooting	✅
25	HPA and Cluster Autoscaler	✅
26	Slow AKS app troubleshooting	✅
27	VNet/NSG/Route Tables	✅
28	ExpressRoute and VPN	✅ Passed
29	Azure DevOps pipeline architecture	✅ Passed
30	Blue-Green and Canary	✅ Passed
31	Production rollbacks	✅ Passed
32	Monitoring tools integrated	✅ Passed
33	Production alerts configured	✅ Passed
34	MongoDB performance monitoring	✅ Passed
35	Redis replication and failover	✅ Passed
36	GCP Cloud Run / BigQuery / Airflow	⏳ (repeat at end)
37	Multi-cloud monitoring (Azure + GCP)	✅ Passed
38	Security hardening in Azure	✅ Passed
39	VA/PT findings	✅ Passed
40	Secure AKS clusters	✅ Passed
41	Cloud cost leakages	✅ Passed
42	Cost optimization initiatives	⏭️ Skipped (same as Q41)
43	Reserved Instances & Savings Plans	✅ Passed
44	Storage & idle resource optimization	✅ Passed
45	Database response time troubleshooting	✅ Passed
46	Prioritize multiple production issues	✅ Passed
47	Critical production issue (STAR)	⏳

SECTION 1: INTRODUCTION & BACKGROUND

Document format: Each question has a Senior Answer, Important Points, and Likely Follow-Ups. No raw mock transcripts — customize with your HHS experience when practicing.

1. Please introduce yourself and explain your current role and responsibilities.

Senior Answer (8–9 Years)

"Good morning. I'm X, a DevOps Engineer with six years of experience building secure and scalable infrastructure on Azure and AWS, with my primary focus on Microsoft Azure.

In my role at HHS, I worked on a healthcare platform for US state government clients — Florida, Iowa, and South Dakota. I owned end-to-end infrastructure: provisioning cloud resources, deploying applications on AKS, and day-to-day Kubernetes cluster operations — including upgrades, monitoring, and troubleshooting.

My responsibilities fall into four areas:

Platform & Kubernetes: AKS cluster maintenance, node pool updates, ingress, and workload deployments

CI/CD & Automation: Pipelines for 30+ microservices in Azure DevOps and Jenkins, including automated tasks such as refreshing dev/stage Postgres databases from production data on demand

Operations: Incident response, cluster health monitoring, and fast troubleshooting during production issues

Security & Compliance: Government healthcare compliance requirements and secure pipeline practices

I hold AZ-900 and AWS Solutions Architect certifications. I'm looking to bring my hands-on AKS and Azure DevOps experience into a role where I can own platform reliability at scale.

Thank you."

2. How many years of experience do you have in Azure and DevOps?

Senior Answer (8–9 Years)

"I have six years of experience in DevOps and cloud engineering, working across both Azure and AWS.

Azure: Approximately four years of hands-on experience — including VMs, VNets, NSGs, Application Gateway, AKS, ACR, and Azure DevOps pipelines. At HHS, my recent production work has been primarily on Azure — deploying and operating 30+ microservices on AKS using Azure DevOps for CI/CD.

AWS: I also have solid AWS exposure from earlier projects — EC2, S3, IAM, and pipeline integrations — which helps when working in multi-cloud environments.

My growth path:

Early years: Git, Jenkins, basic cloud VMs, CI/CD fundamentals

Mid years: Infrastructure as Code with Terraform, containerization with Docker

Recent years: Deep Azure focus — AKS cluster operations, Azure DevOps repos and pipelines, RBAC for team access, and government healthcare compliance

So while I'm comfortable on both clouds, my deepest recent production experience is on Azure, especially AKS, Azure DevOps, and day-to-day platform operations. I'm fully comfortable owning infrastructure from provisioning through deployment and production support."

3. Describe your recent project architecture.

Senior Answer (8–9 Years)

"My recent project at HHS is a healthcare platform for US state governments — Florida, Iowa, South Dakota — running 30+ microservices on Azure.

1. Infrastructure (IaC): Everything is provisioned with Terraform — VNet, subnets, AKS, ACR, Application Gateway, Front Door, Key Vault, and supporting resources.

2. Networking: We use a VNet with private subnets. AKS runs on private nodes — no direct public exposure. Azure Front Door is the global entry point, then traffic flows to Application Gateway (WAF), which routes to the cluster via the Ingress Controller (AGIC).

3. Traffic flow:
User (Browser)
  → Azure DNS
  → Azure Front Door (CDN/WAF/global routing)
  → Application Gateway (WAF, SSL termination)
  → Ingress Controller (inside AKS)
  → Kubernetes Services
  → Pods (30+ microservices)
  → PostgreSQL (operational DB)

4. CI/CD (Azure DevOps): On pipeline trigger → code passes DevSecOps checks → Docker image built and pushed to ACR → deployed to AKS via Helm/kubectl. Secrets come from Azure Key Vault using service connections and Managed Identity; they are never stored in code.

5. Monitoring: Prometheus + Grafana inside the cluster for metrics; logs and operational DB monitoring for health and troubleshooting.

6. DR/Backup: VM snapshots, Azure Backup, and Site Recovery for critical components.

Environments: Dev → Stage → Production with separate namespaces/subscriptions and promotion via Azure DevOps pipelines."

4. Which Azure services have you worked on extensively?

Senior Answer (8–9 Years)

"At HHS, I've worked hands-on across the full Azure stack for a healthcare platform with 30+ microservices:

Compute & containers: AKS (private clusters, node pools, upgrades), Azure VMs (early phase + GitLab), Azure Functions (VM auto-shutdown for cost)

Networking: VNet, subnets, NSGs, route tables, Azure Front Door, Application Gateway (WAF), AGIC, Private Endpoints, Private DNS zones, Azure Firewall, Azure Bastion

DevOps & registry: Azure DevOps (CI/CD), ACR, Argo CD (GitOps), Terraform (IaC)

Security & identity: Key Vault, Microsoft Entra ID, Managed Identity, Workload Identity Federation, Azure RBAC, Azure Policy, Defender for Cloud

Monitoring: Azure Monitor, Log Analytics, Container Insights, Application Insights — plus Prometheus/Grafana/OpenTelemetry on AKS

Storage & backup: Blob Storage, managed disks, Azure Backup, Recovery Services Vault, Azure Site Recovery

Data services: Azure Database for PostgreSQL, MongoDB Atlas, Azure Cache for Redis Premium

I provision and operate most of these via Terraform — not ad-hoc portal clicks."

Important Points

Group by category (compute, network, security, data) — shows breadth
Lead with AKS — your core strength
Mention Terraform — shows IaC discipline

Likely Follow-Up Questions

Which service do you know deepest? (AKS + networking)
How do Private Endpoints work with AKS?
Difference between Front Door and Application Gateway?

5. Have you managed Azure VMs, Storage Accounts, and Backup services?

Senior Answer (8–9 Years)

"Yes, I have managed all three extensively — especially in our early VM-based phase before we moved to AKS.

Azure VMs:

Hosted applications directly on VMs and self-hosted GitLab on a dedicated VM

Ran Docker workloads on VMs before containerizing to AKS

Built Azure Functions to automatically shut down dev/test VMs during off-peak hours (nights/weekends) for cost savings

Connected via Azure Bastion — no public RDP/SSH

Backup & Snapshots:

Configured daily VM backups via Azure Backup — e.g., GitLab VM backed up daily after business hours

Took disk snapshots before changes and for restore drills

Stored backups in Recovery Services Vault with defined retention

Storage Accounts:

Used Azure Blob Storage for application data and file storage

Stored Terraform remote state in Azure Blob Storage, shared across the team (the azurerm backend locks state with blob leases); the Terraform code itself is version-controlled

Configured appropriate access with RBAC and private access where needed

So yes — full lifecycle: provision VMs → backup/snapshot → cost optimize → migrate to AKS when ready."
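
The off-peak shutdown rule above can be sketched as a small decision function. This is an illustration only: the business window (07:00 to 19:00, Monday to Friday) and the name should_be_running are assumptions, not the schedule actually used, and a real Azure Function would go on to deallocate the VM through the Compute API.

```python
from datetime import datetime

# Assumed business window; adjust to the team's real schedule.
BUSINESS_START_HOUR = 7   # 07:00 local time
BUSINESS_END_HOUR = 19    # 19:00 local time


def should_be_running(now: datetime) -> bool:
    """Return True if a dev/test VM should be powered on at `now`."""
    is_weekday = now.weekday() < 5  # Monday=0 ... Friday=4
    in_hours = BUSINESS_START_HOUR <= now.hour < BUSINESS_END_HOUR
    return is_weekday and in_hours


if __name__ == "__main__":
    # A Wednesday morning and a Saturday morning.
    for ts in (datetime(2024, 1, 10, 10, 0), datetime(2024, 1, 13, 10, 0)):
        action = "keep running" if should_be_running(ts) else "deallocate"
        print(ts.isoformat(), "->", action)
```

Keeping the decision separate from the Azure call makes the schedule easy to unit test before it touches any VM.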

6. Have you configured Disaster Recovery or Azure Backup?

Senior Answer (8–9 Years)

"Yes — I've configured both Azure Backup and Azure Site Recovery (ASR), plus AKS-specific backup with Velero.

Azure Backup (VMs & Disks):

Point-in-time restore with scheduled backup policies

Retention: typically 30–90 days based on compliance and RPO requirements

Frequency: weekly by default; daily or hourly for high-change or critical workloads

Stored in Recovery Services Vault (Microsoft-managed — not a Storage Account we pick)

Used for VMs (e.g., self-hosted GitLab) and managed disk backups for StatefulSet PVs

Azure Site Recovery (DR):

Replicates VMs from primary region to secondary region

On disaster → test failover or committed failover → services continue in DR region

Combined with Azure Front Door / DNS cutover for traffic routing

AKS Backup (Velero):

Namespace-level incremental backups — twice weekly

Backup storage in Azure Blob Storage

Can restore namespaces to same or another region

PV/PVC (StatefulSets): disk-level backup via Azure Backup on underlying managed disks

Key Vault: Yes. Key Vault supports backup and restore of individual secrets, keys, and certificates (az keyvault secret backup, az keyvault key backup, az keyvault certificate backup). We also enable soft delete and purge protection for compliance.

Full DR strategy: Terraform recreates infrastructure in DR region → restore Velero (K8s resources) from Blob → restore disks/VMs from Azure Backup → ASR failover for replicated VMs → DNS/Front Door cutover."
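
Interviewers often ask what a backup frequency means for RPO. The arithmetic can be sketched as below, assuming backups are evenly spaced across the week (an assumption; real schedules may cluster). The function names and the 24-hour target are hypothetical.

```python
HOURS_PER_WEEK = 7 * 24  # 168


def worst_case_rpo_hours(backups_per_week: float) -> float:
    """Longest gap between two backups, assuming they are evenly spaced."""
    if backups_per_week <= 0:
        raise ValueError("backups_per_week must be positive")
    return HOURS_PER_WEEK / backups_per_week


def meets_rpo(backups_per_week: float, target_rpo_hours: float) -> bool:
    """True if the worst-case gap fits inside the RPO target."""
    return worst_case_rpo_hours(backups_per_week) <= target_rpo_hours


if __name__ == "__main__":
    print(worst_case_rpo_hours(2))   # twice-weekly namespace backup
    print(worst_case_rpo_hours(7))   # daily VM backup
    print(meets_rpo(2, 24))
```

Under this assumption a twice-weekly backup leaves a gap of up to 84 hours, which is why it suits Kubernetes resources that can also be re-applied from Git, while databases rely on tighter mechanisms such as PITR.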

7. Have you created or managed AKS clusters?

Senior Answer (8–9 Years)

"Yes — AKS is my primary platform at HHS. I own full cluster lifecycle for 30+ microservices across dev, stage, and production.

Provisioning: Clusters provisioned via Terraform — private cluster, VNet-integrated, Azure CNI, separate system and user node pools.

Day-to-day operations:

Cluster and node pool upgrades (Kubernetes version, node image)

Node pool management — standard pools for apps, Spot pools for CI/batch

Ingress via AGIC + Application Gateway + Front Door

Add-ons: Prometheus/Grafana, Velero, Key Vault CSI, OpenTelemetry collector

Security & identity:

Private cluster — API server not public

Workload Identity Federation + Managed Identity for ACR pull and Key Vault access

Network policies, Azure Policy add-on, Pod Security Standards

Deployments: Helm via Azure DevOps pipelines + Argo CD GitOps sync

Backup: Velero namespace backups to Blob; disk-level backup for StatefulSet PVs

Troubleshooting: CrashLoopBackOff, ImagePullBackOff, HPA/Cluster Autoscaler tuning, performance issues — daily operational work."

Important Points

Private cluster + Terraform = senior signals
Separate system vs user node pools
Identity: Workload Identity not service principal secrets

Likely Follow-Up Questions

How do you upgrade AKS without downtime?
System vs user node pool — why separate?
How do pods pull images from ACR without credentials?

8. What are Pods, Deployments, Services, and Namespaces?

Senior Answer (8–9 Years)

"Pod: Smallest deployable unit in Kubernetes. One or more containers sharing network and storage. Ephemeral: if it dies, its controller (for example, a Deployment) creates a replacement.

Deployment: Controller that manages pod replicas, rolling updates, and rollbacks. You declare desired state (e.g., 3 replicas); the Deployment maintains it.

Service: Stable network endpoint for pods (which have changing IPs). Routes traffic via label selectors. Types:

ClusterIP — internal only (most common)

NodePort: exposes the Service on a static port on every node

LoadBalancer — external access via cloud LB (e.g., Azure Load Balancer)

Namespace: Logical isolation within a cluster — separate dev/staging/prod, RBAC, quotas, and network policies.

One-liner: Pod runs containers → Deployment manages pods → Service gives stable address → Namespace isolates workloads.

Bonus (if asked): StatefulSet for stateful apps like databases — stable pod names (pod-0, pod-1), ordered deployment, and PersistentVolumes for storage."
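
How a Service finds its pods through label selectors can be illustrated with a small model. This is a sketch of equality-based matching only, not the Kubernetes implementation; the pod names and labels are made up.

```python
from typing import Dict, List


def select_pods(selector: Dict[str, str], pods: List[dict]) -> List[str]:
    """Return names of pods whose labels contain every selector key/value."""
    return [
        pod["name"]
        for pod in pods
        if all(pod["labels"].get(k) == v for k, v in selector.items())
    ]


# Hypothetical pods: two replicas of one Deployment and one unrelated pod.
pods = [
    {"name": "claims-7d9f-abc", "labels": {"app": "claims", "tier": "api"}},
    {"name": "claims-7d9f-def", "labels": {"app": "claims", "tier": "api"}},
    {"name": "billing-5c4b-xyz", "labels": {"app": "billing", "tier": "api"}},
]

if __name__ == "__main__":
    print(select_pods({"app": "claims"}, pods))
```

Because matching is by label rather than by pod name or IP, replacement pods created by the Deployment are picked up automatically as long as they carry the same labels.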

9. Which CI/CD tools have you used?

Senior Answer (8–9 Years)

"My primary CI/CD platform is Azure DevOps — pipelines for 30+ microservices at HHS. I also have experience with Jenkins (OWASP ZAP DAST integration) and Argo CD for GitOps.

Azure DevOps (main):

CI: Build → test → GitLeaks (secrets) → Trivy (vulnerabilities) → SonarQube (quality gate) → Docker build → push to ACR

CD: Promote same image tag (buildId + git SHA) through Dev → Stage → Prod with manual approval on production

Deploy to AKS via Helm; service connections use Workload Identity Federation

Argo CD (GitOps):

Watches Git repo for manifest/Helm value changes → auto-syncs to cluster

Separates what runs (Git) from how it's built (pipeline)

Terraform (IaC pipeline):

Separate pipeline — terraform plan on PR, terraform apply on merge

Provisions VNet, AKS, Key Vault, App Gateway, etc.

Jenkins: Used for application DAST scanning with OWASP ZAP and reporting to dev team.

Image tagging rule: Never latest in production — immutable tags by git SHA."
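
The tagging rule can be expressed as a small guard that a pipeline step might run before deploy. This is a sketch under assumptions: the helper names are hypothetical, the registry myacr.azurecr.io and repository claims are placeholders, and the 7-character short SHA is one common convention rather than the project's confirmed format.

```python
import re

# Short (7) to full (40) lowercase hex git SHA.
_SHA_RE = re.compile(r"^[0-9a-f]{7,40}$")


def build_image_tag(build_id: int, git_sha: str) -> str:
    """Compose an immutable tag of the form <buildId>-<short git SHA>."""
    sha = git_sha.strip().lower()
    if not _SHA_RE.match(sha):
        raise ValueError(f"not a git SHA: {git_sha!r}")
    return f"{build_id}-{sha[:7]}"


def assert_deployable(image_ref: str) -> None:
    """Raise ValueError for untagged images or the mutable 'latest' tag."""
    _, sep, tag = image_ref.rpartition(":")
    # A '/' after the last ':' means the colon belonged to a registry port.
    if not sep or "/" in tag:
        raise ValueError(f"image has no tag: {image_ref}")
    if tag == "latest":
        raise ValueError(f"mutable tag not allowed: {image_ref}")


if __name__ == "__main__":
    tag = build_image_tag(1234, "abc1234def5678")
    ref = f"myacr.azurecr.io/claims:{tag}"  # hypothetical registry and repo
    assert_deployable(ref)
    print(ref)
```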

Important Points

Tool	Role
Azure DevOps	CI/CD — build, scan, deploy
Argo CD	GitOps — cluster sync from Git
Terraform	IaC — infrastructure pipeline
Jenkins	DAST / legacy pipelines

Likely Follow-Up Questions

Helm vs Argo CD — when use which?
What happens if SonarQube gate fails?
How do you secure pipeline service connections?

10. Explain your deployment process from Dev to Production.

Senior Answer (8–9 Years)

"We follow a promotion-based pipeline across Dev → Stage → Production with separate namespaces and environment-specific configuration.

1. Developer flow:

Feature branch → PR → merge to develop

CI pipeline triggers automatically

2. CI (build once, deploy many):

Checkout → GitLeaks → Trivy filesystem scan → build → unit tests → SonarQube quality gate

Docker build → Trivy image scan → push to ACR with tag buildId-gitSHA

Same image is promoted — never rebuild per environment

3. CD — environment promotion:

Dev: Auto-deploy on CI success — smoke test

Stage/QA: Integration tests, regression, performance checks

Production: Manual approval gate → deploy via Helm to AKS

Argo CD syncs Git-managed manifests where GitOps is enabled

4. Configuration per environment:

Helm values / ConfigMaps differ per env (replicas, endpoints, feature flags)

Secrets from Key Vault via CSI driver — never in Git or pipeline YAML

5. Post-deploy verification:

Health probes pass, Grafana dashboards clean, smoke test

Rollback ready: helm rollback or redeploy last-known-good ACR tag

6. Infrastructure changes:

Terraform in separate pipeline — plan on PR, apply on merge — never manual portal changes

Gov healthcare: Change windows, client notification for prod releases, audit trail in Azure DevOps."
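
The promotion rules above (same tag everywhere, no skipping environments, manual approval for production) can be modeled in a few lines. This is a simplified sketch, not Azure DevOps behavior; Release, promote, and the environment names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

ENV_ORDER = ["dev", "stage", "prod"]


@dataclass
class Release:
    image_tag: str                       # built once, never rebuilt per environment
    deployed_to: List[str] = field(default_factory=list)


def promote(release: Release, target: str, checks_passed: bool,
            approved: bool = False) -> Release:
    """Record a deployment of the same image tag to `target` if all gates pass."""
    if target not in ENV_ORDER:
        raise ValueError(f"unknown environment: {target}")
    earlier = ENV_ORDER[:ENV_ORDER.index(target)]
    missing = [env for env in earlier if env not in release.deployed_to]
    if missing:
        raise RuntimeError(f"{release.image_tag} must pass {missing} first")
    if not checks_passed:
        raise RuntimeError(f"checks failed for {target}")
    if target == "prod" and not approved:
        raise PermissionError("production requires manual approval")
    if target not in release.deployed_to:
        release.deployed_to.append(target)
    return release


if __name__ == "__main__":
    rel = Release("1234-abc1234")
    promote(rel, "dev", checks_passed=True)
    promote(rel, "stage", checks_passed=True)
    promote(rel, "prod", checks_passed=True, approved=True)
    print(rel.deployed_to)
```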

Important Points

Build once, promote same tag — critical senior principle
Manual approval for production
Secrets from Key Vault, not pipeline variables
Terraform separate from app CD

Likely Follow-Up Questions

How do you handle database migrations across environments?
What if Stage passes but Prod fails?
How do you rollback production?

11. What are VNets and NSGs?

Senior Answer (8–9 Years)

VNet (Virtual Network): A private, logically isolated network in Azure. You define an IP address range (CIDR) — e.g., 10.0.0.0/16 — and divide it into subnets. Resources like AKS nodes, VMs, and Application Gateway get private IPs from these subnets. VNets can peer with other VNets and connect on-premises via VPN/ExpressRoute.

NSG (Network Security Group): A firewall rule set that controls inbound and outbound traffic. Attached to a subnet or network interface (NIC). Rules define source/destination IP, port, protocol, and allow/deny, and are evaluated in priority order (lowest number first); the first matching rule wins. Example: allow 443 from App Gateway subnet; deny all other inbound.

Together: VNet = network layout; NSG = traffic filtering. In our HHS project, AKS runs in a private subnet with NSGs controlling what traffic can reach the cluster.
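
Subnetting a VNet range and evaluating NSG rules by priority can be illustrated with the standard ipaddress module. This is a simplified model: it ignores protocol, destination address, service tags, and Azure's built-in default rules, and the subnet 10.0.1.0/24 for Application Gateway is an assumption.

```python
import ipaddress
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Rule:
    priority: int   # lower number is evaluated first
    source: str     # source CIDR
    port: int       # destination port; 0 means any port in this sketch
    allow: bool


def evaluate(rules: List[Rule], src_ip: str, dst_port: int) -> bool:
    """Return the decision of the first matching rule, in priority order."""
    ip = ipaddress.ip_address(src_ip)
    for rule in sorted(rules, key=lambda r: r.priority):
        in_range = ip in ipaddress.ip_network(rule.source)
        port_ok = rule.port in (0, dst_port)
        if in_range and port_ok:
            return rule.allow
    return False  # nothing matched: treat as deny


# Carve the example VNet range into /24 subnets.
vnet = ipaddress.ip_network("10.0.0.0/16")
subnets = list(vnet.subnets(new_prefix=24))

rules = [
    Rule(priority=100, source="10.0.1.0/24", port=443, allow=True),   # App Gateway subnet (assumed)
    Rule(priority=4096, source="0.0.0.0/0", port=0, allow=False),     # deny all other inbound
]

if __name__ == "__main__":
    print(len(subnets), subnets[1])
    print(evaluate(rules, "10.0.1.5", 443))
    print(evaluate(rules, "10.0.2.5", 443))
```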

12. What is the difference between Public and Private Endpoints?

Senior Answer (8–9 Years)

"Public Endpoint: The Azure PaaS service is reachable over the public internet via a public hostname/IP — e.g., myaccount.blob.core.windows.net. Traffic traverses the internet. You rely on firewall rules, SAS tokens, or service-level auth.

Private Endpoint: A private IP from your VNet subnet is assigned to the PaaS service via Azure Private Link. Traffic stays on the Microsoft backbone — never exposed to the public internet. DNS resolves to the private IP via a Private DNS zone.

When we use Private Endpoints (production at HHS):

ACR — pods pull images without public registry access

Key Vault — secrets accessed only from VNet

Blob Storage, PostgreSQL, Redis Premium — no public exposure

Key differences:

	Public Endpoint	Private Endpoint
Network path	Internet	Microsoft backbone → VNet
IP	Public Azure IP	Private IP in your subnet
DNS	Public FQDN	Private DNS zone required
Security	Firewall + auth rules	Network isolation by default
Use case	Dev/test, public APIs	Production, compliance (healthcare)

Service Endpoint vs Private Endpoint: A Service Endpoint routes VNet traffic to the Azure service over the Microsoft backbone, but the service still has a public endpoint. A Private Endpoint gives stronger isolation: a dedicated private IP in your VNet per service.

At HHS, all production PaaS uses Private Endpoints — required for government healthcare compliance."
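
A quick way to confirm that a Private Endpoint is actually in use is to check that the PaaS hostname resolves to an address inside the Private Endpoint subnet. The check itself is sketched below; the subnet 10.0.3.0/24 and the sample addresses are assumptions, and the resolved IP would come from running nslookup inside a pod or from socket.gethostbyname.

```python
import ipaddress


def resolves_privately(resolved_ip: str, pe_subnet: str) -> bool:
    """True if the IP a PaaS hostname resolved to lies inside the PE subnet."""
    return ipaddress.ip_address(resolved_ip) in ipaddress.ip_network(pe_subnet)


if __name__ == "__main__":
    pe_subnet = "10.0.3.0/24"  # hypothetical Private Endpoint subnet
    # Private DNS zone linked to the VNet: hostname maps to the PE address.
    print(resolves_privately("10.0.3.4", pe_subnet))
    # Zone missing or not linked: hostname falls back to a public address.
    print(resolves_privately("20.50.1.1", pe_subnet))
```

A public address in the result usually points to a missing or unlinked Private DNS zone rather than a problem with the Private Endpoint itself.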

Important Points

Private Endpoint = Private Link + Private DNS zone
Production healthcare → Private Endpoints on all PaaS
Don't confuse Service Endpoint with Private Endpoint

Likely Follow-Up Questions

How does AKS pod resolve Key Vault DNS with Private Endpoint?
Can you use both public and private on same storage account?
What is Private Link?

13. Which monitoring tools have you used?

Senior Answer (8–9 Years)

"We use a three-pillar observability stack on Azure — metrics, logs, and traces:

Metrics — Prometheus + Grafana:

Prometheus scrapes AKS pod/node metrics (CPU, memory, restarts, HTTP rates)

Grafana dashboards per team/service with alert rules — e.g., pod restart count, latency P95, node disk pressure

Alerts route to email/Teams; on-call checks dashboard and logs

Logs — Elasticsearch + Loki:

Loki for Kubernetes pod logs; Elasticsearch for search and log parsing

Track application events, errors, and audit trails for gov compliance

Traces — OpenTelemetry:

OTel SDK in microservices exports traces end-to-end

Identify latency bottlenecks across service dependencies (API → DB → cache)

Azure-native (platform layer):

Azure Monitor + Log Analytics for platform logs and KQL queries

Container Insights for AKS node/pod health

Application Insights for APM where enabled

Azure Service Health — rule out regional outages first

At HHS: When an alert fires → check Grafana → drill into Loki/Elasticsearch logs → OTel trace for latency → kubectl if it is a Kubernetes issue. MTTR improved by correlating all three pillars.

Innovation (if asked): We piloted a multi-agent LLM integration (Gemini/OpenAI API) that sends alert context + logs to the model and returns RCA hints and remediation steps — assists engineers but does not replace human approval for prod changes."

Important Points

Tool	Your use
Prometheus	Metrics collection
Grafana	Dashboards + alerts
OpenTelemetry	Distributed traces, latency
LLM multi-agent	RCA + remediation advice from logs

14. How do you troubleshoot production issues?

Senior Answer (8–9 Years)

"Production issues are alert-driven. Grafana/Azure Monitor alerts fire for pod problems — CrashLoopBackOff, ImagePullBackOff, Evicted, Pending — or application-down alerts.

My 6-step process:

Triage — Check priority (P0/P1/P2), blast radius, how many users/states affected

Correlate — Azure Service Health, recent deployments, config/Key Vault changes

Observe — Grafana dashboards → Loki/Log Analytics logs → OpenTelemetry traces

Isolate — kubectl describe pod, kubectl get events, kubectl logs --previous

Mitigate first — rollback Helm release, fix image tag, scale nodes — restore service before root cause

Document — Update ticket with RCA, close after verification, post-mortem for P0/P1

Example — ImagePullBackOff on AKS:

kubectl describe pod → check Events section

Common causes: (1) wrong image tag not in ACR (2) AcrPull role missing on Managed Identity (3) auth to private ACR failed (4) PE/DNS issue

Fix: verify tag in ACR, check AcrPull RBAC, verify Workload Identity / service account, re-deploy

Example — App-to-app connectivity failure:

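kubectl exec → test the service DNS name with nslookup and curl (not the IP, since IPs change; ping is unreliable because ClusterIP services generally do not answer ICMP)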

If network OK → check environment variables, ConfigMaps, Secrets from Key Vault CSI

Check namespace ResourceQuota — pod may be Pending if quota exceeded

For gov healthcare: Communicate status to client on P0/P1 every 15–30 min until resolved."

🎫 Incident Priority Guide (P0 / P1 / P2 / P3) — Learn This

Companies vary slightly — this is the standard enterprise model interviewers expect:

Priority	Also called	Meaning	Response time	Who acts	Example (HHS healthcare)
P0	Sev-1 / Critical	Complete outage — production down, no workaround	Immediate (< 15 min)	All hands, incident commander, war room	All state portals down, PHI system unreachable
P1	Sev-2 / High	Major degradation — core feature broken, large user impact	< 30–60 min	On-call + platform lead + dev lead	One state can't submit claims; 50% error rate
P2	Sev-3 / Medium	Partial impact — workaround exists, limited users	< 4 hours	On-call engineer	Single microservice down, other states OK
P3	Sev-4 / Low	Minor — cosmetic, dev/test only, no prod impact	Next business day	Assign to backlog	Grafana dashboard broken, non-prod pod crash
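
The priority table can be turned into a simple decision function for practice. This is one possible reading of the table, not a standard: the four yes/no inputs and the function name classify are assumptions, and real triage also weighs compliance and client impact.

```python
RESPONSE_TIME = {
    "P0": "< 15 min",
    "P1": "< 30–60 min",
    "P2": "< 4 hours",
    "P3": "next business day",
}


def classify(prod_impact: bool, full_outage: bool,
             workaround: bool, wide_impact: bool) -> str:
    """Map incident facts to a priority following the table above."""
    if not prod_impact:
        return "P3"            # dev/test or cosmetic only
    if full_outage and not workaround:
        return "P0"            # production down, no workaround
    if wide_impact:
        return "P1"            # core feature broken for many users
    return "P2"                # partial impact, limited users


if __name__ == "__main__":
    # Single microservice down, other states OK.
    priority = classify(prod_impact=True, full_outage=False,
                        workaround=True, wide_impact=False)
    print(priority, RESPONSE_TIME[priority])
```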

15. Which databases have you worked with?

Senior Answer (8–9 Years)

"I've worked with several databases across our projects:

Database	Use case	Environment pattern
MongoDB	Primary app datastore (HHS)	Self-hosted in Dev/QA (cost savings); MongoDB Atlas (or managed) in Production
PostgreSQL	Operational/relational data	Azure Database for PostgreSQL Flexible Server; used for prod→stage data restore pipelines
Azure SQL	Microsoft SQL workloads	Transactional apps, HA with zone redundancy
Cosmos DB	NoSQL / globally distributed	Low-latency, multi-region scenarios

Environment strategy: Lower environments use a lighter, cheaper DB (self-hosted MongoDB on AKS/VM); production uses fully managed PaaS with HA. To control cost, we do not replicate production-grade infrastructure in Dev/QA.

HA & DR (platform layer — my responsibility):

Enable zone-redundant HA on Azure PostgreSQL/SQL where required

Geo-redundant backup and point-in-time restore (PITR) on PaaS databases

Scheduled logical backups (pg_dump / mongodump) to Blob Storage for extra protection and cross-region recovery

Private Endpoints for DB connectivity from AKS — no public exposure

I'm not a full-time DBA — I own provisioning, connectivity, backup, HA config, and restore pipelines; DBAs/dev teams own query tuning."

16. What activities do you perform in database administration?

Senior Answer (8–9 Years)

"We had a dedicated DBA team, but as a platform engineer I owned the infrastructure and operational side of databases:

Provisioning & configuration:

Created and configured PostgreSQL Flexible Server, Cosmos DB, and MySQL via Terraform

Sized instances based on workload — vCores, storage, HA mode (zone-redundant)

Enabled Private Endpoints, disabled public access, configured Private DNS Zones

Connectivity & identity:

Troubleshot connection issues between AKS pods and databases

Set up Managed Identity and service accounts for apps to authenticate without passwords in code

Stored connection strings in Key Vault; apps pull via Key Vault CSI driver

Backup & environment sync:

Built Azure DevOps pipelines to dump non-critical prod data and restore to Dev/Stage — so developers work with realistic data while masking sensitive fields

Enabled native PITR on PaaS DBs; additional pg_dump/mongodump to Blob for pipeline-driven restores

Performance & cost (with DBA team):

Monitored vCore/DTU utilization, connection count, slow query logs in Azure Portal

Right-sized instances based on metrics — scale up/down for cost optimization

Escalated query tuning and indexing to DBAs — my role stops at platform layer

Boundary: I own infra, access, backup, connectivity, sizing — DBAs own schema, queries, indexes."
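
The masking step in the prod-to-stage restore pipeline can be sketched as a deterministic keyed hash, so masked values stay consistent across tables and joins still work. This is an illustration under assumptions: the column names, the key handling, and the mask_row helper are hypothetical, and a real pipeline would load the key from Key Vault and follow the DBA team's masking rules.

```python
import hashlib
import hmac
from typing import Any, Dict, Iterable


def mask_row(row: Dict[str, Any], sensitive: Iterable[str], key: bytes) -> Dict[str, Any]:
    """Return a copy of `row` with sensitive columns replaced by keyed hashes."""
    masked = dict(row)  # copy so the source row is not mutated
    for column in sensitive:
        value = masked.get(column)
        if value is None:
            continue
        digest = hmac.new(key, str(value).encode("utf-8"), hashlib.sha256).hexdigest()
        masked[column] = f"masked-{digest[:12]}"
    return masked


if __name__ == "__main__":
    key = b"demo-only-key"  # placeholder; never hard-code a real key
    row = {"id": 42, "ssn": "123-45-6789", "state": "FL"}
    print(mask_row(row, ["ssn"], key))
```

A keyed hash is used instead of a plain hash because low-entropy values such as identifiers can be brute-forced from an unkeyed digest.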

17. Have you handled production incidents?

Senior Answer (8–9 Years)

"Yes — regularly. At HHS, I was on-call for a healthcare platform serving US state governments. Incidents are alert-driven and prioritized P0/P1/P2/P3.

My incident process:

Triage — severity, blast radius, users affected

War room — Teams channel for P0/P1, assign incident commander

Mitigate first — rollback, scale, fix config — restore service before deep RCA

Investigate — Grafana → logs → OTel traces → kubectl describe/logs

Communicate — client status updates every 15–30 min on P0/P1

Post-mortem — RCA document, action items, alert tuning

Real example — ImagePullBackOff (P1):

Situation: After Terraform apply, multiple pods entered ImagePullBackOff — microservices couldn't pull from ACR

Action: kubectl describe pod → missing AcrPull role on Managed Identity after identity recreation → fixed RBAC in Terraform → redeployed

Result: Restored in ~10 minutes; added validation for MI permissions post-Terraform changes

Other incidents handled: CrashLoopBackOff (bad config/secret), pod eviction (resource pressure), connectivity failures after Key Vault rotation, slow AKS apps (DB/ingress bottlenecks).

Follow-the-sun: Onshore US + offshore India — handoff notes in ticket, clear ownership per incident."

Important Points

Mitigate first, RCA later — senior signal
Use real STAR example (ImagePullBackOff)
Gov healthcare → client communication on P0/P1

Likely Follow-Up Questions

Walk me through your ImagePullBackOff incident (STAR)
P1 vs P2 — how do you decide?
What goes in a post-mortem?

18. Have you coordinated with clients and cross-functional teams?

Senior Answer (8–9 Years)

"Yes — extensively. At HHS, we support US state government clients (Florida, Iowa, South Dakota) with a follow-the-sun model — onshore US team + offshore India team.

Cross-functional coordination:

Onshore (US): client-facing, business context, production access, stakeholder updates

Offshore (India): platform engineering, AKS ops, CI/CD, Terraform, incident troubleshooting

Developers: bug identification, app logs, env var / config fixes

DBA team: database performance, schema, data masking

Security: VA/PT remediation, compliance for gov healthcare

How we work:

Join Teams war rooms for P0/P1 — even if not our component, we contribute until resolved

Status updates every 15 minutes during critical incidents

I have led incident bridge calls — coordinate onshore + offshore + dev

For new infrastructure or cost optimization — we propose, onshore validates with client

Release windows coordinated across time zones — US team approves prod changes

Example: During the ImagePullBackOff P1, I led troubleshooting on the bridge call. Offshore fixed the Terraform RBAC while onshore kept the client updated until service was restored in about 10 minutes.

Strong rapport built over years — clear communication is as important as technical skills in gov healthcare."

SECTION 2: DEEP-DIVE TECHNICAL QUESTIONS

19. Explain the architecture of your current Azure environment.

Senior Answer (8–9 Years)

"Overview: HHS healthcare platform with 30+ microservices on private AKS, provisioned via Terraform, serving US state government clients.

Traffic flow:
User (Browser → URL)
  → Azure DNS
  → Azure Front Door (global routing to nearest healthy region)
  → Application Gateway (WAF + SSL)
  → AGIC (Application Gateway Ingress Controller)
  → Ingress → Services → Pods
  → Databases (PostgreSQL / MongoDB)

Networking: Everything in a private VNet with dedicated subnets: AKS nodes, App Gateway, and Private Endpoints sit in separate subnets. No public exposure on AKS. Azure Bastion for secure engineer access (no public SSH/RDP).

Integrations: ACR + Key Vault via Managed Identity and Workload Identity Federation. Private Endpoints on PaaS services. Azure DevOps CI/CD → ACR → Helm → AKS.

Observability: Prometheus, Grafana, OpenTelemetry, Loki/Elasticsearch.

Hub-spoke: Our current project uses a single VNet architecture (not hub-spoke). I understand hub-spoke conceptually — hub centralizes firewall/VPN/Bastion; spokes connect via VNet peering; on-premises via ExpressRoute/VPN to hub. Ready to implement if required."

20. How do you troubleshoot a VM that is suddenly unreachable?

Senior Answer (8–9 Years)

"I follow a layered checklist — platform first, then network, then OS:

1. Azure platform (Portal):

VM power state — running or stopped?

Resource Health + Azure Service Health — region outage?

Activity Log — recent NSG change, patch, reboot?

2. DNS (if using hostname):

nslookup — does FQDN resolve to correct public IP?

Verify Azure DNS A-record matches assigned IP

3. Networking:

NSG rules — inbound 22/3389 or app port allowed?

ASG (Application Security Group) rules if used

Azure Firewall rules if traffic routes through firewall

Effective routes — traffic going where expected?

Public IP assigned and associated to NIC?

Network Watcher → Connection Troubleshoot / IP Flow Verify

4. Access VM (when network looks OK):

Azure Bastion — connect without public IP

Serial Console + Boot Diagnostics — check boot logs if SSH/RDP fails (disk full, fstab error, kernel panic)

VMAccess extension — reset SSH key or local admin password

5. Metrics (Azure Monitor):

CPU, memory, disk space (>90% can cause failures), network I/O

High CPU or full disk often explains "unreachable" symptoms

6. Recovery:

Restore from Azure Backup snapshot

Attach OS disk to rescue VM if corrupted

Redeploy VM (moves it to a new host; OS and data disks are kept, the temporary disk is lost)

Root cause examples I've seen: NSG rule change, disk full, failed patch, wrong DNS mapping, expired SSH key."

21. How would you perform DR testing in Azure?

Senior Answer (8–9 Years)

"DR testing validates our RTO/RPO commitments — we run both tabletop exercises and technical failover tests on a scheduled cadence (quarterly/annually per compliance).

1. Plan & scope:

Define test scenario — regional outage, datacenter failure, ransomware recovery

Document expected RTO (how fast we recover) and RPO (how much data we can lose)

Notify stakeholders — test window, expected impact (ideally non-prod first)

2. Azure Site Recovery (ASR) — VM DR test:

Run Test Failover (non-destructive) — VMs spin up in DR region on isolated network

Validate application starts, connectivity, data consistency

Cleanup test failover — does not affect production replication

Document actual RTO vs target

3. Azure Backup restore test:

Restore a VM or managed disk from Recovery Services Vault to isolated resource group

Verify data integrity and restore time

Test Key Vault backup/restore for secrets recovery

4. AKS DR test (Velero):

Restore namespace from Velero backup in Blob to same or DR cluster

Validate pods, services, ingress come up

Test StatefulSet PV restore from disk-level Azure Backup

5. Infrastructure DR test (Terraform):

terraform apply in DR region from version-controlled code

Validates IaC can recreate full environment

6. Traffic cutover test:

Front Door / DNS failover to DR region endpoint

Validate end-to-end user flow

7. Post-test:

Document results, gaps, action items

Update runbooks with actual timings

Report to compliance/client for gov healthcare audit trail

At HHS: Combined ASR test failover + Velero namespace restore + Terraform DR apply — full stack validation."
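
The "actual RTO vs target" comparison above can be sketched in a few lines. This is a minimal illustration, not a tool from the project: the `DrTestResult` fields and the `evaluate_dr_test` helper are hypothetical names, and the timestamps would come from whatever the test log records.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DrTestResult:
    outage_declared: datetime      # when the simulated outage started
    service_restored: datetime     # when the end-to-end user flow passed again
    last_recovery_point: datetime  # newest backup/replica point that was restored


def evaluate_dr_test(result: DrTestResult, rto_target: timedelta, rpo_target: timedelta) -> dict:
    """Compare measured recovery time and data-loss window against the targets."""
    actual_rto = result.service_restored - result.outage_declared
    actual_rpo = result.outage_declared - result.last_recovery_point
    return {
        "actual_rto": actual_rto,
        "actual_rpo": actual_rpo,
        "rto_met": actual_rto <= rto_target,
        "rpo_met": actual_rpo <= rpo_target,
    }
```

With a 4-hour RTO and a 15-minute RPO target, a test that restored service in 3 hours from a 20-minute-old recovery point meets RTO and misses RPO, which is exactly the kind of gap the post-test report should capture.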

Important Points

Test type	Tool	Non-destructive?
VM failover	ASR Test Failover	Yes — isolated network
K8s restore	Velero	Yes — restore to test cluster
Disk/VM restore	Azure Backup	Yes — isolated RG
Infra recreate	Terraform	Yes — DR region/subscription

Likely Follow-Up Questions

Difference between ASR test failover and committed failover?
What is RTO vs RPO?
How do you DR test AKS without affecting production?

22. Explain the complete AKS architecture.

Senior Answer (8–9 Years)

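1. Control Plane (Microsoft-managed; no charge on the Free tier, while the Standard tier is paid and adds an uptime SLA):

API Server: all kubectl/Helm commands go here; it is the front end to cluster state, which is persisted in etcd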

etcd — distributed key-value store for cluster state

Scheduler — assigns pods to nodes

Controller Manager — maintains desired state (replicas, endpoints)

We never manage these — Azure handles HA, patching, upgrades

2. Node Pools (we manage — VM Scale Sets):

System node pool — runs kube-system pods (CoreDNS, metrics-server, monitoring agents). Keep stable; don't run app workloads here

User node pools — segregated by workload:

Normal workloads — standard VM SKUs

Spot node pool — cheaper, fault-tolerant batch/CI jobs

GPU node pool — ML/GPU workloads (if needed)

Each node runs: kubelet, kube-proxy, containerd (container runtime)

3. DaemonSets on every node:

CNI plugin (Azure CNI) — assigns pod IPs from VNet subnet; VNet integration

CSI drivers — connect pods to Azure services:

Disk CSI → Azure Managed Disks (PVs)

Key Vault CSI → secrets as volumes

ACR pull via Managed Identity (not CSI but identity layer)

4. Networking & Ingress:

Private AKS in VNet subnets — no public API or nodes

AGIC (Application Gateway Ingress Controller): watches Ingress resources and programs Application Gateway, which then routes external traffic directly to pod IPs

Network policies — restrict pod-to-pod traffic

5. Identity & integrations:

Managed Identity on cluster + Workload Identity Federation for pods

ACR image pull, Key Vault secrets — no credentials in code

6. Add-ons we run: Prometheus/Grafana, Velero, AGIC, Key Vault CSI, OpenTelemetry collector"

23. How do you perform AKS cluster upgrades?

Senior Answer (8–9 Years)

Pre-upgrade (planning):

Review Kubernetes release notes — deprecated APIs, breaking changes

Check compatibility: Helm charts, Argo CD, ingress (AGIC), monitoring addons

Run kubectl convert / API deprecation checks on manifests

Confirm application runtimes support target K8s version — coordinate with developers

Upgrade non-prod first; wait 1–2 weeks before production

Backup before upgrade:

Velero — namespace-level backup to Blob Storage

Disk snapshots — PVs / StatefulSets

Terraform (IaC) ready — can spin up new cluster in same or DR region if catastrophic failure

Drift detection pipeline — runs frequently; if infra drift detected, reconcile Terraform code before upgrade

Upgrade execution (order matters):

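Control plane first: az aks upgrade --control-plane-only --kubernetes-version x.y.z (without --control-plane-only the command also upgrades every node pool); the API server may be briefly unavailable

Node pools one by one (az aks nodepool upgrade per pool), never all at once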

Check PodDisruptionBudgets (PDB) and replica counts before drain

Use surge settings (max surge 33%): surge node added → cordon → drain old node → old node reimaged → verify pods healthy

Only proceed to next node pool when all pods are running

Add-ons / plugins last — AGIC, Key Vault CSI, Prometheus, Velero — upgrade to compatible versions

Post-upgrade validation:

kubectl get nodes — all Ready on new version

Smoke tests, Grafana dashboards clean, no CrashLoopBackOff

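Rollback plan: the control plane cannot be downgraded, so rollback means a new node pool on the previous K8s version (within the supported version skew) + migrating workloads, or rebuilding the cluster from Terraform and restoring with Velero

AKS supports the latest three GA minor versions (N-2): upgrade within the support window before a version goes end-of-life."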
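
To make the surge setting concrete, here is a small sketch of how many extra nodes a pool gets during an upgrade. It assumes percentage values round up to a whole node (my reading of the AKS behaviour, worth verifying against current docs); `surge_node_count` is a hypothetical helper, not an Azure API.

```python
def surge_node_count(pool_size: int, max_surge: str) -> int:
    """Extra nodes added during an upgrade for a max-surge of '33%' or '2'."""
    if max_surge.endswith("%"):
        pct = int(max_surge[:-1])
        # Integer ceiling of pool_size * pct / 100 (assumed round-up rule),
        # with at least one surge node so the upgrade can make progress.
        return max(1, (pool_size * pct + 99) // 100)
    return int(max_surge)
```

Under that assumption a 10-node pool at 33% upgrades with 4 surge nodes at a time, which is also the extra vCPU quota and subnet IP headroom to check before starting.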

24. How do you troubleshoot pods in CrashLoopBackOff state?

Senior Answer (8–9 Years)

Alert fires in Grafana — pod restart count or CrashLoopBackOff state.

Step 1 — Describe pod (always first):
kubectl describe pod <pod-name> -n <namespace>
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
Check Events section — exit code, OOMKilled, probe failure, CreateContainerConfigError

Step 2 — Previous container logs (if describe unclear):
kubectl logs <pod-name> -n <namespace> --previous
--previous is critical — current container may not have logs yet; previous shows why last crash happened

Step 3 — Identify root cause and fix:

Cause	What to check	Fix
App crash	`--previous` logs, stack trace	Fix code/config, rollback Helm release
Missing env vars / secrets	describe → CreateContainerConfigError	ConfigMap, Secret, Key Vault CSI mount
OOMKilled	describe → `OOMKilled`, `kubectl top pod`	Increase memory limits, fix memory leak
Liveness probe too aggressive	describe → probe failures	Increase `initialDelaySeconds`, `failureThreshold`
Wrong image / tag	describe → ErrImagePull (different from CrashLoop)	Fix image tag in deployment
Init container failure	describe → init container status	`kubectl logs <pod> -c <init-container>`
Permission / securityContext	describe → permission denied	Fix `runAsUser`, `fsGroup`, volume mounts

Step 4 — Mitigate: helm rollback or fix manifest → redeploy → verify pod Running in Grafana

Real example (Q17): ImagePullBackOff was auth issue — different state but same describe/logs workflow."
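
Two details behind the describe output are easy to sketch: what the container exit code says, and why restarts get slower over time. The helpers below are illustrative only. The signal table is a small hand-picked subset, and the back-off values assume the kubelet default of a 10-second delay that doubles up to a 5-minute cap.

```python
# Hand-picked subset of Linux signals commonly seen in container exits.
SIGNAL_NAMES = {9: "SIGKILL", 11: "SIGSEGV", 15: "SIGTERM"}


def describe_exit_code(code: int) -> str:
    """Give a first-pass reading of a container exit code."""
    if code == 0:
        return "exited cleanly: check the command and probes, the process should keep running"
    if code > 128:
        sig = code - 128  # shells report 128 + signal number
        return f"killed by {SIGNAL_NAMES.get(sig, f'signal {sig}')}"
    return "application error: read the --previous logs"


def crashloop_delays(restarts: int, base: int = 10, cap: int = 300) -> list[int]:
    """Seconds waited before each restart: doubles from base until it hits cap."""
    return [min(base * 2 ** i, cap) for i in range(restarts)]
```

Exit code 137 maps to SIGKILL, which together with the OOMKilled reason in describe points at memory limits rather than application logic.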

25. Explain HPA and Cluster Autoscaler.

Senior Answer (8–9 Years)

HPA (Horizontal Pod Autoscaler):

Scales number of pods in a Deployment/StatefulSet horizontally

Metrics Server (or Prometheus adapter) collects CPU/memory metrics

HPA compares average utilization (measured against pod resource requests) with the target, e.g., 70% CPU

Scales up (add pods) or down (remove pods) automatically

Scope: single workload (one Deployment)

Cluster Autoscaler:

Scales number of nodes in a node pool

Triggers when pods are Pending — cannot be scheduled because no node has enough CPU/memory

Adds nodes to node pool; removes underutilized nodes when pods can be rescheduled elsewhere

Scope: entire node pool

How they work together:
Traffic spike → HPA adds more pods
    → Pods go Pending (no node capacity)
    → Cluster Autoscaler adds nodes
    → Pods scheduled → service handles load
Bonus: KEDA for event-driven scaling — e.g., Azure Service Bus queue depth, HTTP rate — beyond CPU/memory."
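
The HPA decision can be written out as a short calculation. This sketch follows the documented Kubernetes formula, desired = ceil(current × currentMetric / targetMetric), with the default 10% tolerance; the function name and parameters are illustrative, and real HPAs add stabilization windows and scaling policies that are left out here.

```python
import math


def desired_replicas(current: int, current_util: float, target_util: float,
                     min_replicas: int, max_replicas: int,
                     tolerance: float = 0.1) -> int:
    """Replica count the HPA would aim for, clamped to min/max."""
    ratio = current_util / target_util
    if abs(ratio - 1.0) <= tolerance:
        return current  # close enough to target: no scaling action
    return max(min_replicas, min(max_replicas, math.ceil(current * ratio)))
```

For example, 4 pods averaging 90% CPU against a 70% target gives ceil(4 × 90 / 70) = 6 pods. If those 2 extra pods do not fit on existing nodes they go Pending, and that is the point where Cluster Autoscaler takes over.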

Important Points

HPA scales pods; Cluster Autoscaler scales nodes — different layers
HPA uses Metrics Server (CPU/memory) or Prometheus adapter / KEDA for custom metrics
Cluster Autoscaler triggers on Pending pods (unschedulable), not node CPU utilization graphs
Flow: traffic ↑ → HPA adds pods → pods Pending → CA adds nodes
Configure minReplicas/maxReplicas (HPA) and min-count/max-count (node pool + CA)
KEDA — event-driven scale (Azure Service Bus queue depth, HTTP rate)

Likely Follow-Up Questions

Follow-up	Answer
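HPA not scaling?	Metrics Server missing, wrong apiVersion, resource requests not set on pods (utilization is measured against requests)
CA not adding nodes?	Pod requests larger than any node in the pool, max node count reached, CA not enabled on pool, subnet IPs or vCPU quota exhausted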
HPA vs VPA?	HPA = horizontal (more pods); VPA = vertical (bigger pod resources)

26. A production application is running slow in AKS. How will you troubleshoot it?

Senior Answer (8–9 Years)

1. Triage — define scope:

All users or one state? All APIs or one endpoint? When did it start? Recent deployment?

2. Observability:

Grafana dashboards — latency P95/P99, error rate, pod restart spikes

kubectl top pods — CPU/memory per pod; throttling?

kubectl top nodes — node pressure?

Namespace + pod events: kubectl get events -n <ns> --sort-by='.lastTimestamp'

Pod logs: kubectl logs <pod> -f — errors, timeouts

OpenTelemetry traces — which dependency is slow? (API → DB → cache)

3. Layer by layer:

Layer	Check	Fix
Ingress	App Gateway / Front Door latency, 5xx	WAF rule, backend pool health
Pods	CPU/mem high → resource issue	Scale HPA, increase limits, add nodes (Cluster Autoscaler)
App	Logs — exceptions, thread pool exhaustion	Rollback deploy, fix code
Database	PostgreSQL vCore/connections, slow queries	Scale DB, connection pool, escalate to DBA
Cache	Redis hit ratio, evictions	Increase Redis memory
Network	Private Endpoint DNS latency	Fix DNS, NSG
External API	OTel dependency span slow	Contact vendor, add timeout/circuit breaker

4. Mitigate first: Scale HPA, add nodes, rollback bad Helm release — restore SLA, RCA after.

5. Your quick path (valid for interview start):
Grafana alert → kubectl top pod → events → logs
  → CPU/mem high? → HPA scale / increase limits
  → DB slow in traces? → check PostgreSQL metrics
  → Recent deploy? → helm rollback

Important Points

Start with scope (who, when, which API) and correlate with deploy
Grafana + kubectl top + events + logs + OpenTelemetry traces
Slowness is often database or ingress — not always pod CPU
Mitigate first: HPA scale, helm rollback, DB scale — then RCA

Likely Follow-Up Questions

Follow-up	Answer
How find slow dependency?	OpenTelemetry trace waterfall — longest span
CPU normal but slow?	Check DB connections, external APIs, thread pool

27. Explain how VNets, Subnets, NSGs, and Route Tables work together.

Senior Answer (8–9 Years)

VNet: Top-level network boundary (e.g., 10.0.0.0/16)

Subnet: Segments within VNet (e.g., 10.0.1.0/24 for web, 10.0.2.0/24 for data). Resources get IPs from subnet range.

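NSG: Attached to subnet or NIC; filters traffic (allow/deny) based on 5-tuple rules. Rules are evaluated by priority, lowest number first, and the first match wins.

Route Table (UDR): Attached to subnet; controls where traffic is routed by overriding default system routes with custom ones (force tunnel through Firewall, route to NVA). BGP routes learned from ExpressRoute/VPN also appear in the effective routes; the most specific prefix wins, and on a tie UDR beats BGP, which beats system routes.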

Flow example: Pod in AKS → subnet route table sends egress to Azure Firewall → NSG on subnet allows outbound 443 → Firewall applies application rules → internet.

Effective routes + effective security rules in Portal show the actual result of all combined rules.
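
NSG evaluation can be sketched as a first-match scan in priority order. This is a simplified model for intuition only: `NsgRule` and `evaluate_nsg` are hypothetical names, only source CIDR, destination port and protocol are matched (not the full 5-tuple), and the final "Deny" stands in for the built-in DenyAllInbound default rule.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network
from typing import Optional


@dataclass
class NsgRule:
    priority: int             # lower number is evaluated first
    action: str               # "Allow" or "Deny"
    source_cidr: str
    dest_port: Optional[int]  # None means any port
    protocol: str             # "Tcp", "Udp", or "*"


def evaluate_nsg(rules, src_ip: str, dest_port: int, protocol: str) -> str:
    """Return the action of the first matching rule in priority order."""
    for rule in sorted(rules, key=lambda r: r.priority):
        if (
            ip_address(src_ip) in ip_network(rule.source_cidr)
            and rule.dest_port in (None, dest_port)
            and rule.protocol in ("*", protocol)
        ):
            return rule.action
    return "Deny"  # simplified stand-in for the default deny rule
```

An Allow at priority 100 for a Bastion subnet beats a Deny at priority 200 for 0.0.0.0/0 on the same port, which is why a new low-numbered rule is the first thing to look for in the Activity Log when access suddenly changes.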

Important Points

VNet = address space; Subnet = segment; resources get IPs from subnet
NSG = allow/deny traffic (L4); attached to subnet or NIC; rules by priority (5-tuple)
Route Table (UDR) = where packets go; overrides system routes
NAT Gateway / Firewall = outbound internet for private resources (SNAT / inspection)
NSG filters — Route Table routes — do not confuse the two
Evaluate Effective routes + Effective security rules in Portal

Inbound: Browser → DNS → Front Door → App Gateway (configured by AGIC from Ingress resources) → Pod IP

Outbound: Pod → Route Table → NSG → Firewall/NAT Gateway → Internet (NOT through AGIC)
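
How the outbound next hop is chosen can be sketched as longest-prefix match with a tie-break by route source (user-defined, then BGP, then system). The tuple layout and the `select_route` name are assumptions made for this example, and the addresses are placeholders.

```python
from ipaddress import ip_address, ip_network

# Tie-break when two routes share the same prefix: UDR > BGP > system.
SOURCE_RANK = {"user": 0, "bgp": 1, "system": 2}


def select_route(routes, dest_ip: str):
    """routes: list of (prefix, source, next_hop). Returns the winning tuple or None."""
    dest = ip_address(dest_ip)
    candidates = [r for r in routes if dest in ip_network(r[0])]
    if not candidates:
        return None
    # Most specific prefix first, then route source rank.
    return min(candidates, key=lambda r: (-ip_network(r[0]).prefixlen, SOURCE_RANK[r[1]]))
```

With a UDR for 0.0.0.0/0 pointing at the firewall, internet-bound traffic goes to the firewall, while traffic to another subnet in 10.0.0.0/16 still follows the more specific system route and stays inside the VNet.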

Likely Follow-Up Questions

Follow-up	Answer
NSG vs Firewall?	NSG = distributed L4; Firewall = centralized L3–L7 + threat intel
AKS pod egress?	Azure CNI assigns pod IPs — UDR on pod subnet controls egress path

28. Explain ExpressRoute and VPN connectivity.

Senior Answer (8–9 Years)

Aspect	Site-to-Site VPN	ExpressRoute
Path	Encrypted over public internet (IPsec/IKE)	Private dedicated connection via connectivity provider
Bandwidth	SKU-dependent: about 650 Mbps (VpnGw1) up to 10 Gbps aggregate (VpnGw5)	50 Mbps to 10 Gbps via provider; up to 100 Gbps with ExpressRoute Direct
Latency	Variable	Consistent, lower
Cost	Lower	Higher (provider + port fees)
Use case	Branch offices, dev/test, backup	Enterprise production, compliance, high throughput

"In hybrid setups I often use ExpressRoute as primary and VPN as backup (ExpressRoute + VPN coexistence). BGP propagates on-premises routes into Azure VNet. For hub-spoke, route tables in spokes point to hub NVA/Firewall for inspection."

Important Points

VPN (Site-to-Site): Public internet + IPsec/IKE tunnel → VPN Gateway on VNet
ExpressRoute: Private dedicated circuit via connectivity provider → ExpressRoute Gateway on VNet
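VPN Gateway and ExpressRoute Gateway are different gateway types (VPN Gateway ≠ ExpressRoute Gateway); for coexistence both are deployed in the same GatewaySubnet
VPN: lower cost, lower bandwidth (SKU-dependent, up to ~10 Gbps aggregate), good for branch/dev/failover
ExpressRoute: higher cost, 50 Mbps–10 Gbps via provider (100 Gbps with ExpressRoute Direct), consistent latency, compliance/production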
Together: ExpressRoute primary + VPN backup if ExpressRoute fails
BGP exchanges routes between on-prem and Azure automatically

Likely Follow-Up Questions

Follow-up	Answer
ExpressRoute vs VPN for healthcare gov?	ExpressRoute for prod/compliance; VPN as backup
What is BGP?	Dynamic route exchange — on-prem networks advertised to Azure and vice versa

29. Explain your Azure DevOps pipeline architecture.

Senior Answer (8–9 Years)

Our Azure DevOps architecture has CI and CD separated.

CI: Trigger on merge to develop → checkout → GitLeaks → Trivy FS → build → unit tests → SonarQube quality gate → Docker build → Trivy image scan → push to ACR with tag (buildId + git SHA).

CD: Promote same image tag through Dev (auto) → QA (tests) → Prod (manual approval). Deploy to AKS via Helm. Service connections use Workload Identity Federation — no secrets in YAML.

Argo CD handles GitOps where Helm values/manifests in Git auto-sync to cluster. Terraform runs in a separate pipeline for infrastructure.
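
The "same image tag through every stage" rule can be enforced with a small guard before the Prod stage. The sketch below assumes a tag format of `<buildId>-<7-char git SHA>`, which is one way to combine the two values mentioned above; `image_tag`, `is_promotable` and the registry name in the example are hypothetical.

```python
import re


def image_tag(build_id: str, git_sha: str) -> str:
    """Build an immutable tag from the pipeline build ID and the short commit SHA."""
    return f"{build_id}-{git_sha[:7]}"


def is_promotable(image_ref: str) -> bool:
    """Allow only refs pinned to a build-specific tag; reject 'latest' and untagged refs."""
    last_segment = image_ref.rsplit("/", 1)[-1]
    if ":" not in last_segment:
        return False  # an untagged ref resolves to "latest"
    tag = last_segment.split(":", 1)[1]
    return re.fullmatch(r"\d+-[0-9a-f]{7}", tag) is not None
```

A check like this would reject `myacr.azurecr.io/claims-api:latest` and accept `myacr.azurecr.io/claims-api:4521-9f2c1ab`, so what reaches Prod is provably the artifact that passed QA.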

Important Points

CI = build, test, scan, publish image | CD = deploy per environment
Never deploy latest to Prod — promote same tested ACR tag
GitLeaks (secrets) → Trivy (vulns) → SonarQube (quality gate) = DevSecOps chain
Helm = pipeline deploys | Argo CD = Git changes auto-sync (GitOps)
Terraform = separate IaC pipeline (plan on PR, apply on merge)
Workload Identity Federation for service connections — no PATs/secrets in YAML

Likely Follow-Up Questions

Follow-up	Answer
SonarQube gate fails?	Pipeline stops — fix code or get approved exception
Helm vs Argo CD?	Helm = pipeline pushes; Argo CD watches Git repo and reconciles cluster state
How secure ACR push?	Service connection + Managed Identity / Workload Identity Federation

30. Explain Blue-Green and Canary deployments.

Senior Answer (8–9 Years)

Blue-Green: Two identical environments — Blue = live, Green = new version. On deploy, switch all traffic instantly from Blue to Green. If issues → switch back to Blue immediately. Fast rollback, but 2x cost during deploy.

Canary: Same cluster, two versions running. Route traffic gradually by percentage (5% → 25% → 50% → 100%) to the new version. Monitor metrics/errors. If problems → stop canary (only small % affected). On AKS: Ingress weights, Flagger, or Argo Rollouts.
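
The canary progression above boils down to a loop with an abort condition. This is a sketch of the control logic only, not how Flagger or Argo Rollouts are configured: `run_canary` is a hypothetical function, and `error_rate_at` stands in for whatever metric query returns the canary's observed error rate at a given traffic weight.

```python
def run_canary(steps, error_rate_at, max_error_rate: float) -> int:
    """Walk through traffic weights; return the final canary weight (0 if aborted)."""
    for weight in steps:
        observed = error_rate_at(weight)
        if observed > max_error_rate:
            return 0  # abort: all traffic goes back to the stable version
    return steps[-1]
```

Called with steps `[5, 25, 50, 100]` and a 1% error budget, a release that starts failing at 25% is rolled back after only a quarter of users ever reached it.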

Rolling update (Kubernetes Deployment default, so also the default on AKS): Replace pods incrementally. Control with maxSurge / maxUnavailable. PDB ensures minimum availability. Rollback: kubectl rollout undo or helm rollback.

Important Points

Blue-Green = instant switch (not split traffic)
Canary = percentage gradual (not two full environments)
Rolling = default K8s strategy — pod-by-pod replacement
Blue-Green rollback = switch traffic back | Rolling = rollout undo | Canary = set canary weight to 0%

31. How do you perform production rollbacks?

Senior Answer (8–9 Years)

Application rollback (AKS):
helm rollback <release> <revision> -n <namespace>
# or
kubectl rollout undo deployment/<name> -n <namespace>
Pipeline rollback: Redeploy last known-good build from ACR (immutable tags by git SHA, never latest in prod).

Database: Backward-compatible migrations only; use expand-contract pattern; rollback app before rolling back schema.

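Infrastructure: revert the Terraform code to the last known-good commit (git revert), then run terraform plan + apply through the pipeline; state lives in the remote backend, not in version control; never manual portal changes.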

Process: Rollback is pre-documented in runbook; triggered when error rate > threshold for 5 min post-deploy; incident channel notified; post-mortem required."
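
The rollback trigger ("error rate above threshold for 5 minutes") can be sketched as a sustained-breach check. It assumes one error-rate sample per minute after the deploy; `should_rollback` is a hypothetical helper, and in practice the same rule would live in an alert rule's "for" duration rather than in application code.

```python
def should_rollback(error_rates, threshold: float, sustained_samples: int = 5) -> bool:
    """True once the error rate stays above threshold for N consecutive samples."""
    streak = 0
    for rate in error_rates:
        streak = streak + 1 if rate > threshold else 0
        if streak >= sustained_samples:
            return True
    return False
```

Requiring consecutive samples keeps a single noisy minute from triggering a rollback, while a genuine regression still trips the rule within the agreed window.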

32. Which monitoring tools have you integrated?

Senior Answer (8–9 Years)

"We run a three-pillar observability stack on AKS — metrics, logs, and traces — integrated through Grafana as the single pane of glass.

Metrics — Prometheus + Grafana:

Prometheus scrapes AKS node and pod metrics — CPU, memory, restarts, HTTP error rates

Grafana dashboards per service with alert rules for pod failures — CrashLoopBackOff, ImagePullBackOff, high restart count, node disk pressure

Alerts route to Teams/email for on-call

Logs — Loki + Promtail:

Promtail ships pod logs from every node to Loki

Grafana queries Loki for log drill-down when an alert fires

Traces — OpenTelemetry:

OTel SDK in microservices exports distributed traces

Identify latency bottlenecks across dependencies (API → DB → cache)

Azure-native (platform layer):

Azure Monitor + Log Analytics for platform and audit logs

Container Insights for AKS cluster health

Application Insights for APM where enabled

Azure Service Health — rule out regional outages first

Integration workflow at HHS: Alert fires in Grafana → check dashboard → drill Loki logs → OTel trace for latency → kubectl if Kubernetes issue. Correlating all three pillars reduced MTTR.

CI/CD: Azure DevOps deployment markers correlated with metric spikes post-release."

Important Points

Pillar	Tool	Integration
Metrics	Prometheus → Grafana	Scrape pods/nodes; alert rules → Teams
Logs	Promtail → Loki → Grafana	Log drill-down from alert
Traces	OpenTelemetry → Grafana/backend	End-to-end latency
Platform	Azure Monitor, Container Insights	AKS/node health, KQL queries

Likely Follow-Up Questions

How do Prometheus and Grafana connect? (ServiceMonitor / scrape config)
What happens when an alert fires? (dashboard → logs → traces → kubectl)
Difference between Container Insights and Prometheus?
How do you avoid alert fatigue?

33. What alerts do you usually configure in production environments?

Senior Answer (8–9 Years)

"In production we configure category-based alerts — each with a linked runbook. We deliberately reduced alert count so every alert is actionable — no noise.

Category	Examples	Severity
Availability	CrashLoopBackOff, OOMKilled, ImagePullBackOff, Pending, HTTP 5xx rate	P1 / Sev 1
Performance	P95 latency > 2s, CPU > 85%, memory > 90%	P2 / Sev 2
Capacity	Disk > 80%, node pool near max, DB connection pool exhaustion	P2
Security	Key Vault access anomalies, failed login spikes, Defender for Cloud alerts	P1–P2
Backup/DR	Backup job failed, ASR replication lag	P2
Certificate	TLS cert expiry < 30 days	P2
Cost	Budget threshold 80%/100%	P3

Alerts route to Teams/on-call. On fire → follow runbook → Grafana → logs → mitigate. We tune thresholds post-incident to keep false positives low."

Important Points

Every alert = runbook — no alert without an action
Availability = P1 — pod failures and 5xx directly impact users
Backup job failed ≠ CI/CD pipeline failure (say the right one in interview)
Tune regularly — alert fatigue kills on-call effectiveness

Likely Follow-Up Questions

How do you reduce alert fatigue?
What's P1 vs P2 in your org?
Example runbook for CrashLoopBackOff?

34. How do you monitor MongoDB performance?

Senior Answer (8–9 Years)

"Production uses MongoDB Atlas. We rely on Atlas-native monitoring — Performance Advisor, slow query logs, and Real-Time Performance Panel — plus alerts on connections, replication lag, and opcounters.

Key metrics we watch:

CPU, memory, disk usage

Active connections and connection pool exhaustion from apps

Replication lag (replica set health)

Slow queries and missing indexes — look for COLLSCAN (collection scans, not table scans — MongoDB uses collections)

Query execution times and index usage

Integration: MongoDB exporter → Prometheus → Grafana dashboards for unified observability alongside AKS metrics.

Lower environments: Same exporter pipeline; we also use mongostat / explain() on slow queries for troubleshooting.

Red flags: COLLSCAN on large collections, rising connection count, replication lag > threshold, working set exceeding RAM."

Important Points

Term	Correct
MongoDB data unit	Collections (not tables)
Missing index symptom	COLLSCAN in query plan
Your stack	Atlas + MongoDB exporter → Prometheus → Grafana

Likely Follow-Up Questions

What is COLLSCAN and how do you fix it?
How do you find slow queries in Atlas?
Connection pool exhaustion — what causes it?

35. Explain Redis replication and failover.

Senior Answer (8–9 Years)

"Redis replication creates an asynchronous copy of the primary for high availability and read scaling. The primary handles writes; replicas sync via PSYNC.

Lower environments — self-managed Redis:

Master + replica topology with Redis Sentinel

Sentinel monitors the primary; on failure it elects and promotes a replica, then apps reconnect to the new primary

Production — Azure Cache for Redis Premium:

Primary-replica replication with automatic failover (Azure-managed; Standard and Premium both include a replica, Premium adds zone redundancy and VNet/persistence features)

Zone redundancy for HA across availability zones

Private Endpoint — AKS pods connect over the VNet; Private DNS zone resolves the Redis hostname inside the cluster network

Monitoring: used_memory, evicted_keys, connected_clients, replication lag

App side: Connection multiplexer with retry/reconnect logic so failover is transparent to the application."
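
The app-side retry/reconnect behaviour can be illustrated with a generic exponential back-off wrapper. This is a sketch under assumptions, not the configuration of any specific Redis client (real clients usually ship their own retry policies): `call_with_retry` and its parameters are hypothetical, and `ConnectionError` stands in for whatever exception the client raises during failover.

```python
import time


def call_with_retry(operation, attempts: int = 5, base_delay: float = 0.1,
                    max_delay: float = 2.0, sleep=time.sleep):
    """Run operation(); on ConnectionError wait base_delay * 2^n (capped) and retry."""
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            sleep(min(base_delay * 2 ** attempt, max_delay))
```

A failover that completes within a few seconds is then absorbed by the retries, and only a longer outage surfaces as an error. Adding jitter to the delay would help avoid every pod reconnecting at the same instant.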

Important Points

Environment	Setup	Failover
Lower env	Self-hosted + Sentinel	Sentinel promotes replica
Production	Azure Redis Premium + zone redundancy	Azure automatic failover
Network	Private Endpoint + Private DNS zone	Pods in AKS VNet only

Likely Follow-Up Questions

Sync vs async replication — what can you lose on failover?
Difference between Sentinel and Redis Cluster?
How does Private Endpoint DNS work for Redis from AKS?

36. Explain your experience with Cloud Run, BigQuery, or Airflow.

Senior Answer (8–9 Years)

"My primary cloud is Azure, so most of my production work is AKS, Azure DevOps, and Azure data services.

Cloud Run and BigQuery: I don't have deep hands-on production experience with these. I understand the concepts — Cloud Run is serverless containers (similar to Azure Container Apps), and BigQuery is a serverless analytics warehouse (similar to Azure Synapse / serverless SQL patterns).

Airflow — I have practical experience:

Used to author, schedule, and monitor workflows

Workflows are defined as DAGs (Directed Acyclic Graphs) in Python code

Each DAG has tasks with dependencies — upstream must succeed before downstream runs

Supports retries, scheduling (cron), and a UI to track run status

Pipelines are version-controlled in Git — scalable and repeatable for ETL/batch jobs

The orchestration concepts transfer: Airflow DAG patterns map closely to Azure Data Factory pipelines and activity dependencies."
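
The DAG idea (upstream tasks must succeed before downstream tasks run) can be shown without Airflow itself. The sketch below uses only the Python standard library's `graphlib`, so it is a model of dependency ordering, not Airflow's API; `execution_order` and the task names are made up for the example.

```python
from graphlib import CycleError, TopologicalSorter


def execution_order(dependencies: dict[str, set[str]]) -> list[str]:
    """dependencies maps each task to the set of upstream tasks it waits for."""
    # static_order() raises CycleError if the graph is not acyclic.
    return list(TopologicalSorter(dependencies).static_order())
```

For `{"load": {"transform"}, "transform": {"extract"}, "extract": set()}` the order is extract, transform, load. A circular dependency raises `CycleError`, which is the "acyclic" part of Directed Acyclic Graph.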

Important Points

GCP service	If no hands-on	Azure equivalent
Cloud Run	Serverless containers	Container Apps
BigQuery	Analytics warehouse	Synapse / analytics SQL
Airflow	DAG-based orchestration	Data Factory pipeline patterns

Interview rule: Be honest on gaps; show you know equivalents and own Airflow clearly.

Likely Follow-Up Questions

What is a DAG in Airflow?
How do task dependencies work?
Airflow vs Azure Data Factory?

37. How do you manage monitoring across Azure and GCP?

Senior Answer (8–9 Years)

"My production depth is Azure, but the multi-cloud pattern we follow — or would follow — is one observability stack, not two.

Unified OSS stack on both clouds:

Deploy the same Prometheus + Grafana + Loki (and ELK where needed) on AKS and GKE

Single pane of glass — Grafana dashboards with mixed data sources from both clouds

Standardized instrumentation:

OpenTelemetry in all microservices — self-hosted collector or export to Application Insights on Azure; same OTel pattern on GCP for trace correlation across services

Platform layer per cloud:

Azure: Azure Monitor, Container Insights, Cost Management — resources tagged (environment, team, cost-center) for segregation

GCP: Cloud Monitoring + Logging forwarded to the same central sink (or scraped into Prometheus)

Operations: One on-call rotation, one runbook, one escalation policy (Teams/PagerDuty) — avoid duplicating alert logic per cloud. Cloud-specific dashboards only where the platform differs.

Cost: Azure Cost Management + GCP Billing export — unified FinOps view via tags."

Important Points

Principle	How
Same stack both clouds	Prometheus/Grafana/Loki on AKS + GKE
Trace correlation	OpenTelemetry everywhere
No duplicate alerts	One runbook, one on-call
Segregation	Consistent tags on Azure AND GCP

Likely Follow-Up Questions

How do you correlate traces across Azure and GCP services?
Grafana mixed data sources — how configured?
How avoid alert fatigue across two clouds?

38. How do you perform security hardening in Azure?

Senior Answer (8–9 Years)

"We apply layered security across Azure — identity, network, compute, data, governance, secrets, and CI/CD.

Identity: Enforce MFA, Conditional Access, PIM for admin roles — no standing privileged access, no shared accounts.

Network: WAF on App Gateway/Front Door, Azure Firewall for outbound control, NSGs on subnets, Private Endpoints for PaaS services, DDoS Protection on public-facing resources. No public IPs on production workloads where possible.

Compute / AKS: Private AKS cluster, ACR with trusted images only, Defender for Cloud image scanning, Pod Security Standards, Workload Identity / Managed Identity — no static credentials in pods.

Secrets: Key Vault for secrets, certs, and keys — accessed via Managed Identity, never in code or repos.

Data: Encryption at rest (CMK via Key Vault), TLS 1.2+ enforced, soft delete on storage and Key Vault.

Governance: Azure Policy — deny public storage, require tags, enforce HTTPS, block non-compliant resources at deploy time.

CI/CD (shift-left): GitLeaks, Trivy image scanning, SonarQube in Azure DevOps pipeline before anything reaches prod.

Monitoring: Defender for Cloud secure score, alerts on misconfigurations and anomalies.

Everything is codified in Terraform so security baseline is repeatable and drift is caught early."

Important Points

Layer	Your HHS examples
Identity	MFA, Conditional Access, PIM
Network	WAF, Firewall, NSG, Private Endpoints
Compute	Private AKS, ACR, Defender scanning
Secrets	Key Vault + Managed Identity
Governance	Azure Policy
CI/CD	GitLeaks, Trivy, SonarQube

Likely Follow-Up Questions

What is PIM and why use it?
Private Endpoint vs Service Endpoint?
How does Workload Identity work with Key Vault?

39. Have you handled VA/PT findings? Give examples.

Senior Answer (8–9 Years)

"Yes. I participated in vulnerability scanning and penetration testing activities and owned infra-side remediation from scan reports.

Application / image scanning:

Trivy in CI pipeline flagged CVEs in container images → we updated base images and rebuilt pipelines

OWASP ZAP integrated in Jenkins pipeline for DAST on applications → shared findings with dev team for code fixes

Azure infrastructure findings and remediation:

Soft delete not enabled on critical storage/Key Vault → enabled soft delete + purge protection

NSG ports open to 0.0.0.0/0 → restricted to known IP ranges / Bastion, Azure Policy to deny wide-open rules

Public endpoints where Private Endpoints were required → migrated PaaS to Private Endpoint + Private DNS

TLS 1.0/1.1 on App Gateway → enforced TLS 1.2 minimum SSL policy, rescanned clean

AKS secrets in plain Kubernetes Secrets (base64, not encrypted) → migrated to Key Vault CSI driver + Workload Identity

Overprivileged users / service principals → scoped Azure RBAC to least privilege, custom roles instead of Contributor

Process: Triage by severity/CVSS → assign owner → remediate → rescan/retest → close ticket. Critical findings within agreed SLA."

Important Points

Term	Azure interview
IAM	Say Azure RBAC (not IAM — that's AWS)
K8s Secrets	Base64 ≠ encryption — use Key Vault CSI
ZAP	OWASP ZAP — DAST tool
Your tools	Trivy (images), ZAP (apps), config review (infra)

Likely Follow-Up Questions

Difference between VA scan and PT?
How do you prioritize findings?
What is Key Vault CSI vs Kubernetes Secrets?

40. How do you secure AKS clusters?

Senior Answer (8–9 Years)

"Cluster access:

Private AKS cluster — API server inside the VNet, not publicly accessible

Microsoft Entra ID integration with Azure RBAC for Kubernetes — authorized access only, no shared kubeconfig

Network:

Network Policies — deny all by default, allow only required pod-to-pod traffic

WAF on Azure Front Door for ingress traffic to workloads behind the cluster

Identity & secrets:

Workload Identity Federation + Managed Identity — pods access Key Vault via CSI driver, no static credentials or plain Kubernetes Secrets

Images & supply chain:

ACR as trusted registry only; Defender for Cloud image scanning + Trivy in CI pipeline

Governance:

Azure Policy add-on for AKS: enforce no privileged containers, required labels, allowed image registries (ACR only)

Pod Security Admission (restricted/baseline)

Operations:

Regular Kubernetes version upgrades and CVE patching

Audit logs to Log Analytics"

Important Points

Your term	Interview polish
Entra ID	Correct — say Microsoft Entra ID + Azure RBAC for K8s
Federated identity	Workload Identity Federation
Policy engine	Azure Policy add-on for AKS
WAF + Front Door	Valid — ingress layer for apps on AKS

Likely Follow-Up Questions

Private cluster — how do you run kubectl/CI/CD?
Workload Identity vs Managed Identity?
Network Policy example — deny all ingress?

41. How do you identify cloud cost leakages?

Senior Answer (8–9 Years)

"I use Azure Cost Management, Azure Monitor, and Azure Advisor to identify where money is being wasted.

How I find leakages:

Cost Analysis by resource and tag — spot spikes and unattached resources

Advisor recommendations — right-sizing VMs and databases, idle resources

Hunt for orphaned disks, snapshots, NICs, and public IPs still billing

Review non-prod running 24/7 — VMs and AKS node pools with no auto-scale-down

Check storage tiers — Premium disks or Hot blob where Cool/Archive fits

Remediation / prevention:

Right-size VMs and DB SKUs per Advisor

Remove orphan resources and extra snapshots

Cluster Autoscaler on AKS — scale nodes to demand, avoid idle capacity

VM auto start/stop schedules for off-hours (dev/test)

Blob lifecycle policies — move logs/backups to Cool/Archive

Reserved Instances / Savings Plans for predictable steady workloads

Tags via Azure Policy for cost allocation and showback

Cost leakage is usually orphaned resources and wrong SKU/tier — monthly FinOps review with Advisor + Cost Management."
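The orphan hunt described above can be sketched with a few Azure CLI queries. This is an illustrative sketch under assumptions: the function names and the `RUN_REPORT` switch are hypothetical, and the queries run against whichever subscription is currently selected after `az login`.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Managed disks with no owning VM (managedBy is null) still bill for capacity.
list_orphaned_disks() {
  az disk list --query '[?managedBy==`null`].{name:name, rg:resourceGroup}' -o table
}

# Public IPs that are not bound to any IP configuration.
list_unused_public_ips() {
  az network public-ip list --query '[?ipConfiguration==`null`].{name:name, rg:resourceGroup}' -o table
}

# NICs that are not attached to a VM.
list_unattached_nics() {
  az network nic list --query '[?virtualMachine==`null`].{name:name, rg:resourceGroup}' -o table
}

# Hypothetical entry point: print one report section per resource type.
orphan_report() {
  echo "== Unattached disks ==";  list_orphaned_disks
  echo "== Unused public IPs =="; list_unused_public_ips
  echo "== Unattached NICs ==";   list_unattached_nics
}

# Run only when requested; needs a prior `az login`.
if [ "${RUN_REPORT:-0}" = "1" ]; then
  orphan_report
fi
```

NICs that belong to private endpoints also show no VM, so the output is a review list rather than a delete list.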

Important Points

Identify (find waste)	Prevent (fix waste)
Cost Analysis, Advisor	Right-sizing, cleanup
Orphan disks/snapshots/IPs	Delete + Policy guardrails
Idle non-prod 24/7	Auto-shutdown, Cluster Autoscaler
Wrong storage tier	Lifecycle policies

Likely Follow-Up Questions

Difference between RI and Savings Plans? (Q43)
How does Cluster Autoscaler save money?
What tags do you enforce for FinOps?

42. What cost optimization initiatives have you implemented?

Senior Answer (8–9 Years)

Reserved Instances / Savings Plans for stable AKS node pools and SQL — ~35% savings

Spot node pools for CI/CD and batch workloads — ~70% compute savings on those pools

Auto-shutdown schedules for dev/test VMs and scaled-down non-prod AKS at night

Storage lifecycle policies — moved logs/backups to Cool/Archive tier — ~40% storage reduction

Rightsizing via Azure Advisor recommendations — downsized 15 VMs

Tagging enforcement via Policy — enabled showback/chargeback per team

Consolidated Log Analytics workspaces to reduce ingestion duplication

43. Explain Reserved Instances and Savings Plans.

Senior Answer (8–9 Years)

Option	Commitment	Flexibility	Discount	Best For
Reserved Instances (RI)	1 or 3 year, specific VM family + region	Low — exchange possible for different SKU/region	Up to ~72%	Stable, predictable workloads (fixed SQL tier)
Savings Plans	1 or 3 year, $/hour spend commitment	High — change VM type, region, applies across compute	Up to ~65%	Mixed/evolving workloads (AKS node pools)

"Both are Azure cost-saving commitment options with a 1-year or 3-year term.

Reserved Instances lock to a VM family and region — up to 72% discount. You can exchange RIs if needs change, but flexibility is limited.

Savings Plans commit to an hourly dollar spend — up to 65% discount — and flex across VM types and regions. Better when SKUs change, like AKS node pool upgrades.

I analyze 30–90 day utilization in Cost Management before purchasing. Savings Plans are my default for AKS; RIs for steady SQL/Cosmos tiers. Combined with Spot pools for CI/batch, we reduce overall compute cost significantly."
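The utilization check before purchasing comes down to simple arithmetic, sketched below. It is a simplified model under stated assumptions: a month is taken as 730 hours, the discount percentages are the upper bounds quoted above, and the $0.50/hour rate is a made-up figure. Real discounts vary by SKU, region, and term.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Minimum utilisation (%) at which a commitment with the given discount (%)
# costs no more than pay-as-you-go. The committed price is paid for every
# hour, so it only wins once usage exceeds (100 - discount) percent of hours.
break_even_utilisation() {
  awk -v d="$1" 'BEGIN { printf "%.0f\n", 100 - d }'
}

# Monthly cost comparison for one VM.
# Args: pay-as-you-go hourly rate, discount %, hours actually used (of 730).
compare_monthly_cost() {
  awk -v rate="$1" -v d="$2" -v used="$3" 'BEGIN {
    payg      = rate * used                  # pay only for hours used
    committed = rate * (1 - d / 100) * 730   # pay for every hour in the month
    printf "payg=%.2f committed=%.2f\n", payg, committed
  }'
}

break_even_utilisation 72          # RI at the maximum quoted discount
break_even_utilisation 65          # Savings Plan at the maximum quoted discount
compare_monthly_cost 0.50 40 365   # hypothetical VM used half of the month
```

At a 72% discount the commitment breaks even at 28% utilization; at 65% it breaks even at 35%. In the last call the VM runs only half the month at a 40% discount, so pay-as-you-go (182.50) is cheaper than the commitment (219.00), which is the under-utilization risk interviewers tend to ask about.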

Important Points

Savings Plan = $/hour commit, flexible VM family
RI = specific SKU, higher discount, less flexible
AKS → prefer Savings Plans (node SKUs evolve)
Always check 30–90 day usage before committing

Likely Follow-Up Questions

Can you use RI and Savings Plan together?
What happens if you under-utilize the commitment?
Spot vs Savings Plan for AKS?

44. How do you optimize storage and idle resources?

Senior Answer (8–9 Years)

"Storage optimization:

Blob lifecycle policies — move logs/backups Hot → Cool → Archive automatically

Delete orphaned disks, snapshots, and old unwanted archives

Right redundancy tier — LRS for non-critical, ZRS for HA

Idle resource optimization:

Use Azure Monitor and Advisor to find unused VMs, disks, public IPs, NICs

Deallocate/remove idle VMs and associated disks

Auto-shutdown schedules for dev/test VMs off-hours

Scale down non-prod AKS node pools when not in use

Monthly runbook to clean orphaned storage and unattached resources"
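A blob lifecycle policy of the kind described above might look like the sketch below. The container prefix `logs/`, the rule name, the day thresholds, and the `APPLY`, `STORAGE_ACCOUNT`, and `RESOURCE_GROUP` variables are all assumptions for illustration, not values from the project.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Lifecycle rule: tier block blobs in a container named "logs" to Cool after
# 30 days and Archive after 90 days, then delete after 365 days.
# All thresholds are illustrative.
cat > lifecycle-policy.json <<'EOF'
{
  "rules": [
    {
      "enabled": true,
      "name": "tier-logs",
      "type": "Lifecycle",
      "definition": {
        "filters": { "blobTypes": ["blockBlob"], "prefixMatch": ["logs/"] },
        "actions": {
          "baseBlob": {
            "tierToCool":    { "daysAfterModificationGreaterThan": 30 },
            "tierToArchive": { "daysAfterModificationGreaterThan": 90 },
            "delete":        { "daysAfterModificationGreaterThan": 365 }
          }
        }
      }
    }
  ]
}
EOF

# Apply only when requested; both variables are placeholders to set first.
if [ "${APPLY:-0}" = "1" ]; then
  az storage account management-policy create \
    --account-name "$STORAGE_ACCOUNT" \
    --resource-group "$RESOURCE_GROUP" \
    --policy @lifecycle-policy.json
else
  echo "Wrote lifecycle-policy.json (set APPLY=1 to apply)"
fi
```

Archive is the cheapest tier to store but slow to read back, since blobs must be rehydrated before access, so it suits data that is rarely or never restored.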

Important Points

Monitor + Advisor = find waste
Lifecycle policies = prevent storage creep
Orphan cleanup = disks, snapshots, NICs, IPs

45. Database response time has increased significantly. What will be your approach?

Senior Answer (8–9 Years)

"I follow a structured top-down approach — confirm scope, check infra, then drill into the database.

1. Confirm & correlate:

When did it start? Recent deploy, config change, or Key Vault rotation?

Check Grafana/App Insights — latency spike on DB dependency span in OpenTelemetry traces

Rule out AKS/network — pod CPU, connection errors, Private Endpoint issues

2. Database metrics (platform layer):

Azure SQL: DTU/vCore utilization, connection count, deadlocks, wait stats; Cosmos DB: RU/s consumption and 429 throttling

MongoDB Atlas: Performance Advisor, slow query logs, opcounters, replication lag, connections

Postgres: active connections, lock waits, pg_stat_statements

3. Query analysis:

Identify slow queries — missing indexes, COLLSCAN, full table scans

Run explain() / Query Performance Insight

Check if app released a bad query or N+1 pattern after deploy

4. Resource & capacity:

CPU/memory/IO saturation on DB tier — need scale-up?

Connection pool exhaustion from AKS pods — too many replicas or leak?

Redis cache miss rate — is load hitting DB directly?

5. Mitigate & escalate:

Short-term: scale DB tier, kill long-running queries, enable read replica

Engage DBA team for schema/index fixes

Document RCA — index added, query optimized, pool size tuned

At HHS: Platform owns infra metrics and connectivity; DBA owns schema/query tuning — I bridge both with traces and connection troubleshooting."
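The connection pool question in step 4 is easy to check with arithmetic: worst-case connections are replicas × pool size, summed over every service that shares the database. The sketch below uses hypothetical replica counts, pool sizes, and a hypothetical 500-connection limit; the function names are made up for the example.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Worst-case connections one service can open: every replica fills its pool.
# Args: replica count, pool size per replica.
max_connections_for() {
  echo $(( $1 * $2 ))
}

# Compare total worst-case demand with the database connection limit.
# Args: db connection limit, then "replicas:pool_size" pairs, one per service.
check_pool_headroom() {
  local limit="$1"; shift
  local total=0 pair replicas pool
  for pair in "$@"; do
    replicas="${pair%%:*}"
    pool="${pair##*:}"
    total=$(( total + $(max_connections_for "$replicas" "$pool") ))
  done
  if [ "$total" -gt "$limit" ]; then
    echo "OVER: worst case $total connections exceeds limit $limit"
    return 1
  fi
  echo "OK: worst case $total connections within limit $limit"
}

# Hypothetical numbers: three services against a 500-connection limit.
check_pool_headroom 500 "6:20" "10:25" "4:30" || true
```

With these numbers the worst case is 490 connections, which leaves almost no headroom: one HPA scale-out event on any of the three services would push it past the limit.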

Important Points

Layer	Check
App/trace	OTel — is DB span the bottleneck?
Connectivity	Private Endpoint, NSG, connection string
DB metrics	DTU/vCore, connections, replication lag
Queries	Slow query log, missing indexes, COLLSCAN
Cache	Redis hit rate — bypass cache?

Likely Follow-Up Questions

How do you find slow queries in MongoDB Atlas?
Connection pool exhaustion — symptoms?
Scale-up vs scale-out for Azure SQL?

46. How do you prioritize multiple production issues?

Senior Answer (8–9 Years)

"I use an impact × urgency matrix:

Priority	Criteria	Example
P1	Full outage, revenue/security impact	Payment down, data breach
P2	Major degradation, workaround exists	Slow checkout, single region
P3	Minor impact, few users	Non-critical report failing
P4	Cosmetic / planned	UI alignment issue

Rules:

P1 always wins — all hands, incident commander assigned

If two P1s: rank by business impact (revenue > internal tools)

Communicate reprioritization to stakeholders

Don't context-switch — assign owners per incident

Post-incident: review if alerts should have fired earlier"

47. Describe a critical production issue that you resolved.

Senior Answer (8–9 Years)

"Situation: At HHS, after a Terraform apply that recreated cluster identity resources, multiple microservices entered ImagePullBackOff — pods could not pull container images from ACR. This was a P1 — production workloads for state government clients were failing to start. Grafana alert fired on pod restart count.

Task: As on-call platform engineer, I needed to restore image pull capability and get pods running — fast.

Action:

Joined Teams war room, confirmed no Azure regional outage (Service Health clean)

kubectl describe pod → Events showed 401/403 unauthorized pulling from ACR

Correlated with recent Terraform apply — Managed Identity was recreated but AcrPull role assignment was missing

Fixed RBAC in Terraform — re-applied AcrPull on the kubelet identity → ACR

Triggered pod restart / redeploy → verified images pulling and pods Running

Notified client team of status and resolution

Result:

Service restored in ~10 minutes

Added Terraform validation step to verify MI role assignments post-apply

Added Grafana alert for ImagePullBackOff with linked runbook

Documented RCA in ticket; shared learnings in post-mortem

Lesson: Infrastructure changes that touch identity must include RBAC validation — ACR pull depends on AcrPull role on the cluster's Managed Identity."
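A post-apply check for the missing role assignment could look like the sketch below. It is an assumption about how such a validation step might be written, not the step actually added at HHS: the function name `verify_acr_pull` and all three arguments are placeholders.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Check that the AKS kubelet identity holds AcrPull on the registry.
# Args: resource group, cluster name, ACR name (all placeholders).
verify_acr_pull() {
  local rg="$1" cluster="$2" acr="$3"
  local kubelet_id acr_id count

  # Object ID of the kubelet managed identity that pulls images.
  kubelet_id="$(az aks show -g "$rg" -n "$cluster" \
    --query identityProfile.kubeletidentity.objectId -o tsv)"
  acr_id="$(az acr show -n "$acr" --query id -o tsv)"

  # Count AcrPull assignments for that identity at the registry scope.
  count="$(az role assignment list --assignee "$kubelet_id" --scope "$acr_id" \
    --query "length([?roleDefinitionName=='AcrPull'])" -o tsv)"

  if [ "${count:-0}" -ge 1 ]; then
    echo "OK: AcrPull present for kubelet identity on $acr"
  else
    echo "MISSING: AcrPull not assigned for kubelet identity on $acr"
    return 1
  fi
}

# Example call in a post-apply pipeline step (placeholders, not real names):
#   verify_acr_pull <rg> <cluster> <acr>
```

The check looks only at assignments made directly on the registry; if AcrPull is granted at resource group or subscription level, adding `--include-inherited` to the list call would be needed. `az aks check-acr` is an alternative that tests the pull path end to end from the cluster.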

Important Points

Use your real story — ImagePullBackOff, not a generic outage
STAR format: Situation → Task → Action → Result
Root cause: Terraform + missing AcrPull — shows depth

Likely Follow-Up Questions

How did you diagnose it was ACR auth and not a bad image tag?
What is AcrPull role?
How would you prevent this in future?

SECTION 3: QUICK-FIRE CHEAT SHEET

Kubernetes Commands to Know

kubectl get pods -A -o wide
kubectl describe pod <name> -n <ns>
kubectl logs <pod> -n <ns> --previous
kubectl logs <pod> -n <ns> -f
kubectl top pods -A
kubectl top nodes
kubectl rollout status deployment/<name> -n <ns>
kubectl rollout undo deployment/<name> -n <ns>
kubectl exec -it <pod> -n <ns> -- /bin/sh
kubectl get events -n <ns> --sort-by='.lastTimestamp'

Useful KQL Snippets

// Failed requests last hour
requests | where timestamp > ago(1h) and success == false | summarize count() by resultCode, name

// Exceptions spike
exceptions | where timestamp > ago(1h) | summarize count() by type, outerMessage

// Pod restarts
KubePodInventory | where PodRestartCount > 0 | project Computer, Name, PodRestartCount

Azure CLI Essentials

az aks get-credentials --resource-group <rg> --name <cluster>
az vm list-usage --location eastus
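az backup recoverypoint list --resource-group <rg> --vault-name <vault> --container-name <vm> --item-name <vm> --backup-management-type AzureIaasVM
az network watcher test-connectivity --resource-group <rg> --source-resource <vm> --dest-address <ip> --dest-port <port>
az aks upgrade --resource-group <rg> --name <cluster> --kubernetes-version <version>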

SECTION 4: QUESTIONS TO ASK THE INTERVIEWER

What does the Azure landing zone / platform team structure look like?
What is the current split between IaaS, PaaS, and AKS workloads?
How mature are CI/CD and GitOps practices today?
What does the on-call rotation and incident management process look like?
Are there multi-cloud requirements (Azure + GCP/AWS)?
What are the top 3 platform challenges you're hiring this role to solve?
How does the team approach FinOps and cost accountability?
What does success look like in the first 90 days?

SECTION 5: PRE-INTERVIEW CHECKLIST

Customize all [PLACEHOLDER] and sample company references
Prepare 2–3 STAR stories (incident, cost optimization, migration)
Be ready to whiteboard hub-spoke + AKS architecture
Review Azure Service Health for any recent outages (shows awareness)
Have salary expectations and notice period ready
Test video/audio if virtual interview
Keep this doc open on second monitor for quick reference (don't read verbatim!)

SECTION 6: FOLLOW-UP QUESTION BANK (Senior Cross-Questions)

Questions interviewers ask after your main answer. Prepare 2–3 sentences each.

Q1–3: Intro / Architecture

Follow-up	Senior answer hint
Why Azure over AWS for this project?	Gov compliance, client requirement, AKS + Azure DevOps integration, existing Entra ID
What was your biggest architecture challenge?	Private AKS + PE for all PaaS + multi-state latency via Front Door
How many environments?	Dev, QA, Prod — separate namespaces/subscriptions, promotion via pipeline

Q4–7: Azure Services / AKS

Follow-up	Senior answer hint
Private vs public AKS cluster?	Private API server, private nodes, Bastion for admin, no public kube API
How do pods pull from ACR privately?	PE on ACR + Managed Identity AcrPull + Private DNS
Node pool failed to provision?	Quota, subnet full, NSG blocking, SKU unavailable — check provisioning state with `az aks nodepool show` and the Activity Log
AKS upgrade failed mid-way?	Check surge capacity and PDBs blocking cordon/drain, fix the blocker, re-run the upgrade; AKS does not support downgrading to the previous K8s version

Q8–10: Kubernetes / CI/CD

Follow-up	Senior answer hint
Deployment vs StatefulSet?	Deployment = stateless; StatefulSet = stable identity + PV for DB
How do you roll back a bad deploy?	`helm rollback` to the last good revision (previous immutable ACR tag), never deploy `latest` in prod
SonarQube quality gate failed — what now?	Fix code or get exception approval; never bypass gate in gov prod
GitLeaks found a secret?	Revoke key in Azure, rotate in Key Vault, rewrite git history or invalidate commit

Q11–12: Networking

Follow-up	Senior answer hint
NSG vs Azure Firewall?	NSG = L3/L4 subnet/NIC rules; Firewall = centralized L3–L7 app rules, IDPS
PE vs Service Endpoint?	PE = private IP in VNet; Service Endpoint = VNet routing, service still has public IP
DNS not resolving for PE?	Private DNS Zone not linked to VNet, or custom DNS server not forwarding to Azure DNS (168.63.129.16)

Q5–6: Backup / DR

Follow-up	Senior answer hint
Where are VM backups stored?	Recovery Services Vault — not Blob (Velero uses Blob)
RTO vs RPO?	RPO = max data loss window; RTO = max downtime; define per tier
Tested restore ever?	Yes — GitLab VM restore drill, Velero namespace restore to QA

Extra Senior Questions (Added — May Be Asked)

#	Question	Why it matters
E1	Explain Workload Identity Federation step-by-step	You mentioned OIDC — they will probe
E2	How does AGIC route traffic?	AGIC watches Ingress resources and programs App Gateway; traffic then flows App Gateway → Pod IPs directly
E3	Terraform state locking — how?	azurerm Blob backend, state locked via blob lease during plan/apply
E4	How do you secure pipeline service connections?	Workload Identity Federation, no secrets in YAML
E5	What Azure Policy do you apply on AKS?	No privileged containers, required labels, allowed images
E6	How do you handle secrets in Git?	GitLeaks in CI, Key Vault CSI in cluster, never commit
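E7	Difference between LRS, GRS, ZRS?	LRS = 3 copies in one datacenter; ZRS = copies across 3 zones in one region; GRS = LRS plus async copy to the paired region (applies to Storage and Backup vault)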
E8	How do you do zero-downtime deploy on AKS?	Rolling update (maxUnavailable: 0 with maxSurge) + readiness probes + PodDisruptionBudget

Practice Senior Answer sections aloud. Use Important Points as a checklist before each mock question.
