---
title: Cloud Engineer Interview Questions & Answers (2026)
description: The cloud engineer interview questions that get asked in 2026 — compute, storage, networking, IAM, IaC, observability, cost, and HA/DR architecture — with real answers.
url: https://usegreenroom.app/blog/cloud-engineer-interview-questions
last_updated: 2026-06-20
---

# Cloud engineer interview questions and answers

June 20, 2026 · 17 min read

![Cloud engineer interview questions and answers — cover from Greenroom, the AI mock interviewer](/assets/blog/cloud-engineer-interview-questions-hero.webp)

Cloud engineer interviews are mostly vendor-agnostic, even when they're framed around AWS, Azure, or GCP. The interviewer wants to know that you understand the underlying primitives (compute, storage, networking, identity), can automate infrastructure instead of clicking through a console, and can reason about availability, cost, and failure under pressure. This guide covers the **cloud engineer interview questions** that come up most often, organized by area, with a real answer for each one and a note on what it's testing. (See also our [AWS](/blog/aws-interview-questions) and [DevOps](/blog/devops-engineer-interview-questions) guides.)

## Core services & networking

### What are the main compute options, and when do you pick each one?

Virtual machines (EC2, Azure VMs, GCE) give you full control of an OS and are right when you need long-running processes, specific kernel/OS dependencies, or lift-and-shift of an existing workload. Containers (ECS/EKS, AKS, GKE) package an app with its dependencies for consistent behavior across environments and faster, more granular scaling than a full VM — the default choice for most modern services. Serverless functions (Lambda, Azure Functions, Cloud Functions) run code in response to events with no server management and scale to zero, ideal for spiky, short-lived workloads, but they have cold-start latency and execution time limits that make them a poor fit for long-running or steady-throughput jobs. The interview signal is whether you can match workload shape to compute model instead of defaulting to "containers for everything."

### What's the difference between object, block, and file storage?

Object storage (S3, Azure Blob Storage, Google Cloud Storage) stores immutable blobs with metadata, accessed over HTTP by key — ideal for static assets, backups, logs, and data lakes, but not mountable as a filesystem and not great for frequent small updates. Block storage (EBS, Azure Disks, Persistent Disk) presents raw, low-latency volumes attached to a single VM at a time, the right choice for database data directories and anything needing filesystem semantics and high IOPS. File storage (EFS, Azure Files, Filestore) is a shared, POSIX-compliant filesystem mountable by multiple instances simultaneously, useful for shared config, CMS uploads, or legacy apps that expect an NFS mount. Picking the wrong one shows up immediately in an interview: object storage for a database, or block storage for a multi-instance shared mount, are both wrong answers.

### Explain a VPC, subnets, and the difference between security groups and NACLs.

A VPC (Virtual Network in Azure terms, VPC in GCP) is an isolated, software-defined network you control inside a cloud provider — your own private address space. Subnets divide that VPC into smaller ranges, typically a public subnet (route to an internet gateway, for load balancers and bastion hosts) and private subnets (no direct internet route, for app servers and databases). Security groups are stateful, instance-level firewalls — allow a rule in one direction and the response traffic is automatically permitted — while network ACLs are stateless, subnet-level firewalls where you must explicitly allow both inbound and outbound traffic. Most real architectures rely on security groups for day-to-day access control and use NACLs sparingly, as a coarse-grained backstop at the subnet boundary.
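The stateful versus stateless distinction is easier to explain with a toy model. The sketch below is a simplification, not any provider's implementation: the class names `StatefulFirewall` and `StatelessAcl` are made up for illustration, only ports are checked, and the security group's own outbound rules are left out so that connection tracking is the only thing on show.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Packet:
    src: str
    dst: str
    src_port: int
    dst_port: int


@dataclass
class StatefulFirewall:
    """Security-group-like: replies to an allowed connection pass automatically."""
    allowed_inbound_ports: set
    _tracked: set = field(default_factory=set, repr=False)

    def inbound(self, p: Packet) -> bool:
        if p.dst_port in self.allowed_inbound_ports:
            # Remember the connection so its reply is allowed later.
            self._tracked.add((p.src, p.src_port, p.dst, p.dst_port))
            return True
        return False

    def outbound(self, p: Packet) -> bool:
        # A reply has the tracked connection's endpoints reversed.
        return (p.dst, p.dst_port, p.src, p.src_port) in self._tracked


@dataclass
class StatelessAcl:
    """NACL-like: each direction is judged on its own rules, with no memory."""
    allowed_inbound_ports: set
    allowed_outbound_ports: set

    def inbound(self, p: Packet) -> bool:
        return p.dst_port in self.allowed_inbound_ports

    def outbound(self, p: Packet) -> bool:
        return p.dst_port in self.allowed_outbound_ports
```

With an HTTPS request from a client's ephemeral port 50000, the stateful model lets the reply out with no extra rule, while the stateless one drops it until the ephemeral port range is explicitly allowed outbound. That is the usual reason a NACL "breaks" traffic that the security group permits.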

### Load balancing — explain Layer 4 vs Layer 7, and when you'd choose each.

A Layer 4 (network) load balancer routes based on IP and port without inspecting the request itself — it's fast, protocol-agnostic, and the right choice for raw TCP/UDP traffic or when you need extreme throughput with minimal latency overhead. A Layer 7 (application) load balancer reads the actual HTTP request — path, headers, host — and can route `/api/*` to one service and `/static/*` to another, terminate TLS, and do content-based routing, at the cost of slightly more processing per request. Most web applications use an L7 load balancer (ALB, Azure Application Gateway, GCP HTTP(S) Load Balancer) in front of their services specifically because path- and host-based routing is what lets you run multiple services behind one entry point.
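As a rough illustration of what each layer can "see", here is a minimal sketch. The pool names and addresses are hypothetical, and real load balancers use more sophisticated algorithms than a CRC hash; the point is only that the L7 function reads the request path while the L4 function has nothing but connection-level fields to work with.

```python
import zlib

# Hypothetical backend pools, for illustration only.
API_POOL = ["api-1:8080", "api-2:8080"]
STATIC_POOL = ["static-1:8080"]
DEFAULT_POOL = ["web-1:8080"]

# Ordered prefix rules, similar in spirit to listener rules on an L7 balancer.
L7_RULES = [("/api/", API_POOL), ("/static/", STATIC_POOL)]


def l7_pick_pool(path: str) -> list:
    """Layer 7: inspect the HTTP path and choose a pool by prefix."""
    for prefix, pool in L7_RULES:
        if path.startswith(prefix):
            return pool
    return DEFAULT_POOL


def l4_pick_backend(client_ip: str, client_port: int, pool: list) -> str:
    """Layer 4: only IP and port are visible, so hash them onto a backend."""
    key = f"{client_ip}:{client_port}".encode()
    return pool[zlib.crc32(key) % len(pool)]
```

Calling `l7_pick_pool("/api/users")` returns the API pool and `l7_pick_pool("/static/app.css")` returns the static pool, whereas `l4_pick_backend` gives the same backend for the same client address and port no matter what the request contains.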

### How does auto-scaling actually work, and what metrics drive it?

An auto-scaling group (or VM scale set / managed instance group) maintains a target number of healthy instances, scaling out when a metric crosses a threshold — typically CPU utilization, request count per target, or a custom CloudWatch/Monitor metric — and scaling back in when load drops, within configured min/max bounds. Health checks feed back into this: an instance that fails them gets terminated and replaced automatically, which is also how auto-scaling groups provide basic self-healing independent of load-driven scaling. The interview nuance worth mentioning: scaling policies need cooldown periods to avoid "flapping" — rapidly scaling out and back in when a metric oscillates right at the threshold.
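The threshold-plus-cooldown logic can be sketched in a few lines. This is an assumed, simplified step-scaling policy (one instance at a time, CPU only); the thresholds, the `ScalingPolicy` name, and the 300-second cooldown are illustrative defaults rather than any provider's values.

```python
from dataclasses import dataclass


@dataclass
class ScalingPolicy:
    min_size: int = 2
    max_size: int = 10
    scale_out_above: float = 70.0   # CPU percent
    scale_in_below: float = 30.0    # CPU percent
    cooldown_seconds: int = 300


def desired_capacity(policy, current, cpu_percent, now, last_scaled_at):
    """Return (new_capacity, new_last_scaled_at) for one evaluation tick."""
    # Inside the cooldown window, ignore the metric entirely to avoid flapping.
    if now - last_scaled_at < policy.cooldown_seconds:
        return current, last_scaled_at

    if cpu_percent > policy.scale_out_above:
        target = min(current + 1, policy.max_size)
    elif cpu_percent < policy.scale_in_below:
        target = max(current - 1, policy.min_size)
    else:
        target = current

    # Only restart the cooldown clock when capacity actually changed.
    if target != current:
        return target, now
    return current, last_scaled_at
```

A metric that bounces between 85% and 20% every minute would otherwise add and remove an instance on every tick. With the cooldown, the second decision is deferred until the first change has had time to take effect.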

### What does a CDN actually do, and how does DNS fit into the picture?

A CDN (CloudFront, Azure CDN, Cloud CDN, or Cloudflare) caches content at edge locations physically closer to users, cutting latency for static assets and reducing load on your origin servers. Increasingly it also caches dynamic API responses and absorbs DDoS traffic before it reaches your infrastructure. DNS is the layer that resolves a domain name to an IP address, but in cloud architectures it does more: weighted routing for gradual rollouts, latency-based routing to the nearest healthy region, and health-check-driven failover that points traffic away from an unhealthy endpoint automatically (Route 53, Azure Traffic Manager, Cloud DNS). A common follow-up is to explain why a CDN cache invalidation after a deploy can cause a brief spike in origin load. The short answer: invalidation empties the edge caches, so the next request for each object at each edge location is a cache miss that goes back to the origin until the caches warm up again. It's worth having that answer ready rather than just naming the service.
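Weighted routing with health-check failover reduces to "pick among healthy endpoints in proportion to weight". The sketch below assumes a simple list-of-dicts record format invented for this example; it is not how any DNS service stores or evaluates records.

```python
import random


def pick_endpoint(endpoints, rng=random):
    """Choose one endpoint name, weighted, skipping unhealthy ones.

    `endpoints` is a list of dicts with 'name', 'weight', and 'healthy' keys
    (a hypothetical shape used only for this illustration).
    """
    healthy = [e for e in endpoints if e["healthy"] and e["weight"] > 0]
    if not healthy:
        raise RuntimeError("no healthy endpoints to route to")
    weights = [e["weight"] for e in healthy]
    return rng.choices(healthy, weights=weights, k=1)[0]["name"]
```

With weights of 90 and 10, roughly one response in ten points at the second endpoint, which is the gradual-rollout case. If the first endpoint fails its health check, every response points at the second, which is the failover case.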

## Identity, IaC & automation

### Explain IAM, least privilege, and why it actually matters.

IAM systems (AWS IAM, Azure AD/Entra ID, Cloud IAM) separate identities (users, service accounts, roles) from policies (what those identities can do) so permissions are explicit and auditable rather than implicit. Least privilege means granting only the permissions an identity needs to do its job, nothing more: a service that reads from one S3 bucket should have a policy scoped to that bucket and the `s3:GetObject`/`s3:ListBucket` actions, not blanket `s3:*` access. A concrete failure mode interviewers like hearing about: a CI/CD deploy role given `AdministratorAccess` "to make the pipeline work" during a rushed setup, which later means a single leaked credential or a misconfigured pipeline step can delete production resources, rotate other users' keys, or exfiltrate data. The blast radius of over-permissioning is exactly the scenario least privilege exists to prevent. Roles (assumed temporarily, with rotating credentials) are preferred over long-lived user access keys for almost everything machines do.
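To make the bucket-scoped policy concrete, here is a sketch of an AWS-style policy document built as a Python dict, plus a small helper that flags wildcard grants. The bucket name is hypothetical, and `wildcard_findings` is an illustrative check written for this example, not a replacement for a real policy analyzer.

```python
import json

BUCKET = "example-reports-bucket"  # hypothetical bucket name

least_privilege_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ReadObjects",
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            # Object-level action, so the resource is the objects in the bucket.
            "Resource": f"arn:aws:s3:::{BUCKET}/*",
        },
        {
            "Sid": "ListBucket",
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            # Bucket-level action, so the resource is the bucket itself.
            "Resource": f"arn:aws:s3:::{BUCKET}",
        },
    ],
}


def _as_list(value):
    return value if isinstance(value, list) else [value]


def wildcard_findings(policy):
    """Return the Allow statements that grant '*' or 'service:*' too broadly."""
    findings = []
    for stmt in policy["Statement"]:
        if stmt.get("Effect") != "Allow":
            continue
        broad_actions = [a for a in _as_list(stmt["Action"])
                         if a == "*" or a.endswith(":*")]
        broad_resources = [r for r in _as_list(stmt["Resource"]) if r == "*"]
        if broad_actions or broad_resources:
            findings.append(stmt)
    return findings


if __name__ == "__main__":
    print(json.dumps(least_privilege_policy, indent=2))
```

The scoped policy produces no findings, while a statement with `"Action": "*"` and `"Resource": "*"` (the shape of an administrator grant) is flagged immediately.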

### Walk through Terraform vs CloudFormation vs Pulumi — and what a state file actually is.

CloudFormation is AWS-native and uses YAML/JSON templates with no separate state file: AWS tracks resource state for you, but you're locked to one cloud. Terraform is cloud-agnostic (AWS, Azure, GCP, and hundreds of other providers) and uses its own declarative HCL syntax, tracking the infrastructure it manages in a state file that maps your config to real resource IDs. Pulumi takes the same state-tracking approach as Terraform but lets you write infrastructure in general-purpose programming languages (TypeScript, Python, Go) instead of a DSL, which appeals to teams that want loops, functions, and tests around their infra code. The state file is the part interviewers probe hardest: it's how the tool knows what already exists, so a corrupted, lost, or out-of-sync state file is one of the most dangerous failure modes in IaC. That is why remote state with locking (an S3 backend with DynamoDB locking, or Terraform Cloud) is non-negotiable for any team beyond one person.

### What is idempotency and drift detection in infrastructure as code, and why do they matter together?

Idempotency means running the same IaC apply twice produces the same end state — applying a config that already matches reality is a safe no-op, not a duplicate resource or an error, which is what makes IaC safe to re-run in CI without manual checking first. Drift happens when the real infrastructure diverges from what's in code — someone manually changed a security group rule in the console, or a resource was deleted outside the pipeline — and a `terraform plan` (or equivalent) surfaces that drift by diffing actual state against desired state before anything is applied. The two concepts matter together because drift left unmanaged eventually causes a "surprise" apply that reverts someone's manual emergency fix, or worse, makes you afraid to run `apply` at all because you no longer trust what it will change — which defeats the entire point of IaC.
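Both ideas fit in one toy model: a plan is a diff between desired and actual state, and an idempotent apply is one whose follow-up plan is empty. The functions below are a deliberately simplified sketch with made-up resource names. Real tools also track dependencies, provider-specific semantics, and which resources they manage at all.

```python
import copy


def plan(desired: dict, actual: dict) -> dict:
    """Diff desired config against actual state, keyed by resource name."""
    return {
        "create": sorted(desired.keys() - actual.keys()),
        "update": sorted(
            name for name in desired.keys() & actual.keys()
            if desired[name] != actual[name]
        ),
        "delete": sorted(actual.keys() - desired.keys()),
    }


def apply_config(desired: dict) -> dict:
    """Converge on the desired config and return the resulting actual state."""
    return copy.deepcopy(desired)


def is_noop(changes: dict) -> bool:
    """True when a plan contains no creates, updates, or deletes."""
    return not any(changes.values())


# Drift: someone opened port 22 on the security group by hand.
desired = {"sg-web": {"ingress": [443]}, "web": {"instance_type": "t3.micro"}}
actual = {"sg-web": {"ingress": [443, 22]}, "web": {"instance_type": "t3.micro"}}

drift = plan(desired, actual)           # reports an update for "sg-web"
converged = apply_config(desired)
second_plan = plan(desired, converged)  # empty: re-applying changes nothing
```

The first plan is the drift report, and it also shows exactly which manual change an apply would revert. The empty second plan is what idempotency looks like in practice.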

Here's a minimal Terraform resource block defining an EC2 instance with a tag, the kind of snippet interviewers expect you to read fluently even if you can't recite every argument from memory:

```hcl
resource "aws_instance" "web" {
  ami           = "ami-0abcdef1234567890"
  instance_type = "t3.micro"
  subnet_id     = aws_subnet.public.id

  tags = {
    Name = "web-server"
    Env  = "production"
  }
}
```

### How does CI/CD work for infrastructure, as opposed to application code?

The pipeline stages mirror app CI/CD but with an extra safety gate: lint/validate the IaC syntax, run `plan` (or its CloudFormation/Pulumi equivalent) to compute exactly what would change, post that plan for human review on anything touching production, then `apply` only after explicit approval — the plan-then-apply split exists because infrastructure changes can be destructive in ways application deploys usually aren't (deleting a database, replacing a subnet). For the actual infrastructure being deployed to, blue-green deploys spin up a full parallel environment and cut traffic over once it's healthy, giving instant rollback by just cutting back; canary deploys shift a small percentage of traffic to the new version first and watch error rates and latency before continuing the rollout. Both patterns exist to convert "deploy and hope" into "deploy and verify before fully committing," which is the property interviewers are actually checking for when they ask this.
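The "verify before fully committing" step of a canary rollout is essentially a comparison of error rates. Here is a minimal sketch of such a gate; the function name, the 2x ratio, the minimum sample size, and the 0.1% floor are all assumptions chosen for illustration, and a real gate would usually look at latency as well.

```python
def canary_decision(baseline_errors, baseline_requests,
                    canary_errors, canary_requests,
                    max_ratio=2.0, min_requests=100, floor=0.001):
    """Return 'wait', 'promote', or 'rollback' for one canary evaluation."""
    # Too little canary traffic to judge: keep the current traffic split.
    if canary_requests < min_requests:
        return "wait"

    baseline_rate = (baseline_errors / baseline_requests
                     if baseline_requests else 0.0)
    canary_rate = canary_errors / canary_requests

    # The floor stops a zero-error baseline from failing the canary
    # on a single stray error.
    threshold = max(baseline_rate * max_ratio, floor)
    return "promote" if canary_rate <= threshold else "rollback"
```

With a baseline of 10 errors in 10,000 requests, a canary showing 1 error in 1,000 is promoted, while one showing 50 errors in 1,000 is rolled back before the remaining traffic ever reaches it.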

### How do you manage secrets and environment-specific configuration?

Secrets (API keys, database passwords, certificates) belong in a dedicated secrets manager — AWS Secrets Manager, Parameter Store, Azure Key Vault, or HashiCorp Vault — never in source control, never in plain environment variables baked into an image, and never in CI logs, which is a surprisingly common leak vector when a script accidentally echoes a variable. These tools support automatic rotation, fine-grained access policies, and audit trails of who read what secret and when, which a `.env` file checked into git fundamentally cannot give you. Non-secret configuration that legitimately varies by environment (API base URLs, feature flags, resource sizing) is better handled through environment-specific config injected at deploy time or a config service, keeping the same build artifact deployable to staging and production without rebuilding it — a property interviewers sometimes call "build once, deploy many."
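One way to picture "build once, deploy many" together with the CI-log leak risk: the artifact reads its settings at startup, and the secret field is kept out of anything that might be printed. In this sketch the variable names `API_BASE_URL` and `DB_PASSWORD` are hypothetical, and the assumption is that the platform injects them at runtime (for example from a secrets manager) rather than baking them into the image.

```python
import os
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Settings:
    api_base_url: str
    # repr=False keeps the secret out of repr(), so logging the settings
    # object cannot print the password by accident.
    db_password: str = field(repr=False)


def load_settings(env=os.environ) -> Settings:
    """Build settings from values injected at deploy time.

    Raises KeyError if a required value is missing, so a misconfigured
    environment fails at startup instead of at first use.
    """
    return Settings(
        api_base_url=env["API_BASE_URL"],
        db_password=env["DB_PASSWORD"],
    )
```

The same code runs unchanged in staging and production; only the injected values differ. Printing or logging the `Settings` object shows the base URL but not the password.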

## Observability & cost

### What are the three pillars of observability, and how do metrics, logs, and traces actually differ?

Metrics are numeric time-series data (CPU usage, request rate, error count) — cheap to store and great for dashboards and alerting thresholds, but they tell you *that* something is wrong, not *why*. Logs are timestamped, often unstructured event records that give you the detail to diagnose a specific failure, at the cost of being expensive to store and search at scale if not structured well. Traces follow a single request across every service it touches in a distributed system, showing exactly where latency or errors were introduced — essential once you have more than a couple of services, because a slow endpoint could be slow in any one of five downstream calls and only a trace tells you which. The honest answer interviewers want: you need all three working together, because each one alone leaves a gap the others fill.

### How do you design alerting that doesn't create noise — symptoms vs causes?

Alert on symptoms that affect users — elevated error rate, high latency, a failed health check — because those are the things that actually need a human to respond right now, regardless of root cause. Avoid alerting directly on every possible cause (CPU at 85%, one pod restarted, disk at 70%) because those fire constantly without necessarily affecting users, training people to ignore alerts entirely — the classic alert-fatigue failure mode where a real incident gets missed in a sea of noise. The follow-up interviewers like: causes are still useful, just as supporting context attached to a symptom-based alert (a dashboard link, recent deploy markers, related metrics) rather than as the trigger itself, so the on-call engineer gets the "why" immediately after being told the "what."
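A symptom-based rule can be sketched as "page only when users are affected, and carry the causes along as context". The thresholds and the `evaluate_symptoms` name below are illustrative assumptions, not recommended values.

```python
def evaluate_symptoms(requests, errors, p95_latency_ms, context,
                      max_error_rate=0.01, max_p95_ms=500.0):
    """Return an alert dict when a user-facing symptom is breached, else None.

    `context` holds cause-level signals (CPU, restarts, recent deploys).
    It never triggers the alert; it only travels with it.
    """
    symptoms = []
    error_rate = errors / requests if requests else 0.0
    if error_rate > max_error_rate:
        symptoms.append(f"error rate {error_rate:.1%} > {max_error_rate:.1%}")
    if p95_latency_ms > max_p95_ms:
        symptoms.append(
            f"p95 latency {p95_latency_ms:.0f} ms > {max_p95_ms:.0f} ms"
        )
    if not symptoms:
        return None
    return {"symptoms": symptoms, "context": context}
```

A host at 92% CPU with a healthy error rate and latency produces no page. The same CPU figure shows up as context once the error rate actually crosses the threshold.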

### How do you approach cost optimization on a cloud bill that's grown out of control?

Start by rightsizing: most accounts run instances far larger than their actual CPU and memory usage justifies, which you find by looking at utilization metrics over a representative time window rather than guessing. Move predictable, steady-state workloads to reserved instances or savings plans (up to ~70% cheaper than on-demand for a 1- or 3-year commitment) and interruption-tolerant batch or stateless workloads to spot instances (up to ~90% cheaper, but they can be reclaimed at short notice). Tier storage by access pattern, with hot data in standard storage and infrequently accessed data in cheaper tiers (S3 Infrequent Access/Glacier, Azure Cool/Archive), and set lifecycle policies so data ages into cheaper tiers automatically instead of relying on someone remembering to move it. Finally, idle resource cleanup catches the silent waste: unattached EBS volumes, unused Elastic IPs, forgotten dev/test environments left running over a weekend, and old snapshots nobody deletes. None of these show up as a single dramatic line item, but together they're often a meaningful chunk of a bloated bill.
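The arithmetic behind these savings is worth being able to do on a whiteboard. The sketch below uses a hypothetical $0.10/hour on-demand rate and treats the "up to" discounts quoted above as best-case ceilings; actual discounts depend on instance type, region, term, and payment option.

```python
HOURS_PER_MONTH = 730  # common approximation: 8,760 hours per year / 12


def monthly_cost(hourly_rate, instance_count, discount=0.0):
    """Monthly cost in dollars for always-on instances at a given discount."""
    return hourly_rate * HOURS_PER_MONTH * instance_count * (1 - discount)


# Hypothetical fleet: 10 instances at $0.10/hour on-demand.
on_demand = monthly_cost(0.10, 10)
reserved_best_case = monthly_cost(0.10, 10, discount=0.70)
spot_best_case = monthly_cost(0.10, 10, discount=0.90)

# Rightsizing first: halving the fleet saves money before any commitment.
rightsized_on_demand = monthly_cost(0.10, 5)
```

In this example the on-demand fleet costs about $730 a month, the best-case reserved price is about $219, and best-case spot is about $73. Rightsizing to 5 instances already brings on-demand down to about $365, which is why it comes first: committing to reserved capacity for oversized instances locks in the waste.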

## Architecture, scenario design & disaster recovery

### Design a highly available web application architecture — walk through it step by step.

This is the cloud equivalent of a system design round, and interviewers want you to drive it the same way: clarify scope, then build up layer by layer. Start with a load balancer as the single public entry point, distributing traffic across an auto-scaling group of app servers spread across at least two (ideally three) availability zones, so the loss of one AZ doesn't take the whole app down. Behind that, use a managed database with automated multi-AZ failover for writes and one or more read replicas to offload read traffic, rather than scaling a single database instance vertically forever. Put a CDN in front of static assets and cacheable API responses to cut both latency and origin load, and add health checks at every layer — load balancer to app server, app server to database — so unhealthy nodes are pulled out of rotation automatically instead of silently serving errors. Layer in monitoring and alerting on user-facing symptoms (error rate, latency, saturation) across all of this, and close with a disaster recovery plan: what's your RPO and RTO if an entire region goes down, and which DR pattern (below) matches that target. Walking through it in this order — entry point, compute, data, edge caching, health, observability, DR — is exactly the structure interviewers are listening for, more than any single service name.

### What's the difference between RPO and RTO, and how do they shape your DR strategy?

RPO (Recovery Point Objective) is how much data you can afford to lose, measured as time — an RPO of 15 minutes means your backups or replication must be frequent enough that a failure never loses more than 15 minutes of data. RTO (Recovery Time Objective) is how long you can afford to be down — an RTO of 1 hour means your whole recovery process, from failure detection to serving traffic again, must complete within an hour. These two numbers, not vague statements like "we need high availability," are what should actually drive your DR architecture choice, because tighter targets cost more — there's a real, defensible trade-off being made, and interviewers want to hear you name it explicitly rather than reflexively answering "multi-region active-active" for every scenario.
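Both objectives can be checked mechanically once they are written down as durations. This is a minimal sketch; the function names and the four recovery phases are assumptions made for illustration, and real recovery timings should come from an actual DR drill rather than estimates.

```python
from datetime import timedelta


def meets_rpo(recovery_point_interval: timedelta, rpo: timedelta) -> bool:
    """Worst-case data loss is the gap between recovery points (backups
    or replication checkpoints), so that gap must not exceed the RPO."""
    return recovery_point_interval <= rpo


def meets_rto(detect: timedelta, provision: timedelta,
              restore: timedelta, cutover: timedelta,
              rto: timedelta) -> bool:
    """RTO covers the whole path from failure to serving traffic again,
    so every phase counts, including detection."""
    return detect + provision + restore + cutover <= rto
```

Hourly backups fail a 15-minute RPO no matter how fast the restore is. A recovery made up of 10 minutes of detection, 20 of provisioning, 25 of restore, and 10 of cutover takes 65 minutes in total, which misses a 1-hour RTO even though no single phase looks slow.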

### Explain the four standard disaster recovery patterns, from cheapest to most resilient.

**Backup and restore** is the cheapest and slowest: regular backups stored in another region, restored onto fresh infrastructure only when disaster strikes — acceptable RPO/RTO in hours, fine for non-critical internal systems. **Pilot light** keeps a minimal version of critical infrastructure (often just the database, replicating continuously) running in the DR region, with the rest provisioned from IaC only when needed — faster than backup-restore because the data layer is already warm. **Warm standby** runs a scaled-down but fully functional copy of the production environment in the DR region at all times, ready to take full traffic after scaling up — much faster recovery, at the cost of paying for that standby capacity continuously. **Multi-site active-active** runs full production capacity in two or more regions simultaneously, serving live traffic from both, giving near-zero RTO since there's no failover step at all — the most expensive and operationally complex pattern, reserved for systems where downtime is genuinely unacceptable. The right answer in an interview names the actual trade-off (cost vs RTO/RPO) rather than defaulting to the most resilient-sounding option.
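The "cost vs RTO/RPO" trade-off amounts to picking the cheapest pattern that still meets both targets. The sketch below encodes that rule. The RTO and RPO figures attached to each pattern are placeholders invented for this example, not benchmarks, and should be replaced with numbers measured for your own system.

```python
from datetime import timedelta

# Ordered cheapest first. The (RTO, RPO) values are illustrative
# placeholders only; measure your own in DR drills.
PATTERNS = [
    ("backup and restore", timedelta(hours=24), timedelta(hours=24)),
    ("pilot light", timedelta(hours=4), timedelta(minutes=15)),
    ("warm standby", timedelta(minutes=30), timedelta(minutes=5)),
    ("multi-site active-active", timedelta(minutes=1), timedelta(minutes=1)),
]


def cheapest_pattern(rto_target: timedelta, rpo_target: timedelta):
    """Return the first (cheapest) pattern meeting both targets, or None."""
    for name, rto, rpo in PATTERNS:
        if rto <= rto_target and rpo <= rpo_target:
            return name
    return None
```

Under these placeholder numbers, a 1-hour RTO with a 15-minute RPO lands on warm standby, while an internal tool that can tolerate two days of downtime lands on backup and restore. The useful interview habit is the loop itself: start from the cheapest option and only move up when a target forces you to.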

<div class="verdict"><strong>The core truth:</strong> Cloud engineers architect for the "ilities" — availability, scalability, security, and cost. Knowing service names isn't enough; reasoning about trade-offs (cost vs availability, simplicity vs scale, RTO vs spend) out loud, with a specific scenario in front of you, is the real signal.</div>

## Practise explaining, not just memorizing

You can name every AWS service and still freeze when asked to walk through a highly-available architecture out loud, live, with an interviewer asking "what happens if that AZ goes down mid-deploy." Cloud rounds are architecture conversations, not quizzes, so the practice has to be conversational too. [Greenroom](/) runs spoken technical mock interviews that follow up on your reasoning the way a real interviewer would — pushing on a design choice, asking what you'd do differently under a tighter budget — and gives feedback on how clearly you explained the trade-off, not just whether you named the right service. Pair it with our [AWS](/blog/aws-interview-questions) and [system design](/blog/system-design-interviews-what-they-test) guides.

## Frequently asked questions

### What questions are asked in a cloud engineer interview?

Cloud engineer interviews cover core compute/storage/networking primitives (VMs vs containers vs serverless, object vs block vs file storage, VPCs and load balancing), identity and IAM with least privilege, infrastructure as code (Terraform, CloudFormation, Pulumi), secrets and configuration management, CI/CD for infrastructure, observability (metrics, logs, traces, alerting), cost optimization, and architecture scenarios covering high availability and disaster recovery (RPO/RTO).

### What is infrastructure as code, and why does the state file matter so much?

Infrastructure as code (IaC) is managing and provisioning cloud infrastructure through machine-readable definition files rather than manual console changes, using tools like Terraform, CloudFormation, or Pulumi to declare resources in version-controlled code. The state file is how a tool like Terraform knows what infrastructure actually exists and maps it to your config, so a lost, corrupted, or out-of-sync state file is one of the most dangerous failure modes in IaC — which is why remote state with locking is essential for any team beyond one person.

### How do you design a highly available cloud system?

Design for redundancy and no single point of failure: a load balancer distributing traffic across an auto-scaling group spread across multiple availability zones, a managed database with multi-AZ failover and read replicas, a CDN in front of static and cacheable content, health checks at every layer, and monitoring and alerting on user-facing symptoms. Pair this with a disaster recovery plan that defines your RPO (how much data loss you can tolerate) and RTO (how long you can be down).

### What's the difference between RPO and RTO?

RPO (Recovery Point Objective) is how much data you can afford to lose during a failure, measured in time — it determines how frequently you need backups or replication. RTO (Recovery Time Objective) is how long you can afford to be down before service is restored. Tighter targets on either cost more to achieve, and the four standard DR patterns — backup and restore, pilot light, warm standby, and multi-site active-active — trade cost against how close to zero they get your RPO and RTO.

### How do you manage secrets in a cloud environment?

Secrets like API keys and database passwords belong in a dedicated secrets manager — AWS Secrets Manager, Parameter Store, Azure Key Vault, or HashiCorp Vault — never committed to source control or baked as plain environment variables into an image. These tools support automatic rotation, scoped access policies, and audit trails, letting the same build artifact move between environments using environment-specific configuration injected at deploy time rather than rebuilt per environment.

### How should I prepare for a cloud engineer interview?

Learn the core services, networking, and IAM fundamentals across at least one major cloud provider, get comfortable reading and writing basic infrastructure-as-code, and then focus most of your prep on architecture trade-offs — high availability, cost, and disaster recovery — since reasoning through a scenario out loud is the real signal interviewers are scoring. Practise explaining a full HA design and a DR strategy verbally with a mock interview that asks realistic follow-up questions, because cloud rounds are architecture conversations, not service-name quizzes.

