---
title: Data Engineer Interview Questions & Answers (2026): Pipelines, SQL & Big Data
description: Data engineer interview questions for 2026 — ETL vs ELT, star vs snowflake schema, SCD types, Spark, Airflow, and a worked pipeline-design scenario with real answers.
url: https://usegreenroom.app/blog/data-engineer-interview-questions
last_updated: 2026-06-21
---

← Back to blog

Roles

# Data engineer interview questions and answers

June 21, 2026 · 33 min read

![Data engineer interview questions and answers — cover from Greenroom, the AI mock interviewer](/assets/blog/data-engineer-interview-questions-hero.webp)

It's 3:14am and your phone is doing the thing — that specific PagerDuty buzz pattern that means "you are not going back to sleep." The clickstream pipeline that feeds tomorrow's 10am demo to a VP has silently stopped writing to the warehouse six hours ago. Not crashed — *silently stopped*, the worst kind, because the DAG in Airflow shows a cheerful green checkmark on every task while the actual table has had zero new rows since 9pm. You open your laptop in the dark, squint at logs, and discover the upstream Kafka topic got a schema change at 8:54pm — somebody added a field, renamed another, and your Spark job has been quietly dropping every malformed record into a `_corrupt_record` column nobody ever set up an alert for. The demo is in six hours. You are, in the parlance of the field, going to be doing some "live data reconciliation" before sunrise.

This is, in miniature, the entire emotional arc of being a **data engineer**, compressed into one night: things that look fine are not fine, the failure is upstream of where you're looking, and the interview process for this job is designed almost entirely around finding out whether you've lived through a night like this one — or would panic and freeze the first time it happened to you for real. **Data engineer interview questions** in 2026 aren't really about trivia. They're about whether you understand why a pipeline failed silently, how you'd have caught it sooner, and whether you can design the next one so it fails loudly instead.

This guide covers the **data engineer interview questions** that actually come up — SQL and data modeling, star vs snowflake schema and slowly changing dimensions, ETL vs ELT, batch vs streaming, Apache Spark fundamentals, orchestration with Airflow, data quality, idempotent pipeline design, and a full worked **pipeline design interview** scenario — with real, detailed answers and the reasoning behind each one, not just a definition to memorize. We'll also cover where common prep methods (GeeksforGeeks-style dumps, a friend's WhatsApp PDF, generic ChatGPT prompting, generic "Spark interview questions" listicles) help and where they quietly leave you exposed, the same way that 8:54pm schema change left your pipeline exposed.

## SQL and data modeling

Every data engineering interview, regardless of company or level, opens with the assumption that your **SQL** is not just "good" but reflexive — the kind of fluency where you're not translating English into SQL in your head, you're thinking natively in joins and window functions. If you haven't already, it's worth pairing this guide with our dedicated [SQL interview questions guide](/blog/sql-interview-questions), because the questions below assume that baseline and build the data-modeling layer on top of it.

### What's the difference between a star schema and a snowflake schema?

A **star schema** has a single, flat **fact table** (the thing you're measuring — orders, page views, transactions) surrounded by **dimension tables** (the things you're slicing by — customer, product, date, region) that are each fully denormalized into one table per dimension. A **snowflake schema** takes those dimension tables and normalizes them further — so instead of one flat `product` dimension with a `category_name` column, you'd have `product` pointing to a separate `category` table, which might point to a `department` table. The star schema trades some storage and redundancy for query simplicity (fewer joins, easier for analysts and BI tools to reason about); the snowflake schema trades query complexity (more joins) for storage efficiency and stronger normalization, which matters more when dimensions are large and slowly changing. In practice, most modern analytics warehouses default to star schema because storage is cheap and join complexity is the thing that actually slows down both query performance and analyst comprehension — Snowflake and BigQuery's columnar, compute-on-demand pricing models make the "save storage" argument for snowflake schemas much weaker than it was a decade ago.

### What are fact tables and dimension tables, and how do you tell them apart when designing a schema?

A **fact table** stores the measurable events or transactions — numeric, additive values like `quantity`, `revenue`, `duration_seconds` — at a defined grain (one row per order, one row per page view, one row per minute of sensor data), plus foreign keys pointing to the dimensions that describe each fact. A **dimension table** stores the descriptive context around those facts — who, what, where, when — and is the table you `GROUP BY` and filter on. The practical test interviewers use: if a column is something you'd aggregate (`SUM`, `AVG`, `COUNT`), it belongs in the fact table; if it's something you'd filter or group by (a name, a category, a status, a date), it belongs in a dimension. Getting the **grain** of the fact table wrong — building it at "one row per order" when the business actually needs "one row per order line item" — is the single most common real-world data-modeling mistake, because it's invisible until someone asks a question the grain can't answer and you discover you've been double-counting or under-counting for months.

### Normalization vs denormalization — which do you use for an analytics warehouse, and why?

**Normalization** (3NF and beyond) eliminates redundancy by splitting data into many small, related tables — the right default for transactional (OLTP) systems where you're optimizing for fast, safe writes and data integrity, because updating a customer's address in one normalized table can't accidentally leave a stale copy somewhere else. **Denormalization** intentionally duplicates data into wider, flatter tables — the right default for analytical (OLAP) systems, because analytics workloads are read-heavy, join-heavy, and tolerant of some redundancy in exchange for query speed and simplicity. The practical answer interviewers want: a transactional `orders` database feeding a checkout flow should be normalized; the `fact_orders` table in your warehouse, built for analysts running ad-hoc aggregations, should be denormalized — which is exactly why ETL/ELT pipelines exist as the bridge between the two worlds, transforming a normalized operational schema into a denormalized analytical one.

### What are slowly changing dimensions (SCD), and what's the difference between Type 1, Type 2, and Type 3?

A **slowly changing dimension** is a dimension whose attributes change over time but not on every load — a customer changes address, a product changes category, an employee changes department — and the SCD type you choose determines whether you keep history of that change or just overwrite it.

**Type 1 — overwrite.** The old value is simply replaced; no history is kept. Use this when history genuinely doesn't matter (correcting a typo in a customer's name).

```sql
UPDATE dim_customer
SET city = 'Bengaluru'
WHERE customer_id = 4821;
```

**Type 2 — add a new row, version it.** The old row is kept and marked inactive (or given an end date), and a new row is inserted with the new value, a new surrogate key, and its own effective date range. This is the default choice when you need to know what a dimension's value *was* at the time a fact occurred — e.g., "what region was this customer in when they placed this order," which Type 1 cannot answer because it's already overwritten the old region.

```sql
-- close out the old version
UPDATE dim_customer
SET is_current = FALSE, end_date = CURRENT_DATE
WHERE customer_id = 4821 AND is_current = TRUE;

-- insert the new version
INSERT INTO dim_customer (customer_id, city, is_current, start_date, end_date)
VALUES (4821, 'Bengaluru', TRUE, CURRENT_DATE, NULL);
```

**Type 3 — add a new column.** Instead of new rows, you add a `previous_city` column alongside `current_city`, keeping only the immediately prior value, not full history. This is the right compromise when you only ever need to compare "current vs. last," not a full timeline — it's cheaper than Type 2 but loses anything more than one step back.

The interview follow-up that separates people who've memorized the types from people who've implemented them: "what happens to your fact table's joins when a Type 2 dimension changes?" The answer is that fact rows must reference the dimension's **surrogate key** (the versioned row), not its natural/business key, or every historical fact silently re-points to the customer's *current* city instead of the city that was true when the order happened — a bug that's invisible until someone asks a question that depends on history and gets a wrong answer with total confidence.

### Data warehouse vs data lake vs lakehouse — what's actually different?

A **data warehouse** (Snowflake, BigQuery, Redshift) stores structured, schema-on-write data optimized for fast SQL analytics — you define the schema before loading, which gives you strong consistency and query performance but makes it expensive to store raw, messy, or semi-structured data you're not sure you'll need yet. A **data lake** (raw files in S3, ADLS, or GCS, often as Parquet or JSON) stores data schema-on-read, cheaply, at any structure or scale, trading that flexibility for weaker query performance and a real risk of becoming a "data swamp" — a lake nobody can find anything useful in because there's no governance layer. A **lakehouse** (Databricks' Delta Lake, Apache Iceberg, Apache Hudi) is the attempt to get both: it adds a transactional metadata layer on top of lake storage, giving you ACID transactions, schema enforcement, time travel, and warehouse-like query performance directly on files sitting in cheap object storage, without forcing you to copy data into a separate proprietary warehouse first. The honest interview answer is a decision framework, not a single "best" tool: warehouses win when your workload is overwhelmingly structured BI/SQL; lakes win when you need to cheaply retain raw, varied, possibly-unstructured data (logs, images, sensor data) you haven't fully modeled yet; lakehouses are increasingly the default for new builds because they reduce the need to maintain both a lake *and* a separate warehouse with duplicated data and double the pipeline complexity.

![Whiteboard diagram of a data pipeline architecture with ingestion, processing, and storage layers](/assets/blog/pool-system-design.webp)

A pipeline-design whiteboard question rewards drawing the data flow first, then defending every box in it.

## ETL vs ELT

### What's the actual difference between ETL and ELT, and when do you use each?

**ETL (Extract, Transform, Load)** pulls data from a source, transforms it — cleaning, joining, aggregating, reshaping — in a separate processing layer (historically a dedicated ETL server, now often Spark), and only then loads the finished, transformed data into the destination warehouse. **ELT (Extract, Load, Transform)** flips the last two steps: it loads raw data into the destination first, and does the transformation *inside* the destination itself, using the warehouse's own compute. The shift from ETL to ELT as the default over the last decade isn't a fad — it tracks the fact that cloud warehouses like Snowflake, BigQuery, and Redshift got cheap enough and fast enough at scale that it stopped making sense to build and maintain a separate transformation tier when the warehouse could just do it, and keeping the raw, untransformed data around (rather than discarding it after ETL) turned out to be valuable for re-running transformations differently later without re-extracting from the source.

In practice: ETL still wins when you have strict compliance requirements to scrub or mask sensitive data *before* it's ever persisted anywhere (PII redaction before storage, not after), when your destination has limited compute relative to your source data volume, or when you're integrating with legacy on-prem systems that were architected before cloud-warehouse compute was an option. ELT wins for the majority of modern cloud-native stacks — tools like **Fivetran** or **Airbyte** handle the extract-and-load piece with minimal custom code, and **dbt** (data build tool) handles the transform step as SQL models that run inside the warehouse, version-controlled and testable like software. The interview signal isn't "which one is right" — it's whether you can articulate the actual tradeoff (compute location, compliance timing, raw-data retention) rather than reciting that "ELT is the new ETL."

### Name the actual tools you'd use to build an ETL or ELT pipeline in 2026, and what each one is for.

For extraction and loading: **Fivetran** and **Airbyte** are the standard managed connectors for pulling from SaaS APIs and databases into a warehouse with minimal custom code; **Debezium** handles change-data-capture (CDC) from operational databases, streaming row-level inserts/updates/deletes as events rather than doing slow full-table re-syncs. For transformation: **dbt** is the dominant tool for the "T" in modern ELT — SQL-based, version-controlled, testable transformation models that run inside the warehouse. For distributed processing at larger scale or with non-SQL logic: **Apache Spark** (batch) and **Apache Flink** or **Kafka Streams** (streaming, see below). For orchestration tying it all together: **Apache Airflow** (and increasingly **Dagster** or **Prefect** as lighter-weight, more Python-native alternatives). Knowing the *names* of these tools matters less than being able to say, for a given scenario, which piece of the stack each one occupies and why you'd reach for it — interviewers can tell instantly when a candidate has memorized a tool list versus actually wired these together.

## Batch vs streaming processing

### What's the real difference between batch and streaming processing, and how do you decide which one a use case needs?

**Batch processing** (the classic **Apache Spark** batch model, or even a nightly SQL job) processes a bounded, finite chunk of data — "yesterday's orders," "this hour's logs" — on a schedule, trading latency for throughput and simplicity: you can process huge volumes efficiently because you're not constrained by per-event overhead, but the data is only as fresh as your last run. **Streaming processing** (Kafka as the durable event log, with **Flink**, Kafka Streams, or Spark Structured Streaming as the processing layer) processes an unbounded sequence of events continuously, as they arrive, trading some throughput efficiency and a lot of operational complexity for low latency — results are seconds or sub-second fresh instead of hours fresh.

The decision isn't "streaming is more advanced so it's always better" — interviewers specifically watch for that wrong instinct. A nightly finance reconciliation job has no business being a streaming pipeline; the correctness requirements (exact, auditable, reproducible numbers) and the total absence of a latency requirement make batch strictly simpler and strictly better there. A fraud-detection system that needs to block a transaction *during* the transaction has no business being a batch job; by the time tomorrow's batch run flags the fraud, the money is gone. The honest framework: ask what the actual latency requirement is *in business terms*, not in engineering-aesthetic terms — "the VP wants this number by 9am" is a batch problem even if streaming would be more "impressive" to build, and "the user needs to see their own action reflected within 5 seconds" is a hard requirement that batch cannot satisfy no matter how fast you make the batch job.

### What is Kafka's role specifically, and how does it relate to Spark and Flink?

**Kafka** itself isn't a processing engine — it's a distributed, durable, append-only event log (a "commit log") that decouples producers of events from consumers of events, retains events for a configurable window (hours to indefinitely), and lets multiple independent consumers read the same stream at their own pace without affecting each other. **Spark Structured Streaming** and **Apache Flink** are processing engines that *consume from* Kafka (or similar sources) and apply transformations, aggregations, and windowing logic to the stream. The distinction that trips people up: Kafka is the pipe and the buffer, not the thing doing the computation — when someone says "we use Kafka for real-time analytics," what they actually mean is "we use Kafka to durably transport events to a separate streaming engine that does the analytics." For a deeper dive into Kafka's internals — partitions, consumer groups, offsets, exactly-once semantics — see our dedicated [Kafka interview questions guide](/blog/kafka-interview-questions).

## Apache Spark fundamentals

### What's the difference between an RDD and a DataFrame in Spark, and why did the ecosystem move toward DataFrames?

An **RDD** (Resilient Distributed Dataset) is Spark's original, lowest-level abstraction — an immutable, partitioned collection of objects distributed across a cluster, manipulated through functional transformations (`map`, `filter`, `reduceByKey`) that Spark lazily assembles into a execution plan and only actually runs when an action (`collect`, `count`, `save`) triggers it. A **DataFrame** is a higher-level abstraction built on top of the same distributed-execution engine, but organized into named columns with a known schema — conceptually closer to a SQL table or a pandas DataFrame — which lets Spark's **Catalyst optimizer** analyze your query plan and apply optimizations (predicate pushdown, column pruning, join reordering) that it simply cannot do with raw RDDs, because RDDs are opaque arbitrary Python/Scala objects with no schema for the optimizer to reason about. The ecosystem moved toward DataFrames (and the closely related Dataset API in Scala/Java) because the optimizer gains alone typically produce dramatically faster execution for the same logical operation, and the syntax is closer to SQL, which lowers the barrier for analysts and reduces bugs. RDDs still exist for genuinely low-level use cases — custom partitioning logic, non-tabular data, fine-grained control the optimizer would otherwise get in the way of — but the honest 2026 answer is "default to DataFrames (or Spark SQL) unless you have a specific reason not to."

### What is a shuffle in Spark, and why is it the thing everyone tries to avoid?

A **shuffle** happens whenever data needs to move *between* partitions across the cluster to complete an operation — a `groupBy`, a `join` on a non-co-partitioned key, a `repartition`. Unlike a `map`-style transformation that each partition can do independently and locally, a shuffle requires writing intermediate data to disk, transferring it across the network to the node that needs it for the next stage, and reading it back — which makes shuffles by far the most expensive operation class in Spark, often the single biggest determinant of a job's runtime. Interviewers ask about shuffles because diagnosing "why is this Spark job slow" almost always comes down to "where is it shuffling more data than it needs to, and can that be avoided." Common fixes: filtering and projecting columns *before* a join or groupBy rather than after (so you're shuffling less data), using `broadcast` joins when one side of a join is small enough to fit in memory on every executor (avoiding a shuffle of the large side entirely), and choosing partition keys that align with your most common join/groupBy keys so related data already lives on the same node.

### How does partitioning work in Spark, and what's a real example of getting it wrong?

Spark splits a dataset into **partitions** — the unit of parallelism, where each partition is processed by a single task on a single core. Too few partitions means you're under-using your cluster (some cores sit idle while a few overloaded partitions take forever); too many partitions means you're paying scheduling overhead on tasks too small to be worth distributing. A classic real example: ingesting a dataset with a heavily skewed key — say, 90% of e-commerce orders belong to one mega-retailer customer ID — and naively partitioning by `customer_id`. Every partition for the long tail of small customers finishes in seconds; the one partition holding the mega-retailer's orders takes twenty minutes alone while every other executor sits idle waiting for it, because Spark can't split a single key's data across multiple partitions by default. The fix is **salting** — appending a random suffix to the skewed key to artificially spread its rows across multiple partitions, then aggregating the partial results back together in a second pass — or repartitioning on a composite key that doesn't share the same skew. This is a genuinely common production Spark bug, and being able to describe it from a place of "I've debugged this" rather than "I read that this can happen" is a strong signal in an interview.

```python
# naive partition by skewed key — one partition becomes a bottleneck
df = df.repartition("customer_id")

# salting fix: spread the skewed key across N synthetic partitions
from pyspark.sql.functions import rand, floor, concat, lit

df = df.withColumn("salt", floor(rand() * 10))
df = df.withColumn("salted_key", concat("customer_id", lit("_"), "salt"))
df = df.repartition("salted_key")
# aggregate partial results, then combine across salt buckets in a second stage
```

## Orchestration — Airflow and DAGs

### What does an orchestrator like Airflow actually do, and why can't you just run pipeline steps with cron?

An orchestrator manages **dependencies between tasks** — which steps must finish before others can start, what happens on failure (retry, alert, skip), and gives you visibility into the whole pipeline's execution history in one place. Cron can trigger a single script on a schedule, but it has no concept of "don't run step C until steps A and B have both succeeded," no built-in retry/backoff logic, no dependency graph, and no UI showing you which of fifty scheduled jobs silently failed last Tuesday. **Apache Airflow** represents a pipeline as a **DAG** (Directed Acyclic Graph) — a set of tasks and the dependencies between them, which must form no cycles (a task can't depend, even indirectly, on itself) — and Airflow's scheduler figures out what's runnable right now based on what's already completed, handles retries with configurable backoff, and surfaces failures with logs and alerting rather than letting them disappear into a cron job's silently-discarded stdout.

```python
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime, timedelta

default_args = {
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="clickstream_daily_aggregate",
    schedule_interval="@daily",
    start_date=datetime(2026, 1, 1),
    default_args=default_args,
    catchup=False,
) as dag:

    extract = PythonOperator(task_id="extract_raw_events", python_callable=extract_events)
    validate = PythonOperator(task_id="validate_schema", python_callable=validate_schema)
    transform = PythonOperator(task_id="aggregate_sessions", python_callable=aggregate_sessions)
    load = PythonOperator(task_id="load_to_warehouse", python_callable=load_warehouse)

    extract >> validate >> transform >> load
```

The interview follow-up worth rehearsing out loud: "what happens if `load_to_warehouse` fails after `aggregate_sessions` already wrote its output?" The right answer touches idempotency (see below) — if re-running `load` would double-insert rows, your DAG isn't actually safe to retry, which defeats half the point of using an orchestrator with built-in retries in the first place.

## Idempotent pipeline design

### How do you design a pipeline so it's safe to re-run, and what does "idempotent" actually mean in this context?

A pipeline step is **idempotent** if running it multiple times with the same input produces the same end state as running it once — re-running it doesn't create duplicate rows, double-count revenue, or otherwise change the outcome beyond the first successful run. This matters enormously in practice because failures *will* happen — a network blip, a worker getting killed mid-job, an orchestrator retry firing automatically — and if your pipeline isn't idempotent, every failure-and-retry cycle is a live risk of silently corrupting your data, which is far worse than the job simply failing loudly and staying failed.

The most common non-idempotent pattern is a naive `INSERT` on every run — re-run the job twice and you've doubled every row. The fix is almost always one of: a) deleting and re-inserting the specific partition/date range you're about to write (`DELETE WHERE load_date = '2026-06-21'` then insert fresh), so reruns overwrite rather than append; or b) an upsert (`MERGE`/`INSERT ... ON CONFLICT`) keyed on a natural or business key, so a re-run updates existing rows instead of duplicating them.

```sql
-- idempotent upsert pattern: re-running this with the same batch
-- updates existing rows instead of duplicating them
MERGE INTO fact_orders AS target
USING staging_orders AS source
ON target.order_id = source.order_id
WHEN MATCHED THEN
  UPDATE SET
    target.amount = source.amount,
    target.status = source.status,
    target.updated_at = source.updated_at
WHEN NOT MATCHED THEN
  INSERT (order_id, amount, status, updated_at)
  VALUES (source.order_id, source.amount, source.status, source.updated_at);
```

The interview signal here is whether you reach for idempotency *by design*, before anything has gone wrong, rather than treating it as a fix you'd bolt on after the first duplicate-data incident — because in production, there usually only is one such incident before someone asks "wait, why is revenue 2x for last Tuesday," and by then the cleanup is its own multi-hour project.

## Data quality and validation

### How do you catch bad data before it reaches your warehouse or your stakeholders?

Data quality has to be checked at multiple points, not just once at the end, because different problems are detectable at different stages. At ingestion: schema validation (does this event have the fields and types you expect — this is exactly what would have caught the 3am incident from the opening story before it silently dropped records for six hours). Mid-pipeline: row-count and null-rate checks against historical baselines (did today's batch have 40% fewer rows than the seven-day average — that's a strong signal something upstream broke even if every individual row looks valid). At the warehouse layer: tools like **dbt tests** (`unique`, `not_null`, `accepted_values`, custom SQL assertions) or **Great Expectations** run automated assertions against your transformed tables as part of the pipeline itself, failing the job loudly rather than letting bad data quietly land in a dashboard a VP looks at the next morning. The single biggest mindset shift interviewers are listening for: data quality isn't a one-time cleaning step, it's a continuous, automated gate built into the pipeline — the goal is for a data quality failure to look like a *failed pipeline run* with an alert, not a *wrong number* that someone notices three weeks later during a board meeting.

### How do you handle a pipeline failure or late-arriving data?

For failures: orchestrator-level retries with exponential backoff handle transient issues (a flaky network call, a momentarily overloaded source database); checkpointing (saving progress so a restart resumes from the last successful point rather than redoing the whole job) matters for long-running jobs where redoing everything from scratch after every blip is wasteful; and dead-letter handling (routing records that fail validation to a separate table or topic for later inspection, rather than either silently dropping them or crashing the whole batch over one bad row) keeps one malformed event from taking down an entire run.

For late-arriving data — events that arrive after the time window they logically belong to has already been processed, common with mobile clients that buffer events offline and sync hours later — the two real options are: a) define a **grace period** or **watermark** (common in streaming systems like Flink and Spark Structured Streaming) that holds a window open for a bounded amount of extra time before finalizing it, accepting that you're trading some latency for completeness; or b) accept that some lateness is unrecoverable within the original window and instead support **backfill** — reprocessing a historical partition once you know data arrived late, which is exactly why idempotent pipeline design (above) matters so much here too, since a backfill is just a deliberate re-run.

## Partitioning and bucketing for performance

### What's the difference between partitioning and bucketing, and when do you use each for query performance?

**Partitioning** physically splits a table's data into separate directories/files based on a column's value — most commonly a date, so `load_date=2026-06-21/` is a distinct physical location from `load_date=2026-06-20/`. The performance win is **partition pruning**: a query that filters on `WHERE load_date = '2026-06-21'` can skip reading every other date's files entirely, rather than scanning the whole table. The tradeoff: too many small partitions (partitioning by a high-cardinality column, or at too fine a grain) creates a "small files problem" where the overhead of opening thousands of tiny files outweighs the pruning benefit.

**Bucketing** splits data into a fixed number of buckets based on a hash of a column's value — useful for high-cardinality columns (like `user_id`) where partitioning would create too many tiny partitions, but where you still frequently filter or join on that column. Bucketing's specific superpower is making joins faster: if two tables are bucketed on the same key with the same number of buckets, a join between them can avoid a full shuffle, because matching rows are already guaranteed to live in correspondingly-numbered buckets. The honest interview answer: partition on low-cardinality columns you frequently filter by (date, region, status); bucket on high-cardinality columns you frequently join or filter by (user ID, product ID) where partitioning would explode into too many directories.

## Worked example — design a pipeline that ingests mobile clickstream events and makes them queryable within 5 minutes

This is the kind of open-ended pipeline-design question that shows up in mid-to-senior data engineering loops, and it rewards exactly the same "clarify, then build up layer by layer, narrating trade-offs" approach as a frontend or backend system-design question — the skeleton doesn't change, only the components do.

**1. Clarify requirements first.** What's "queryable" mean here — ad-hoc SQL by an analyst, or a fixed dashboard metric? What's the actual event volume (10K events/sec from a small app, or 500K/sec from a major consumer app) — this single number determines whether Kafka-plus-Flink is overkill or mandatory. Is "5 minutes" a hard SLA someone will page you over, or a soft target? Is some data loss during an outage acceptable, or does every event need a durability guarantee? Don't skip this step even under time pressure — an interviewer who watches you ask these questions before drawing a single box is already more confident in you than one who jumps straight to "we'll use Kafka."

**2. Ingestion layer.** Mobile clients batch and send events (often via a lightweight SDK that buffers offline and retries), hitting an ingestion endpoint that writes immediately into **Kafka** — Kafka's role here is exactly what we covered above: a durable, ordered buffer that decouples "the app sent an event" from "something processed it," so a downstream processing hiccup doesn't mean lost events, and multiple consumers (real-time aggregation, raw archival to a data lake, fraud-detection) can read the same stream independently.

**3. Stream processing layer.** A **Flink** (or Spark Structured Streaming) job consumes from Kafka, does light transformation — schema validation, deduplication on an event ID (mobile SDKs commonly emit the same event twice across a flaky retry, so dedup is not optional), and windowed aggregation (e.g., sessionization, rolling counts) — and writes results into a queryable destination. The 5-minute SLA is the thing driving the *entire* choice of streaming over batch here: a nightly or even hourly batch Spark job structurally cannot hit a 5-minute freshness target, no matter how well-tuned it is, because the latency floor of a batch architecture is "however often the batch runs," not "how fast each run executes."

**4. Storage/serving layer.** Write aggregated results into a destination tuned for the actual query pattern — a low-latency OLAP store (ClickHouse, Druid, or a warehouse like BigQuery/Snowflake with streaming inserts) if analysts need ad-hoc SQL within minutes, or a pre-aggregated key-value store if the only consumer is a fixed dashboard. Separately, archive the *raw* events to a data lake (S3/Parquet) — not because the 5-minute path needs it, but because you will eventually need to reprocess history (a bug fix, a new aggregation nobody thought of yet, a backfill), and the raw lake copy is your insurance policy for exactly that.

**5. Data quality and idempotency, applied to this specific pipeline.** Dedup by event ID at the stream-processing layer (point 3) handles the "mobile client retried and sent the same event twice" case. A dead-letter topic catches malformed events instead of crashing the Flink job over one bad payload. The aggregation writes should be upsert-based (point on idempotency above), so a Flink job restart after a crash doesn't double-count a window it had already partially written.

**6. Monitoring and failure modes.** Alert on consumer lag (Kafka consumer falling behind producer rate is the single earliest warning sign of an SLA breach — and is exactly the kind of metric that would have caught the 3am incident from this guide's opening hours before a human noticed). Alert on a sudden drop in event volume (the "40% fewer rows than baseline" check from the data-quality section, applied to a streaming context). And have a clear answer ready for "what happens if Flink falls badly behind" — usually some combination of scaling out consumers, and a documented decision about whether to drop the 5-minute SLA temporarily in favor of eventually catching up versus dropping data to keep pace.

The same skeleton — clarify requirements, ingestion, processing, storage/serving, quality and idempotency, monitoring — works for almost any pipeline-design prompt you'll get: "design a pipeline for fraud detection on payment events," "design an analytics pipeline for a ride-sharing app's trip data," "design a pipeline that keeps a search index in sync with a product catalog." The components change; the way you walk through it out loud, in order, naming trade-offs at each step, does not.

<div class="verdict"><strong>The core truth:</strong> data engineering interviews aren't testing whether you've memorized that ETL stands for Extract-Transform-Load. They're testing whether you've actually been paged at 3am for a silent pipeline failure, and whether you'd design the next one so the failure is loud, the rerun is safe, and the data quality check catches it before a VP does.</div>

## How candidates actually prepare — and where each method quietly falls short

Almost everyone preparing for a data engineering interview reaches for some combination of the same four resources. None of them are useless. All four train a different half of the actual interview, and the half they skip is exactly the half that decides offers.

**GeeksforGeeks-style "data engineer interview questions" dumps.** These are genuinely useful for a topic checklist — if you've read one, several of the headings in this guide will look familiar, and that's by design; the canonical topic list (ETL vs ELT, star vs snowflake, SCD types, Spark partitioning) is genuinely the right list. The gap is that a question dump gives you a static prompt and a static model answer, with no mechanism for an interviewer to change the constraint mid-answer — "okay, now the data source is 10x bigger, does your design still work?" — which is precisely the moment that separates a candidate who understood the trade-off from one who memorized the paragraph.

**A friend's WhatsApp PDF of "questions they asked me."** This is real, useful calibration — knowing that a specific company's loop spends 20 minutes on a live pipeline-design whiteboard rather than pure SQL trivia genuinely helps you prepare for the *shape* of the interview. The limitation is sample size: it's one person's experience on one team in one hiring cycle, forwarded through three group chats, possibly already slightly wrong by the time it reaches you. Treat it as orientation, never as a guaranteed syllabus.

**Generic ChatGPT mock-interview prompting.** Typing "ask me data engineer interview questions" into a chat window and answering in text is better than nothing — it's a fine way to drill factual recall (SCD type names, ETL vs ELT definitions). But it's a text-based, untimed, no-pressure exercise where you can quietly delete and rewrite an answer before hitting enter, which is not remotely the experience of explaining a shuffle-avoidance strategy out loud, live, to a stranger who's about to ask "why not just increase the partition count instead" the second you finish your sentence. It also tends to produce follow-ups only as sharp as your own prompt — easy to accidentally interview yourself gently.

**Generic "Spark interview questions" or "Airflow interview questions" listicles.** These narrow further than a full data-engineer dump, and they're a reasonable supplement once you already have the broader picture — but stacking five separate single-topic listicles doesn't substitute for being asked to design a pipeline end-to-end, where the actual skill being tested is connecting ingestion, processing, storage, quality, and monitoring into one coherent design, out loud, under a changing constraint.

The honest throughline: real data engineering interviews — especially the pipeline-design round — reward **reasoning out loud under a follow-up that changes the scenario**, not the ability to eventually produce a correct written paragraph with unlimited time and no one watching. That's the specific gap [Greenroom](/)'s spoken mock-interview format is built to close: you talk through a pipeline-design or Spark-debugging scenario out loud, the AI interviewer asks real follow-ups the way a real data engineering panel does (a 10x scale change, a "why not just batch this instead" challenge, a request to handle a failure mode you hadn't mentioned), and you get feedback on the clarity of your reasoning, not just whether your final answer matched a model solution. It doesn't replace actually knowing the material covered above — you still need to understand SCD types and shuffles cold — but it's the only one of these methods that rehearses the verbal, interrupted, defend-your-design format the real interview actually is.

## How to prepare for a data engineer interview

Build genuine SQL fluency first — joins, window functions, query optimization — since every other round assumes it as a baseline; our [SQL interview questions guide](/blog/sql-interview-questions) covers this in depth. Then build the data-modeling vocabulary above (star vs snowflake, fact vs dimension, SCD types) until you can sketch a schema for a new use case in under two minutes without hesitating. Get comfortable narrating a pipeline-design answer end-to-end — ingestion, processing, storage, quality, monitoring — the same way the worked clickstream example above does, because that skeleton covers the large majority of design prompts you'll actually be given. Know Spark's RDD/DataFrame distinction, what a shuffle actually costs, and at least one real partition-skew story, even a small one from a course project. Know what idempotency means concretely enough to write the upsert pattern above from memory. And since data engineering interviews increasingly probe overall system thinking, it's worth skimming our general [system design interview guide](/blog/system-design-interviews-what-they-test) and, if database internals come up heavily in your loop, our [database administrator interview questions guide](/blog/database-administrator-interview-questions) for the operational side of the same concepts. As Martin Kleppmann lays out at length in *Designing Data-Intensive Applications* — required reading for this field, and a book most strong data engineers have actually read rather than skimmed — the hard part of this job is rarely the happy path; it's reasoning correctly about what happens when a node fails, a network partitions, or data arrives twice. That's exactly what interviewers are listening for, and exactly what it's worth rehearsing out loud before the real thing.

Because the interview is verbal — a panel watching you think, asking "wait, what if X" right when you thought you were done — practice has to be verbal too. Reading this guide closely is necessary, but it's a different skill from producing a clear, structured, trade-off-aware answer under mild pressure with a follow-up you didn't plan for. [Greenroom](/) runs spoken data engineering mock interviews that ask realistic follow-ups on pipeline-design and SQL reasoning, and gives feedback on how clearly you explain your design decisions — not just whether the final answer was technically correct.

## Frequently asked questions

### What questions are asked in a data engineer interview?

Data engineer interviews cover strong SQL (joins, window functions, query optimization), data modeling (star vs snowflake schema, fact and dimension tables, slowly changing dimensions), ETL vs ELT, batch vs streaming processing, Apache Spark (RDDs vs DataFrames, partitioning, shuffles), orchestration with Airflow and DAGs, data quality and validation, idempotent and reliable pipeline design, handling pipeline failures and late-arriving data, partitioning and bucketing for performance, and at least one open-ended pipeline-design scenario question.

### What is the difference between ETL and ELT?

ETL (Extract, Transform, Load) transforms data before loading it into the destination warehouse, which suits cases with limited target compute, legacy on-prem systems, or strict compliance requirements to scrub sensitive data before it's ever persisted. ELT (Extract, Load, Transform) loads raw data first and transforms it inside a powerful cloud warehouse using tools like dbt, keeping raw data available for re-processing later. Most modern cloud data stacks favor ELT because warehouses like Snowflake and BigQuery can transform at scale efficiently, and retaining raw data turns out to be valuable when requirements change.

### What's the difference between SCD Type 1, Type 2, and Type 3?

SCD Type 1 overwrites the old value with no history kept, suited to corrections where history doesn't matter. SCD Type 2 keeps full history by inserting a new versioned row with its own effective date range whenever a tracked attribute changes, which is necessary when you need to know what a dimension's value was at the time a related fact occurred. SCD Type 3 adds a new column to track only the immediately previous value, a cheaper compromise that loses anything more than one step of history.

### How do you design a reliable, idempotent data pipeline?

Design every write so re-running it produces the same end state as running it once — using delete-and-reload on a partition or an upsert keyed on a natural ID, rather than a naive append-only insert. Handle failures gracefully with orchestrator retries, checkpointing, and dead-letter handling for malformed records. Validate data quality at multiple stages with automated assertions (schema checks, row-count and null-rate baselines, dbt tests or Great Expectations), and make the pipeline observable with logging, monitoring, and alerting so failures are loud rather than silent. Account explicitly for late-arriving data with watermarks or a documented backfill process.

### When should I use batch processing versus streaming for a pipeline?

Decide based on the actual business latency requirement, not which approach feels more advanced. Batch processing (Spark batch, scheduled SQL jobs) is simpler, more throughput-efficient, and the right choice whenever the data only needs to be as fresh as "by 9am tomorrow." Streaming processing (Kafka as the event log with Flink, Kafka Streams, or Spark Structured Streaming doing the processing) is necessary when results need to be seconds or sub-second fresh, such as fraud detection during a live transaction, at the cost of meaningfully more operational complexity.

### How should I prepare for a data engineer interview?

Build genuine SQL fluency and the core data-modeling vocabulary (star vs snowflake schema, SCD types, fact vs dimension tables) until you can use them without hesitating. Practice narrating a full pipeline design out loud end-to-end — ingestion, processing, storage, data quality, monitoring — since this skeleton covers most design prompts. Know Apache Spark's RDD vs DataFrame distinction and what makes a shuffle expensive, understand idempotent pipeline design concretely enough to write an upsert pattern from memory, and rehearse explaining all of this verbally with realistic follow-up questions, since real interviews rarely stop at your first answer.

Data engineering interviews reward pipeline reasoning that survives a follow-up question, not a memorized definition. Greenroom runs spoken data engineering mock interviews that follow up on your pipeline-design and SQL reasoning, with feedback on every answer. Free to start.
