Data engineers build the pipelines and infrastructure that move and shape data for analytics and ML. Interviews test SQL fluency, ETL/ELT design, data warehousing, distributed processing (typically Spark), and data modeling. Below are the data engineer interview questions that come up most often. (See also our SQL and data analyst guides.)
SQL & data modeling
- Strong SQL — joins, window functions, optimization (our SQL guide).
- Star vs snowflake schema; fact and dimension tables.
- Normalization vs denormalization for analytics.
- Slowly changing dimensions (SCD).
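Slowly changing dimensions are easier to discuss with a concrete case in hand. The sketch below shows one common variant, SCD Type 2, which keeps history by closing the current row and appending a new version. The row layout (`key`, `attrs`, `valid_from`, `valid_to`) and the function name `apply_scd2` are assumptions made for illustration, not a standard API.

```python
from datetime import date


def apply_scd2(dimension, key, new_attrs, effective):
    """Apply one change to an in-memory dimension table, SCD Type 2 style.

    Mutates `dimension` (a list of dicts) in place and returns it.
    The row layout is hypothetical and chosen only for this example.
    """
    # The current version of a key is the row whose valid_to is still open.
    current = next(
        (row for row in dimension if row["key"] == key and row["valid_to"] is None),
        None,
    )
    if current is not None:
        if current["attrs"] == new_attrs:
            return dimension  # nothing changed, so no new version is needed
        current["valid_to"] = effective  # close the old version
    dimension.append(
        {"key": key, "attrs": dict(new_attrs), "valid_from": effective, "valid_to": None}
    )
    return dimension


dim = []
apply_scd2(dim, "cust-1", {"city": "Pune"}, date(2024, 1, 1))
apply_scd2(dim, "cust-1", {"city": "Mumbai"}, date(2024, 6, 1))
apply_scd2(dim, "cust-1", {"city": "Mumbai"}, date(2024, 7, 1))  # unchanged, ignored
```

After these three calls the dimension holds two rows for `cust-1`: the Pune row closed on 2024-06-01 and an open Mumbai row. In an interview, be ready to contrast this with Type 1, which simply overwrites the old value and keeps no history.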
Pipelines & processing
- ETL vs ELT — the difference and when to use each.
- How do you design a reliable, idempotent data pipeline?
- Batch vs streaming processing.
- Apache Spark — RDDs, DataFrames, partitioning, shuffles.
- Orchestration — Airflow and DAGs.
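To make the DAG idea concrete, here is a minimal sketch of what an orchestrator does at its core: work out an execution order in which every task runs after its upstream dependencies. It uses Python's standard-library `graphlib`, not Airflow's API, and the task names are made up for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dag = {
    "extract_orders": set(),
    "extract_customers": set(),
    "transform_join": {"extract_orders", "extract_customers"},
    "load_warehouse": {"transform_join"},
}


def run_order(dag):
    """Return one valid execution order; upstream tasks always come first."""
    return list(TopologicalSorter(dag).static_order())


order = run_order(dag)
print(order)
```

The two extract tasks have no dependency on each other, so their relative order is not fixed, and a real orchestrator would be free to run them in parallel. If the graph contained a cycle, `graphlib` would raise `CycleError`, which is why such workflows have to be acyclic.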
Warehousing & quality
- Data warehouses vs data lakes vs lakehouses.
- How do you handle data quality and bad data?
- Partitioning and bucketing for performance.
- How do you handle a pipeline failure or late-arriving data?
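One way to ground the bad-data and late-data questions is a small routing step: validate each record, quarantine the failures with a reason, and assign clean records to a partition by event time rather than arrival time. The sketch below assumes a hypothetical record shape (`order_id`, `amount`, `event_time` as an ISO string) and two example rules; real pipelines would carry many more.

```python
from collections import defaultdict
from datetime import datetime

REQUIRED = ("order_id", "amount", "event_time")  # hypothetical schema


def validate(record):
    """Return a list of rule violations; an empty list means the record is clean."""
    errors = [f"missing {field}" for field in REQUIRED if record.get(field) is None]
    amount = record.get("amount")
    if amount is not None and amount < 0:
        errors.append("negative amount")
    return errors


def route(records):
    """Split a batch into event-date partitions and a quarantine list."""
    partitions = defaultdict(list)
    quarantine = []
    for record in records:
        errors = validate(record)
        if errors:
            # Keep the bad record and the reason so it can be inspected and replayed.
            quarantine.append({"record": record, "errors": errors})
        else:
            # Partition by when the event happened, not when it arrived.
            day = datetime.fromisoformat(record["event_time"]).date().isoformat()
            partitions[day].append(record)
    return dict(partitions), quarantine


batch = [
    {"order_id": 1, "amount": 20.0, "event_time": "2024-05-02T09:00:00"},
    {"order_id": 2, "amount": -5.0, "event_time": "2024-05-02T10:00:00"},
    {"order_id": 3, "amount": 12.5, "event_time": "2024-05-01T23:50:00"},  # late arrival
    {"order_id": None, "amount": 3.0, "event_time": "2024-05-02T11:00:00"},
]
partitions, quarantine = route(batch)
```

Here order 3 lands in the 2024-05-01 partition even though it arrived with the next day's batch, which tells the pipeline that partition needs to be rebuilt. The two invalid records go to quarantine instead of failing the whole run.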
How to prepare
Data engineering rounds mix SQL with pipeline design discussions. Practise explaining ETL/ELT and pipeline reliability out loud. Greenroom runs spoken technical interviews that follow up on your reasoning. Pair it with our SQL and system design guides.
Frequently asked questions
What questions are asked in a data engineer interview?
Data engineer interviews cover SQL (joins, window functions, optimization), data modeling (star vs snowflake schema, fact and dimension tables, slowly changing dimensions), ETL vs ELT, designing reliable, idempotent pipelines, batch vs streaming, Apache Spark (RDDs, DataFrames, partitioning, shuffles), orchestration with Airflow, data warehouses vs lakes vs lakehouses, data quality, and handling pipeline failures and late-arriving data.
What is the difference between ETL and ELT?
ETL (Extract, Transform, Load) transforms data before loading it into the destination warehouse, suiting cases with limited target compute or strict pre-load cleansing. ELT (Extract, Load, Transform) loads raw data first and transforms it inside a powerful cloud warehouse, leveraging its scalable compute and keeping raw data available. Modern cloud data stacks often favor ELT because warehouses like Snowflake and BigQuery can transform at scale efficiently.
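The difference is easiest to see side by side. The sketch below uses an in-memory SQLite database as a stand-in for a warehouse (an assumption for the sake of a runnable example; the table names and sample rows are made up). The ETL path cleans rows in application code before loading; the ELT path loads the raw rows untouched and cleans them with SQL inside the database.

```python
import sqlite3

# Hypothetical messy source rows: (customer, amount as text).
raw_rows = [("  Alice ", "10.50"), ("BOB", "4"), ("alice", "2.25")]


def clean(name, amount):
    """Transform step used by the ETL path, run outside the database."""
    return name.strip().lower(), float(amount)


conn = sqlite3.connect(":memory:")

# ETL: transform first, then load only the cleaned rows.
conn.execute("CREATE TABLE etl_sales (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO etl_sales VALUES (?, ?)", [clean(n, a) for n, a in raw_rows]
)

# ELT: load the raw rows as-is, then transform with SQL in the database.
conn.execute("CREATE TABLE raw_sales (customer TEXT, amount TEXT)")
conn.executemany("INSERT INTO raw_sales VALUES (?, ?)", raw_rows)
conn.execute(
    """
    CREATE TABLE elt_sales AS
    SELECT lower(trim(customer)) AS customer, CAST(amount AS REAL) AS amount
    FROM raw_sales
    """
)


def totals(table):
    """Sum amounts per customer; `table` is one of the fixed names above."""
    query = f"SELECT customer, SUM(amount) FROM {table} GROUP BY customer ORDER BY customer"
    return conn.execute(query).fetchall()


etl_totals = totals("etl_sales")
elt_totals = totals("elt_sales")
```

Both paths produce the same totals, but only the ELT path still has `raw_sales` available, so a changed cleaning rule can be reapplied without re-extracting from the source.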
How do you design a reliable data pipeline?
Design for idempotency so that re-running a job produces the same result without duplicates, handle failures gracefully with retries and checkpoints, validate data quality at each stage, and make pipelines observable with logging, monitoring and alerting. Account for late-arriving and bad data, use an orchestrator such as Airflow to manage dependencies, and partition data for performance. In most pipelines, reliability and reproducibility matter more than raw speed.
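Idempotency is the part of this answer interviewers most often ask you to make concrete. One common pattern is to overwrite a whole partition inside a single transaction, so a retry replaces the data rather than appending to it. The sketch below assumes an in-memory SQLite table named `daily_sales` and a hypothetical `load_partition` helper; a warehouse would use its own partition-overwrite or merge mechanism.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (day TEXT, order_id INTEGER, amount REAL)")


def load_partition(conn, day, rows):
    """Replace all rows for `day` atomically so reruns cannot create duplicates."""
    with conn:  # commits on success, rolls back if any statement raises
        conn.execute("DELETE FROM daily_sales WHERE day = ?", (day,))
        conn.executemany(
            "INSERT INTO daily_sales VALUES (?, ?, ?)",
            [(day, order_id, amount) for order_id, amount in rows],
        )


rows = [(1, 20.0), (2, 12.5)]
load_partition(conn, "2024-05-02", rows)
load_partition(conn, "2024-05-02", rows)  # simulated retry of the same run

count = conn.execute("SELECT COUNT(*) FROM daily_sales").fetchone()[0]
total = conn.execute("SELECT SUM(amount) FROM daily_sales").fetchone()[0]
```

After the retry the table still holds two rows, not four. Because the delete and insert share one transaction, a failure midway leaves the previous contents of the partition in place rather than a half-loaded day.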
How should I prepare for a data engineer interview?
Build strong SQL and data modeling skills, and make sure you understand ETL vs ELT, pipeline reliability and idempotency, Spark, and warehousing concepts. Because rounds mix SQL with pipeline-design discussions, practise explaining out loud how you would design a reliable pipeline and handle failures. A voice-based mock interview that follows up on your reasoning is a good way to rehearse.