---
title: Data Analyst Interview Questions & Answers (2026): SQL, Stats & Case Studies
description: The data analyst interview questions that get asked in 2026 — SQL, statistics, A/B testing, visualization and case studies — with clear, concrete answers.
url: https://usegreenroom.app/blog/data-analyst-interview-questions
last_updated: 2026-06-20
---

← Back to blog

Roles

# Data analyst interview questions and answers

June 20, 2026 · 36 min read

![Data analyst interview questions and answers — cover from Greenroom, the AI mock interviewer](/assets/blog/data-analyst-interview-questions-hero.webp)

It's 11:47pm, the night before your final-round data analyst interview, and you are staring at a take-home case study that's due "first thing tomorrow," which in practice means 9am, which in practice means you have ten hours, six of which you intend to spend asleep. The prompt is one paragraph: weekly active users dropped 15% last month, here is a CSV, tell us what happened and what you'd do about it. There's no schema diagram. There's no second paragraph. There is, you slowly realize, no hint at all about what they actually want — because the hint *is* the test. You open the spreadsheet, you stare at fourteen columns with names like `evt_ts` and `usr_seg`, and the cursor blinks at you with the specific patience of something that has all night.

This is the part nobody warns you about going into **data analyst interviews**: the SQL syntax is genuinely the easy part. You can drill `LEFT JOIN` until it's muscle memory and still completely fall apart the moment someone hands you an ambiguous dataset and says "so, what do you think happened?" — because that question isn't really about your query-writing speed. It's about whether you can think out loud, in front of a stranger, with no clean answer waiting at the end. Plenty of analysts who write flawless window functions freeze solid the second the prompt stops being "write a query" and starts being "convince me you understand the business."

That gap — between writing correct SQL alone at 11pm and explaining a finding clearly to a skeptical hiring manager at 9am — is what this guide is built around. We'll go deep on the actual **data analyst interview questions** that come up across SQL, statistics and A/B testing, messy data, visualization, and the open-ended case study round that quietly decides most offers, and at the end we'll talk honestly about why most people only prepare for half of that equation. Along the way: the SQL joins and window functions, GROUP BY/HAVING, deduplication, A/B test pitfalls, p-values explained the way you'd actually explain them to a stakeholder, chart selection, dashboard design, and two full worked case studies with the kind of follow-up questions a real interviewer would actually ask.

## SQL (the part everyone over-prepares for, slightly wrong)

SQL is still the single biggest technical filter in a data analyst interview — but most candidates prepare for it as if the goal is to recognize a pattern and reproduce syntax, when the actual goal is to write something correct under live observation while someone watches your cursor. Below are the SQL questions that come up constantly, with the reasoning behind each answer, not just the query.

### What's the difference between INNER JOIN, LEFT JOIN, and RIGHT JOIN?

An `INNER JOIN` returns only rows that have a match in both tables — if a customer has no orders, that customer disappears entirely from the result. A `LEFT JOIN` keeps every row from the left table regardless of whether it matches the right table, filling unmatched columns with `NULL` — this is the join you reach for when you specifically want to know "which customers have *zero* orders," since they show up with `NULL` order columns instead of vanishing. A `RIGHT JOIN` is the mirror image (keep everything from the right table), and in practice most analysts just flip the table order and use `LEFT JOIN` everywhere instead of switching join types, since it reads more predictably. The interview tell: if you reach for `INNER JOIN` by default without asking "could either side be missing rows I need to see," you'll silently drop data in a real report — and the candidate who catches that out loud, unprompted, is the one the interviewer remembers.

```sql
-- Customers with no orders (LEFT JOIN + NULL check)
SELECT c.customer_id, c.name
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.customer_id
WHERE o.order_id IS NULL;
```

### What's a FULL OUTER JOIN, and when would you actually need one?

A `FULL OUTER JOIN` keeps every row from both tables, filling in `NULL` wherever there's no match on either side — it's the join you reach for when you need to audit two datasets against each other rather than enrich one with the other. The canonical interview scenario: reconciling two systems that are supposed to agree (a payments table and a CRM table that both claim to track the same customers) and finding rows that exist in one but not the other. Some database engines (MySQL historically) don't support `FULL OUTER JOIN` natively, and the workaround — a `LEFT JOIN` `UNION` a `RIGHT JOIN`, or `UNION ALL` with a dedup step — is itself a fair follow-up question, since it tests whether you understand what the join is *doing*, not just which keyword to type.

```sql
-- Find mismatches between two systems that should agree
SELECT a.customer_id, b.customer_id
FROM system_a a
FULL OUTER JOIN system_b b ON a.customer_id = b.customer_id
WHERE a.customer_id IS NULL OR b.customer_id IS NULL;
```

### GROUP BY + HAVING vs WHERE — what's the actual difference?

`WHERE` filters individual rows *before* any grouping happens; `HAVING` filters *groups* after `GROUP BY` has aggregated them, which is why you can't reference an aggregate function like `COUNT(*)` or `SUM(amount)` in a `WHERE` clause — those values don't exist yet at the row-filtering stage. A common interview trap: "find customers who placed more than 5 orders" requires `HAVING COUNT(*) > 5` after grouping by customer, not a `WHERE` clause, because "more than 5 orders" is a property of the group, not of any single row. A sharper follow-up some interviewers ask: "can you filter on both `WHERE` and `HAVING` in the same query?" — yes, and you usually should, since filtering rows early with `WHERE` before the (expensive) aggregation step is faster than aggregating everything and filtering groups afterward.

```sql
SELECT customer_id, COUNT(*) AS order_count
FROM orders
WHERE order_date >= '2026-01-01'
GROUP BY customer_id
HAVING COUNT(*) > 5;
```

### What are window functions, and when do you reach for one instead of GROUP BY?

`GROUP BY` collapses rows into one row per group, losing the original row-level detail. Window functions (`OVER (PARTITION BY ... ORDER BY ...)`) compute an aggregate or ranking *per row* while keeping every row intact — so you can show each employee's salary alongside their rank within their department, in the same result set. The two functions that come up constantly: `RANK()`/`DENSE_RANK()` for "find the second-highest value per group" (a classic interview question), and `LAG()`/`LEAD()` for comparing a row to the previous or next row in an ordered sequence — exactly what month-over-month growth needs. Know the difference between `RANK()` (ties share a rank, and the next rank skips — 1, 2, 2, 4) and `DENSE_RANK()` (ties share a rank, next rank doesn't skip — 1, 2, 2, 3); interviewers ask this specifically because it's the kind of detail that only shows up once you've been burned by it in a real report with duplicate values.

```sql
-- Second-highest salary per department
SELECT department, employee, salary
FROM (
  SELECT department, employee, salary,
         DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS rnk
  FROM employees
) ranked
WHERE rnk = 2;
```

### Write a query to find month-over-month growth.

This is a direct test of `LAG()`. You compute the current period's value, the prior period's value via `LAG()` ordered by month, then the percentage change between them — and you have to handle the first row, which has no prior month to compare against and will return `NULL`, which interviewers specifically watch for you to acknowledge rather than silently ignore. The strongest answers say out loud, unprompted: "the first row will be NULL, which is correct — there's no prior month to compare against, and I wouldn't want to accidentally show a 0% or coerce it to a misleading number."

```sql
SELECT
  month,
  revenue,
  LAG(revenue) OVER (ORDER BY month) AS prev_month_revenue,
  ROUND(
    (revenue - LAG(revenue) OVER (ORDER BY month)) * 100.0
    / LAG(revenue) OVER (ORDER BY month), 2
  ) AS mom_growth_pct
FROM monthly_revenue
ORDER BY month;
```

### What's a CTE, and why would you use one instead of a subquery?

A Common Table Expression (`WITH name AS (...)`) names a temporary result set you can reference later in the same query, the same way a subquery does — but a CTE is readable top-to-bottom, can be referenced multiple times without repeating the logic, and (with `RECURSIVE`) can express things a plain subquery structurally can't, like walking an org chart up through unknown levels of management. A deeply nested subquery that's three levels deep and squeezed into a single `WHERE` clause is correct but borderline unreadable to anyone who didn't write it, including you in three months — CTEs are how you turn that into something a reviewer (or an interviewer reading your screen-share) can actually follow without unwinding the parentheses by hand. The interview signal isn't "do you know the syntax," it's "do you structure a multi-step query so a human can audit it," which is exactly the skill that matters once your queries get reviewed by other people.

```sql
-- Customers whose most recent order is also their largest (CTE chaining)
WITH recent_orders AS (
  SELECT customer_id, order_id, amount,
         ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date DESC) AS rn
  FROM orders
),
max_orders AS (
  SELECT customer_id, MAX(amount) AS max_amount
  FROM orders
  GROUP BY customer_id
)
SELECT r.customer_id, r.order_id, r.amount
FROM recent_orders r
JOIN max_orders m ON m.customer_id = r.customer_id AND m.max_amount = r.amount
WHERE r.rn = 1;
```

### Second worked SQL problem: find customers who churned (no orders in the last 90 days, but had at least one order before that).

This question combines several of the patterns above and is a common "prove you can actually compose this stuff" follow-up after the warm-up questions. You need a CTE for last-order-date per customer, a filter for "had an order at all" (so you don't flag brand-new signups with zero history as churned), and a date comparison against today.

```sql
WITH last_order AS (
  SELECT customer_id, MAX(order_date) AS last_order_date, COUNT(*) AS total_orders
  FROM orders
  GROUP BY customer_id
)
SELECT customer_id, last_order_date
FROM last_order
WHERE total_orders > 0
  AND last_order_date < CURRENT_DATE - INTERVAL '90 days';
```

The interview-grade follow-up here is almost always "what's wrong with this definition of churn?" — and the honest answer is plenty: a customer who orders quarterly looks "churned" every single quarter under this rule, so a real churn definition should be relative to a customer's own typical purchase cadence, not one fixed global window. Naming that limitation unprompted is worth more than the query itself.

### How do you handle NULLs and duplicates in a query?

`NULL` isn't a value, it's the absence of one — `NULL = NULL` evaluates to `NULL`, not `true`, which is why you filter for missing values with `IS NULL`/`IS NOT NULL`, never `= NULL`. Functions like `COALESCE(value, 0)` substitute a default when a value is missing, and `COUNT(column)` silently skips `NULL`s while `COUNT(*)` counts every row — a distinction that quietly changes your denominator if you're not careful. For duplicates, `SELECT DISTINCT` removes exact-match duplicate rows, but real-world duplicate detection is usually partial-match (same customer, same day, different order ID from a double-submitted form) and needs `GROUP BY` with `HAVING COUNT(*) > 1` to surface candidates, followed by a `ROW_NUMBER() OVER (PARTITION BY ... ORDER BY created_at)` pattern to keep only the first or most recent row per group when you actually de-duplicate.

```sql
-- Keep only the most recent row per duplicate group
WITH ranked AS (
  SELECT *, ROW_NUMBER() OVER (
    PARTITION BY customer_id, order_date ORDER BY created_at DESC
  ) AS rn
  FROM orders
)
SELECT * FROM ranked WHERE rn = 1;
```

Our [SQL interview questions guide](/blog/sql-interview-questions) goes deeper on joins, indexing, and query performance if you want the full technical-screen treatment beyond what a data analyst role typically needs.

## Statistics & A/B testing

### Mean vs median vs mode — when is each misleading?

The mean (average) is pulled hard by outliers — a handful of $50,000 enterprise deals will drag the "average deal size" far above what a typical deal actually looks like, making the mean a poor summary for skewed distributions like income, deal size, or session duration. The median (middle value when sorted) is robust to outliers and usually the better default for skewed data — "median session length" tells you what a typical user actually experiences. The mode (most frequent value) matters for categorical or multimodal data, like "the most common subscription tier," where a mean or median wouldn't even make sense. The interview signal: can you say *why* you'd pick one over another for a specific dataset, not just define all three — and the slightly more advanced version of this question is being asked to eyeball a histogram and say whether mean or median is the better summary statistic for what you're looking at, without being told the numbers.

### What's the difference between standard deviation and variance, and why does it matter?

Variance is the average of squared differences from the mean; standard deviation is the square root of variance, which brings the unit back to the original scale (variance in "dollars squared" is meaningless, standard deviation in "dollars" is interpretable). In practice, standard deviation tells you how spread out your data is around the average — two products with the same average rating but very different standard deviations have very different stories (consistently liked vs. polarizing). Interviewers often pair this with "what's a normal distribution and the 68-95-99.7 rule," checking whether you can translate a number into "roughly two-thirds of values fall within one standard deviation of the mean" — a framing borrowed from the kind of plain-language explanation you'll find in Khan Academy's statistics material, which is exactly the register interviewers are listening for, not a textbook formula recitation.

### Correlation vs causation — give a concrete example.

Correlation means two variables move together; causation means one variable's change actually produces the other's change — and mistaking one for the other is the single most common analytical error that gets a confident-sounding report quietly wrong. Classic example: ice cream sales and drowning deaths both rise in summer and correlate strongly, but ice cream doesn't cause drowning — a third variable (hot weather, more people swimming) drives both. In a business context: "users who used feature X had 30% higher retention" doesn't mean feature X *causes* retention — power users who were already going to stick around are also the ones most likely to discover and use feature X. Proving causation generally needs a controlled experiment (A/B test) or a strong quasi-experimental design, not just an observed correlation in historical data. The follow-up most candidates aren't ready for: "okay, so how *would* you test whether feature X actually causes retention?" — and the right answer is "randomize who gets access to it," not "look harder at the historical correlation."

### Explain statistical significance and p-values at a level a non-technical stakeholder would understand.

A p-value is the probability of seeing a result at least as extreme as what you observed, *if there were actually no real effect* — a small p-value (conventionally below 0.05) means the result would be unlikely under "nothing's going on," which is evidence (not proof) that something real is happening. It does not mean "5% chance the result is wrong," and it says nothing about how *big* or *important* the effect is — a tiny, business-irrelevant lift can still be "statistically significant" with enough sample size. The explanation that actually lands with a stakeholder: "if there were truly no difference between A and B, we'd see a gap this large only 3% of the time by random chance — that's small enough that we believe the difference is real," paired immediately with the actual size of the effect and whether it's big enough to matter. This is the exact kind of explain-it-simply moment that separates candidates who've memorized the definition from candidates who can actually do the job — it's also, not coincidentally, the kind of explanation a good analytics blog (Mode Analytics' blog has several solid breakdowns of this) tends to use instead of a textbook formula.

### Walk through how you'd design an A/B test, and the common pitfalls.

Start with a single, pre-registered success metric and a minimum sample size calculated *before* the test starts, based on your baseline conversion rate and the smallest effect size you'd actually care about detecting — running the numbers afterward to justify whatever happened is backwards. Randomize assignment cleanly (consistent per user, not per session, so the same person doesn't bounce between variants), and run for at least one full business cycle to avoid day-of-week effects skewing results. The single most common pitfall interviewers probe for is **peeking** — checking results daily and stopping as soon as you see significance, which inflates your false-positive rate dramatically, since random noise will cross the significance threshold sooner or later if you check often enough. Other pitfalls worth naming: novelty effects (a new design gets a temporary bump just because it's new), sample ratio mismatch (if your 50/50 split arrives as 55/45, something's broken in your assignment logic before you even look at the outcome metric), and multiple comparisons (testing five metrics at once and reporting whichever one happened to hit p < 0.05).

### What is statistical power, and why does it matter when designing a test?

Power is the probability of detecting a real effect, given that one actually exists — a low-power test can run for weeks and come back "not significant" not because there's no effect, but because the test was never capable of finding one that size with that sample. Power depends on three things you control before the test starts: sample size, the effect size you're trying to detect, and your significance threshold — which is exactly why "let's just run it for two weeks and see" is a weak answer, and "I'd calculate the sample size needed to detect a 2% lift at 80% power before deciding how long to run it" is the answer that gets nods. The practical business translation interviewers want to hear: an underpowered test that comes back inconclusive has wasted real traffic and real time, and that cost should be weighed against the cost of running it longer before the test even launches.

### How do you detect and handle outliers?

Detect them visually first — a boxplot or scatter plot surfaces obvious outliers faster than any statistical test — then quantify with the IQR method (anything beyond 1.5× the interquartile range above Q3 or below Q1) or z-scores (typically beyond 2–3 standard deviations from the mean) for roughly normal data. The harder part is deciding what to *do* about an outlier, which depends entirely on whether it's a data error or a real extreme value: a $9,999,999 order is probably a data-entry mistake worth fixing or excluding, but a genuinely huge enterprise deal is real signal you shouldn't throw away just because it's inconvenient. Interviewers want to hear that you investigate the *cause* before deciding to remove, cap (winsorize), transform (log-scale), or keep an outlier — blanket-deleting anything more than 2 standard deviations out is a red flag, not a best practice.

![Data analyst interview topics — SQL, statistics, visualization, case studies](/assets/blog/pool-feedback-report.webp)

Data analyst rounds test SQL, stats, and how you explain insight to non-technical people.

## Handling messy and missing data

### How do you decide whether to drop or impute missing values?

It depends on how much is missing and whether it's missing at random. If under ~5% of a column is missing and there's no pattern to which rows are affected, dropping those rows rarely biases your analysis. If missingness is *not* random — say, higher-income respondents skip the income question — dropping rows introduces bias, and imputation (filling with the mean/median, or a model-based estimate) is usually better, though you should flag imputed values rather than let them masquerade as real measurements. The interview signal: naming that missingness has a *mechanism* worth investigating (missing completely at random vs. missing not at random) shows more rigor than reflexively dropping or mean-filling every gap.

### What's your process for deduplicating a messy dataset?

First define what "duplicate" means for the specific dataset — exact row match is rare in practice; usually it's "same customer, same transaction amount, within a few minutes" from a retried form submission or a duplicate webhook delivery. Group on the fields that define a logical entity, count occurrences, inspect a sample of the flagged duplicates manually before writing any deletion logic, then keep a deterministic choice (most recent, lowest ID, or whichever record has the most complete data) rather than an arbitrary one. Always log or snapshot what you removed — silently deleting rows that turn out not to be true duplicates is a much worse failure mode than leaving a few extra rows in.

### A column has inconsistent text values like "NY", "N.Y.", "New York" — how do you clean it?

This is a categorical normalization problem: build a mapping table (or a `CASE`/regex normalization step) that collapses known variants to one canonical value, rather than trying to write one clever regex that handles every case at once. Lowercase and trim whitespace first to catch the easy variants, then handle the genuine synonyms explicitly — and always run a `SELECT DISTINCT` on the column before and after cleaning to confirm you caught everything and didn't accidentally merge two genuinely different values. This question is really testing whether you treat data cleaning as a verifiable, auditable step rather than a one-off script you eyeball once and move on from.

### How would you spot a tracking or instrumentation bug just by looking at the data, before anyone tells you something's wrong?

This question is really asking whether you sanity-check your own pipeline instead of trusting numbers blindly. Tell-tale signs: a metric that suddenly jumps to exactly zero or exactly the same value across every row (a default value masquerading as real data), a sharp step-change that lines up suspiciously well with a deploy date rather than any plausible real-world cause, a column whose distribution changed shape overnight with no corresponding business event, or a count that's suspiciously round (exactly 1,000,000 events is a number a real-world process rarely produces on its own). The habit interviewers want to hear named explicitly: before you write the report, plot the raw counts over time for any metric you're about to make a claim about — a silent tracking break shows up as a kink in the line graph long before it shows up as a wrong conclusion in your slide.

## Data visualization

### How do you choose between a bar chart, a line chart, and a scatter plot?

Bar charts compare discrete categories (revenue by region, signups by channel) — the bars' relative height is what the eye reads instantly. Line charts show a trend over a continuous sequence, almost always time (daily active users over the last quarter) — using a line chart for unordered categories implies a trend that doesn't exist and misleads the viewer. Scatter plots show the relationship between two numeric variables (ad spend vs. revenue) and are the right tool specifically when you want to surface or check a correlation, not just compare totals. The wrong-chart mistake interviewers ask about most: a pie chart with more than 4–5 slices, where relative slice sizes become impossible to compare by eye — a sorted bar chart almost always communicates the same data more clearly.

### What makes a dashboard usable for a non-technical executive?

A dashboard built for an exec should answer "is this good or bad, and what should I do about it" in under 10 seconds — that means leading with the 2-4 numbers that actually matter (not everything you *can* measure), showing trend and target/benchmark alongside the raw number so context is immediate, and using color sparingly and consistently (red/green tied to a real threshold, not decoration). Common mistakes: too many charts competing for attention, inconsistent time windows across panels so numbers can't be compared, and burying the one number that matters under five that don't. The deeper interview signal is whether you think about the *audience's* job, not your own analytical process — an exec dashboard and an analyst's exploratory workbook should look nothing alike. If your background includes Tableau or Power BI work, expect a follow-up on tool-specific dashboard mechanics — our [Tableau interview questions](/blog/tableau-interview-questions) and [Power BI interview questions](/blog/power-bi-interview-questions) guides go deeper on those specifically.

### What chart mistakes do you watch for in your own work before sharing it?

Truncated y-axes that exaggerate a small difference into a dramatic-looking spike, dual-axis charts that visually imply a correlation between two metrics on unrelated scales, too many colors/series on one chart so no single trend is readable, and unlabeled axes or units that force the viewer to guess. The habit that actually prevents these: before sending a chart, ask "if I removed all context and just looked at the shape, would I draw the right conclusion" — if the visual exaggerates the story more than the underlying numbers support, fix the chart, not the explanation you'll give verbally afterward.

## Spreadsheets still matter — what comes up about Excel/Sheets

SQL has taken over as the headline technical filter, but plenty of data analyst interviews — especially at smaller companies, agencies, and roles closer to operations or finance — still test spreadsheet fluency directly, and even SQL-heavy roles sometimes use Excel as a faster whiteboard for a quick concept check.

### VLOOKUP vs INDEX-MATCH vs XLOOKUP — what's the difference and when do you use each?

`VLOOKUP` searches the first column of a range for a match and returns a value from a column to its right — simple, but it breaks if a column gets inserted (since it references column position by a hardcoded number) and it can only look rightward, never left. `INDEX-MATCH` separates the two jobs: `MATCH` finds the row position of a value, `INDEX` returns the value at that row in any column you specify — which means it can look in either direction and survives inserted columns since it doesn't hardcode an offset. `XLOOKUP` (in modern Excel and Sheets) is the newer, cleaner version that does what INDEX-MATCH does in one function, defaults to exact match (VLOOKUP defaults to approximate match, a classic silent-bug source), and can return multiple columns at once. The interview signal: do you know *why* INDEX-MATCH became the long-standing best practice over plain VLOOKUP, not just that it exists.

### What's a pivot table actually doing, and when would you reach for one instead of a SQL GROUP BY?

A pivot table is a GUI-driven aggregation tool — drag a field into "rows," another into "values," pick SUM/COUNT/AVERAGE, and it produces the same kind of grouped summary a `GROUP BY` query would, without writing SQL. The honest answer to "when would you use one instead of SQL" is usually about audience and speed: a pivot table is faster to build for a one-off exploratory question, and — critically — it's editable live by a non-technical stakeholder who wants to drag a different field in and see the number change themselves, which a static SQL query result can't offer. For anything that needs to run repeatedly, be version-controlled, or operate on data too large for a spreadsheet to hold comfortably, SQL is the better tool. Knowing both, and knowing which one fits a given moment, is the actual signal.

### A stakeholder sends you a spreadsheet with formulas instead of values, and the numbers look wrong — what do you check first?

Trace the formula chain, starting from the cell that looks wrong: click into it, see what it references, and walk backward until you find either a hardcoded number that shouldn't be hardcoded (a classic spreadsheet failure mode — someone pasted a value over a formula months ago and nobody noticed) or a range reference that didn't expand when new rows were added below it. The second most common culprit is a `SUM` or `AVERAGE` range that was set once and never updated, silently excluding new rows — which is functionally the same bug class as a dashboard time window that doesn't update, just in spreadsheet form. Naming "I'd check whether the range is dynamic or hardcoded" out loud is exactly the kind of practical debugging instinct interviewers are listening for.

## Where most people actually prepare — and where it falls short

Here's the thing about the 11:47pm scenario from the intro: almost nobody arrives there unprepared in the sense of "doesn't know SQL." Most candidates have done real prep. The honest question is *which kind*, and whether it trains the part of the interview that actually decides the outcome.

**LeetCode and StrataScratch-style SQL practice.** These platforms are genuinely good for what they do — timed, graded SQL problems that build real fluency with joins, window functions, and the kind of query patterns covered above. The gap: they're entirely text-in, text-out. You write a query, you get a pass/fail, and nobody ever interrupts you mid-query to ask "why did you choose a window function over a subquery there?" — which is exactly what happens in a live screen-share interview the moment you stop typing. Drilling 150 SQL problems silently builds the same kind of one-dimensional fluency that LeetCode grinding builds for software engineers: real, necessary, and entirely insufficient on its own for a round that's actually graded on whether you can explain your reasoning out loud while someone watches.

**GeeksforGeeks-style "data analyst interview questions" dumps.** Useful for knowing what topics show up — several of the questions in this guide will look familiar if you've browsed one. The risk is mistaking a question list for an answer key. A GfG-style dump gives you the prompt and a model answer; it doesn't simulate the moment a real interviewer changes the question mid-stream ("okay, now what if 10% of that data is duplicated — does your number still hold?") specifically to see whether you actually understood the underlying logic or just memorized one fixed path through it.

**A friend's WhatsApp PDF of "questions they asked me."** Genuinely valuable for calibration — knowing that a friend's case study round was about churn instead of acquisition tells you something real about what to expect at that company. The limitation is that it's one data point from one candidate on one team, and "the case study they gave me" is not a guarantee of what you'll get. It also, notably, never tells you *how* your friend actually answered out loud, follow-ups and all — only the questions, never the performance.

**Generic ChatGPT mock-interview prompting.** Typing "ask me data analyst interview questions" into a chat window and typing your answers back is better than nothing, and it's genuinely fine for drilling factual recall — definitions, SQL syntax, statistics terminology. But it's a text-based, turn-taking exercise with no real-time pressure and no spoken delivery practice, and the follow-ups are only as sharp as the prompt you wrote — which means it's easy to accidentally write yourself an easy interview. Crucially, the data analyst case-study round is verbal: a real interviewer watches you think, interrupts you, and asks "why" the moment your explanation gets vague. None of that exists in a chat window, no matter how good the model answering you is.

The honest throughline: every one of these methods trains the writing-and-recall half of the interview well, and none of them touch the spoken half — clarifying an ambiguous prompt out loud, narrating your reasoning while someone watches, and explaining a finding simply under a follow-up question you didn't prepare for. That's the specific gap [Greenroom](/)'s spoken mock-interview format is built to close: you talk through a SQL problem or a case study out loud, the AI interviewer asks real follow-ups the way a real hiring panel does (a constraint change, a "why not segment by device instead" challenge, a request to explain your finding in one sentence to a non-technical exec), and you get feedback on clarity of reasoning, not just whether your final number was correct. It doesn't replace the SQL reps — you still need those — but it's the one method on this list that actually rehearses the part of the interview that happens in real time, out loud, with someone watching.

## Case studies — where interviews are actually won

### "Daily active users dropped 15% week-over-week — walk me through how you'd investigate."

Start by clarifying the metric itself before touching any data: is this measured correctly, did the tracking or instrumentation change recently, and is 15% actually unusual given normal week-to-week variance, or within the range of typical noise? Next, segment the drop to localize it — by platform (iOS vs. Android vs. web), by geography, by new vs. returning users, by acquisition channel — because "DAU dropped 15%" overall could mean "DAU dropped 60% on Android due to a broken release" with everything else flat, which is a completely different problem with a completely different owner. Once segmented, form specific, testable hypotheses (a recent deploy, a marketing campaign that ended, a competitor launch, a seasonal pattern from last year's data) and check each against the data rather than guessing out loud. Close by stating what you'd do next — who you'd loop in, what you'd monitor over the following days — and how you'd communicate the finding to a stakeholder in one sentence, not a slide deck. Interviewers are scoring the structure of your approach as much as the final answer, since in a real job there often isn't a clean answer to land on in 20 minutes.

The follow-up that actually separates strong candidates here: "say your segmentation shows the drop is concentrated entirely on Android — what do you do in the next ten minutes?" The weak answer keeps analyzing. The strong answer says something like "I'd flag it to the mobile team immediately, since a platform-specific cliff that sharp is almost always a release issue, and the cost of being five minutes late with that flag is much higher than the cost of one more segmentation cut I could do instead."

### "Revenue is flat but the team says the product is improving — how do you reconcile that?"

This tests whether you reach for segmentation instead of accepting an aggregate number at face value. Flat overall revenue can hide real improvement if, for example, new-customer revenue is growing nicely while churn from an older cohort is offsetting it exactly — two real, opposite stories canceling out into one boring topline number. The right move is to break revenue into components (new vs. existing, by cohort, by plan tier) and check each independently before concluding the product isn't working — and to ask what "improving" means concretely (engagement, NPS, feature adoption) so you can check whether *that* metric actually moved, separately from revenue, which often lags product changes by weeks or months.

### Second case study: "We launched a new onboarding flow a month ago. Signups are up 20%, but the team that built it says the launch was a failure. What's going on?"

This is a slightly nastier version of the same skill, because the surface metric (signups, up 20%) looks like unambiguous good news, and the job is figuring out why the people closest to the feature disagree with that read. The right opening move is the same as always: ask what metric the team actually cares about, because "signups" and "activated users" and "retained users" are three different funnels, and a flow that's great at getting people to sign up but bad at getting them to actually activate would produce exactly this disagreement — more top-of-funnel volume, worse-quality users flowing through it. A strong candidate checks downstream: did activation rate (the percentage of new signups who complete a meaningful first action) go up, stay flat, or drop, since a flow that front-loads enthusiasm but confuses people two steps later can absolutely produce "more signups, more frustrated users, net worse outcome." The case study is deliberately structured so that the headline number and the real story point in opposite directions — and the interviewer is watching whether you trust the headline number or go find the one that actually matters.

### How do you communicate a finding to someone who isn't technical?

Lead with the conclusion and the number that matters, in one sentence — not the methodology, not the caveats, not how you got there. "Checkout conversion dropped 8% after the redesign, concentrated entirely on mobile" lands; "I ran a comparison of pre- and post-redesign sessions segmented by device type and found a statistically significant decrease" does not, even though it's the same finding. Follow the headline with the one piece of context that changes what the listener should do (it's mobile-only, it started the day of the redesign), and hold the detailed methodology in reserve for whoever asks a follow-up — most won't, and the ones who do are exactly the people who should get the deeper answer.

<div class="verdict"><strong>The core truth:</strong> A data analyst's real job is turning numbers into decisions. Interviews reward the candidate who pairs solid SQL and statistics with the ability to <em>explain an insight simply</em> — because an analysis no one understands, however correct, is worthless to the person who has to act on it.</div>

![A structured screen-share interview in progress, candidate talking through a problem](/assets/blog/pool-structured-screen.webp)

A case-study round is a conversation, not a written exam — the interviewer is grading how you think out loud as much as what you conclude.

## Practise explaining, not just memorizing

Go back to that 11:47pm moment. The thing that actually determines whether tomorrow goes well isn't whether you know what a window function is — by the night before your final round, you almost certainly do. It's whether you can sit across from a stranger at 9am, with a vague prompt and a blinking cursor, and *say out loud*, clearly and in order, what you'd check first, what you'd check second, and why. That's a different skill from getting the right SQL syntax at 11pm with nobody watching, and it's the skill most prep methods quietly skip.

You can write a flawless window function and still freeze when an interviewer asks you to walk through a case study out loud with follow-up questions. The case-study and "explain this insight" rounds are spoken and structured — exactly the part that trips up technically strong candidates who've only ever practiced writing queries silently. [Greenroom](/) runs spoken data analyst mock interviews, pushes on your SQL and statistical reasoning with realistic follow-ups, and gives feedback on how clearly you explain a finding, not just whether the query ran. Pair it with our [SQL interview questions guide](/blog/sql-interview-questions), [how to explain your projects in an interview](/blog/how-to-explain-your-project-in-an-interview), and — if your role overlaps with stakeholder-facing analysis — [business analyst interview questions](/blog/business-analyst-interview-questions) and [coding-interview communication tips](/blog/coding-interview-communication-tips), which covers the same "narrate your reasoning out loud" skill from the software-engineering side.

## Frequently asked questions

### What questions are asked in a data analyst interview?

Data analyst interviews test SQL (joins, GROUP BY and HAVING, window functions, CTEs, deduplication), statistics and A/B testing (mean vs median, standard deviation, correlation vs causation, p-values, experiment design, statistical power, outliers), data visualization (choosing the right chart, dashboard design for non-technical audiences), handling messy and missing data, and open analytical case studies where you investigate a metric change and communicate findings clearly.

### How important is SQL for a data analyst interview?

SQL is the core technical filter for almost every data analyst role — you'll be asked to write joins, GROUP BY/HAVING aggregations, window functions, CTEs, and queries like second-highest value, ranking within groups, and month-over-month growth using LAG(). Strong, fluent SQL is often the single biggest determinant of whether you pass the technical screen, since it's usually evaluated live or on a timed take-home.

### What is a data analyst case study interview?

A case study gives you an open analytical prompt — for example "daily active users dropped 15% last week, how would you investigate?" Interviewers want a structured approach: clarify the metric, segment the data to localize the problem, form specific hypotheses, state what you'd query to test each one, and communicate findings clearly. It tests analytical reasoning and communication as much as raw technical skill.

### What are the most common A/B testing mistakes data analysts get asked about?

The most common is "peeking" — checking results daily and stopping as soon as you see significance, which inflates the false-positive rate well above the stated threshold. Others worth knowing: not calculating a required sample size or statistical power before starting the test, novelty effects inflating early results, sample ratio mismatch signaling a broken randomization, and running multiple metric comparisons without correcting for the fact that some will hit significance by chance alone.

### How do I pick the right chart type in an interview whiteboard exercise?

Match the chart to the question, not the data you happen to have: bar charts for comparing discrete categories, line charts for a trend over time, and scatter plots specifically when you want to show or check the relationship between two numeric variables. Avoid pie charts with more than 4-5 slices and truncated y-axes that exaggerate small differences — naming these as mistakes you actively avoid is itself a strong interview signal.

### How should I prepare for a data analyst interview?

Build fluent SQL (joins, window functions, CTEs, aggregation) and solid applied statistics including A/B testing fundamentals and statistical power, practise chart selection and dashboard critique, and rehearse open case studies end to end, ideally two or three different ones so you're not just pattern-matching to a single rehearsed structure. Crucially, practise explaining insights simply out loud, since the case-study and "explain this finding" rounds are spoken and trip up technically strong candidates who've only rehearsed silently. A voice-based mock interview that pushes on your reasoning and communication helps close that specific gap.

Data analyst rounds reward explaining insight simply, out loud. Greenroom runs spoken interviews that push on your SQL, statistical reasoning, and communication with feedback. Free to start.
