---
title: Data Scientist Interview Questions & Answers (2026): Stats, ML & Case Studies
description: The data scientist interview questions that get asked in 2026 — statistics, machine learning, feature engineering and case studies — with worked examples and how to prepare.
url: https://usegreenroom.app/blog/data-scientist-interview-questions
last_updated: 2026-06-20
---

# Data scientist interview questions and answers

June 20, 2026 · 33 min read

![Data scientist interview questions and answers — cover from Greenroom, the AI mock interviewer](/assets/blog/data-scientist-interview-questions-hero.webp)

You're on a take-home video round for a data scientist role — the kind where you record yourself answering on Loom because the company couldn't schedule a live slot — and the prompt on screen says: "Explain a p-value to a five-year-old." You pause the recording. You un-pause it. You say "okay so, imagine you flip a coin," and then you stop, because you've just realized you don't actually know how to finish that sentence without using the word "probability," which a five-year-old does not know, and which you're now not sure *you* know either, despite having used p-values in four years of analysis. You re-record six times. The seventh take is the one you submit, and it still has you saying "...and that's basically how confident we are, sort of" with the specific cadence of a man drowning in shallow water.

This is the data scientist interview in miniature: not whether you know the concept — you do — but whether you can produce it on demand, out loud, under a clock, for an audience that isn't another data scientist. The same gap shows up on whiteboards. A candidate at a Bangalore analytics firm was once asked to sketch a **confusion matrix** during a panel round and drew it upside down — true positives in the bottom-right instead of top-left — confidently labeled precision and recall backwards, and didn't notice until the interviewer, very gently, asked "so in your matrix, what does a false negative cost the business?" and the candidate realized they'd just described a false positive. Nobody fails an interview for not knowing what a confusion matrix is. People fail because the matrix that lives cleanly in their head turns into a slightly-wrong sketch the second a real person is watching them draw it.

And it isn't just stats. A different panel, right after lunch, opened with "so — what's the bias-variance tradeoff?" to a candidate who had, twenty minutes earlier, eaten a slightly too-ambitious biryani and was now fighting a very different kind of variance in his own stomach. He gave a textbook-perfect answer about underfitting and overfitting on autopilot, the interviewer nodded, and then asked the follow-up that actually mattered — "okay, so for *this* churn model, which side of that tradeoff would you accept more of, and why" — and the autopilot ran out of fuel. Definitions are free. Applying them to someone else's exact problem, on the spot, is the entire interview.

This guide covers the **data scientist interview questions** that actually come up in 2026: statistics fundamentals, model evaluation, feature engineering, overfitting and validation, model selection, SQL/Python basics, messy-data cleaning, and a full worked case-study scenario. Each one comes with a real answer, not a flashcard definition, and a note on what the question is actually testing. (See also our guides to [data analyst interviews](/blog/data-analyst-interview-questions), [machine learning engineer interviews](/blog/machine-learning-engineer-interview-questions), and [SQL interview questions](/blog/sql-interview-questions).)

## Statistics and probability fundamentals

Every data science loop opens here, because statistics is the part you can't fake your way through with a well-organized GitHub repo. Interviewers use this section to separate people who took a stats class from people who *think* statistically — and the tell is almost always whether you can apply a concept to a concrete number, not just recite its name.

### What is a p-value, really — and can you walk through calculating one?

A **p-value** is the probability of observing a result *at least as extreme* as the one you got, assuming the null hypothesis is true. It is not the probability that the null hypothesis is true, and it is not the probability that your finding is "real". Those are the two most common wrong answers interviewers hear, and both reveal a memorized definition rather than an understood one.

Here's a worked example, the kind an interviewer might actually ask you to walk through live. Say you run an A/B test on a checkout page: the control converts 100 out of 2,000 visitors (5.0%), and the variant converts 130 out of 2,000 visitors (6.5%). Is that difference real, or noise?

1. **State the null hypothesis.** H₀: the true conversion rates are equal; any observed difference is due to random sampling.
2. **Pick a test.** For two proportions, a two-proportion z-test is standard. Pooled proportion: p̂ = (100 + 130) / (2000 + 2000) = 0.0575.
3. **Compute the standard error.** SE = √(p̂(1−p̂)(1/n₁ + 1/n₂)) = √(0.0575 × 0.9425 × (1/2000 + 1/2000)) ≈ 0.00736.
4. **Compute the z-statistic.** z = (0.065 − 0.05) / 0.00736 ≈ 2.04.
5. **Convert to a p-value.** A z of 2.04 on a two-tailed test corresponds to a p-value of roughly 0.042.

```text
p ≈ 0.042 → below the conventional 0.05 threshold →
reject H0 → the lift is "statistically significant" at the 5% level
```

That 0.042 means: *if the true conversion rates were actually identical*, you'd see a gap this large or larger only about 4.2% of the time by chance alone. It crosses the conventional 0.05 threshold, so you'd call the result statistically significant — but a good candidate immediately adds the caveat interviewers are listening for: statistical significance isn't the same as business significance, and a single 4,000-visitor test sitting right at the edge of 0.05 is a fragile thing to bet a product roadmap on. **Statistical significance** describes confidence in *whether* an effect exists, not how large or important it is.
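To sanity-check the arithmetic above, here is a minimal Python sketch of the same pooled two-proportion z-test using only the standard library. The function name `two_proportion_z_test` is made up for this example, and the two-tailed p-value is taken from the normal tail via `math.erfc`. Carrying full precision, it should land near z ≈ 2.04 and p ≈ 0.042; small differences in the second decimal of z come down to how early you round the standard error.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion z-test; returns (z, two-tailed p-value)."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-tailed p-value under the standard normal: P(|Z| >= |z|) = erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Control: 100 of 2,000 convert. Variant: 130 of 2,000 convert.
z, p_value = two_proportion_z_test(100, 2000, 130, 2000)
print(f"z = {z:.2f}, p = {p_value:.3f}")
```

In a live round you would not be expected to type this out, but being able to say which quantity goes where (pooled rate in the standard error, observed rates in the numerator) is the part that tends to get probed.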

### What is the Central Limit Theorem, and why does it matter in practice?

The **Central Limit Theorem (CLT)** states that the distribution of the *sample mean* approaches a normal distribution as sample size grows, regardless of the shape of the underlying population distribution, as long as the samples are independent, the population has finite variance, and the sample size is reasonably large (a common rule of thumb is n ≥ 30, though it depends on how skewed the original distribution is).

This matters practically because almost every hypothesis test, confidence interval, and A/B test calculation above leans on an assumption of normality somewhere — and the CLT is *why* that assumption is defensible even when the raw underlying data (revenue per user, time-on-page, anything heavily right-skewed) is nowhere close to normal. You don't need the individual user-level data to be normal; you need the *distribution of sample means* to be, and the CLT guarantees that once you're averaging over enough independent observations. The honest follow-up interviewers ask: what happens with a small sample, or with strongly dependent observations (like repeated measurements from the same user)? The CLT's guarantees weaken in both cases, which is exactly why small-sample A/B tests are so often misread as "significant" when they're really just noisy.
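A quick simulation makes the claim concrete. The sketch below is illustrative only: it assumes an exponential population (heavily right-skewed, mean 1, standard deviation 1) as a stand-in for something like revenue per user, and the sample size of 50 and the 2,000 repetitions are arbitrary choices. If the CLT is doing its job, the sample means should cluster around 1 with a spread close to 1/√50.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

def sample_mean(n):
    # Exponential(1) draws are heavily right-skewed (mean 1, standard deviation 1).
    return statistics.fmean(random.expovariate(1.0) for _ in range(n))

n = 50
means = [sample_mean(n) for _ in range(2000)]

center = statistics.fmean(means)
spread = statistics.stdev(means)
print(f"mean of sample means: {center:.3f} (population mean is 1.0)")
print(f"std of sample means:  {spread:.3f} (CLT predicts about {1 / n ** 0.5:.3f})")
```

Dropping `n` to 3 or 5 and plotting a histogram of `means` is a reasonable way to see the skew come back, which is the small-sample caveat in action.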

### Explain Type I and Type II errors with a concrete example.

A **Type I error** (false positive) is rejecting a true null hypothesis: concluding there's an effect when there isn't one. A **Type II error** (false negative) is failing to reject a false null hypothesis: missing a real effect. Interviewers like to hear the definitions tied to a real scenario. In a fraud-detection model, where the null is "this transaction is legitimate," a Type I error flags a legitimate transaction as fraud (annoying the customer, costing a sale), while a Type II error lets an actual fraudulent transaction through (costing the company money directly). These two errors trade off against each other: tightening your threshold to reduce one almost always increases the other. The "right" balance depends entirely on which mistake is more expensive for the specific business, which is precisely the kind of judgment call interviewers want you to make explicit rather than leave implicit.

**Significance level (α)** is the Type I error rate you're willing to accept, conventionally 0.05. **Power** (1 − β) is the probability of correctly detecting a real effect when one exists — a chronically underpowered test (too small a sample) is exactly how real effects get missed and incorrectly written off as "no significant difference."

### What is a confidence interval, and what does "95% confidence" actually mean?

A **confidence interval** gives a range of plausible values for a population parameter, built from sample data. The frequently botched explanation: "95% confidence" does *not* mean there's a 95% chance the true value lies in this specific interval. It means that if you repeated the sampling process many times and built an interval the same way each time, about 95% of those intervals would contain the true population parameter. The true value is fixed; it's the *interval* that's random across repeated sampling, not the other way around. Interviewers ask this specifically because almost everyone gets it backwards on the first try, and noticing your own slip mid-answer is itself a decent signal.

Practically: a 95% CI of roughly [0.1, 2.9] percentage points for the conversion lift in the earlier example tells a stakeholder the plausible range of the true effect, and shows that it only barely excludes zero. That is usually more useful for a real decision than a single point estimate and a binary "significant/not significant" verdict, and a case-study answer that leads with the interval instead of just the p-value reads as more mature.
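For the checkout A/B numbers used earlier (100 of 2,000 vs. 130 of 2,000), an interval for the lift can be sketched as follows. This is one common construction, a Wald interval with the unpooled standard error and the usual 1.96 critical value, not the only valid one; the helper name `diff_ci_95` is invented for the example.

```python
import math

def diff_ci_95(conv_a, n_a, conv_b, n_b, z_crit=1.96):
    """Wald 95% interval for (rate_b - rate_a), using the unpooled standard error."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    diff = p_b - p_a
    return diff - z_crit * se, diff + z_crit * se

low, high = diff_ci_95(100, 2000, 130, 2000)
# Report in percentage points, since the lift itself is 1.5 percentage points.
print(f"lift: 1.5 pp, 95% CI: [{low * 100:.2f}, {high * 100:.2f}] pp")
```

The lower bound sits just above zero, which is the interval's way of saying what a p-value near 0.04 says: the effect is probably real, but a lift close to nothing is still on the table.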

### Correlation does not imply causation — give a real example and explain how you'd actually test for causation.

The textbook line everyone knows: correlation measures whether two variables move together; causation means one variable's change actually *produces* the change in the other. The interview-winning version goes further and supplies a concrete confound. Ice cream sales and drowning deaths are correlated — both rise in summer — but neither causes the other; the confound is **temperature**. A subtler, more realistic example for a data science interview: users who engage with a new onboarding email convert at a higher rate, but it might be that *more motivated users* both open emails and convert, regardless of the email itself (a **selection effect**, not a causal one).

To actually test causation rather than just observe correlation, you reach for: a **randomized controlled experiment / A/B test** (the gold standard, since randomization breaks the link between the treatment and any confound); **natural experiments** or **instrumental variables** when randomization isn't possible; or **regression with confound controls** plus a healthy dose of humility about unmeasured confounders. The phrase interviewers want to hear is some version of "I'd want to randomize this if at all possible, and if I can't, here's the specific confound I'd worry about and how I'd try to control for it" — not just the famous aphorism on its own.

### Explain the bias-variance tradeoff.

**Bias** is the error from a model being too simple to capture the true pattern — it makes systematically wrong assumptions regardless of how much data you give it. **Variance** is the error from a model being too sensitive to the specific training data it happened to see — it would produce wildly different predictions if you retrained it on a slightly different sample. A useful analogy that lands well out loud: imagine throwing darts at a board. High bias, low variance is a tight cluster of darts that's consistently off-center — you're consistently wrong in the same way. Low bias, high variance is darts scattered all over the board but centered on the bullseye on average — you're right "on average" but wildly inconsistent on any single throw. The model you actually want lands a tight cluster *on* the bullseye, which in practice means deliberately trading a little bias for a meaningful reduction in variance, or vice versa, rather than chasing zero of either.

In model terms: a linear regression on a genuinely nonlinear relationship has high bias (it can't represent the curve no matter how much data you feed it) but low variance (it barely changes between training runs). A deep, unconstrained decision tree has low bias (it can fit almost any training set near-perfectly) but high variance (a slightly different training sample produces a very different tree). **Total error decomposes as bias² + variance + irreducible noise**, and the practical job of model selection, regularization, and cross-validation — all covered below — is finding the point on that curve where total error is lowest, not where either term individually hits zero.

![Data scientist interview topics — statistics, ML, case studies, and explaining findings clearly](/assets/blog/pool-feedback-report.webp)

Data science interviews reward technical correctness *and* the ability to explain a finding in plain language — most candidates over-index on the first and skip the second.

## Supervised vs unsupervised learning, classification vs regression

### What's the difference between supervised and unsupervised learning?

**Supervised learning** trains a model on labeled examples — input features paired with a known correct output — and the model learns to map from one to the other. Spam classification, churn prediction, and house-price estimation are all supervised: you have historical examples with a known answer attached. **Unsupervised learning** works on unlabeled data and looks for structure without being told what the "right answer" is — clustering customers into segments, reducing dimensionality for visualization, or detecting anomalies that don't match any known pattern. The interview follow-up that separates real understanding from memorized definitions: "when would you reach for unsupervised learning even though you technically *could* get labels?" — the honest answer is usually cost (labeling is expensive or slow), or genuine exploratory uncertainty about what categories even exist in the data before you've looked.

### What's the difference between classification and regression?

Both are supervised learning, split by what the target variable looks like. **Classification** predicts a discrete category — will this user churn (yes/no), which of five support-ticket categories does this message belong to. **Regression** predicts a continuous numeric value — how much revenue will this customer generate next quarter, what will tomorrow's demand be. The same underlying algorithm family often supports both with a different final layer (logistic regression for classification, linear regression for the continuous case; the same is true for decision trees, random forests, and gradient boosting, which all have classifier and regressor variants). The trap interviewers watch for: using a regression metric to evaluate a classifier, or treating an ordinal classification problem (low/medium/high) as either pure regression or pure unordered classification, when it's neither — that's a genuine modeling judgment call worth naming out loud.

### How do you decide which model to use — logistic regression, random forest, or gradient boosting?

This question isn't really asking you to rank algorithms — it's asking whether you reason about tradeoffs or just reach for whatever's fashionable. **Logistic regression** is the right starting point when you need interpretability (a coefficient you can explain to a regulator or a non-technical stakeholder — "each additional support ticket increases churn odds by X%"), when the relationship between features and the target is roughly linear in log-odds, when you have limited data, or when inference speed at scale matters more than squeezing out the last few points of accuracy. **Random forests** handle nonlinear relationships and feature interactions without manual engineering, are robust to outliers and irrelevant features, and need very little tuning to get a solid baseline — a good default when you want strong performance fast and don't need full interpretability. **Gradient boosting** (XGBoost, LightGBM, CatBoost) typically wins on tabular-data leaderboards and squeezes out the most accuracy, at the cost of more careful hyperparameter tuning, longer training time, and a real risk of overfitting if you're not validating carefully.

The answer that actually impresses an interviewer is a decision framework, not a single favorite: "I'd start with logistic regression as a baseline to understand the signal and check interpretability requirements, try a random forest to see if nonlinear interactions matter, and reach for gradient boosting only if the accuracy gain is worth the added tuning and explainability cost for this specific use case" — naming the real tradeoff (accuracy vs. interpretability vs. development time) rather than declaring one algorithm objectively "best."

## Model evaluation: precision, recall, F1, ROC/AUC, and when accuracy lies to you

This is where the upside-down confusion matrix from the cold open actually bites people. Get the matrix backwards and every metric you compute from it is backwards too — which is exactly why interviewers ask you to draw it rather than just define precision and recall in the abstract.

### Draw and explain a confusion matrix.

A confusion matrix lays out predicted vs. actual class for a binary classifier:

```text
                     Predicted: Positive   Predicted: Negative
Actual: Positive     True Positive (TP)    False Negative (FN)
Actual: Negative     False Positive (FP)   True Negative (TN)
```

**True Positive (TP)**: model correctly predicted positive. **False Negative (FN)**: model missed a real positive (predicted negative, actually positive). **False Positive (FP)**: model wrongly flagged a negative as positive. **True Negative (TN)**: model correctly predicted negative. Every downstream metric — precision, recall, F1, accuracy — is just arithmetic on these four cells, so memorizing the *formulas* without being able to place a result in the right cell, under pressure, is exactly the failure mode from the cold open.
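If it helps to drill the cell placement, the following sketch counts the four cells from paired labels and prints them in the same layout as the matrix above. The function name and the toy labels are assumptions made for illustration.

```python
def confusion_counts(actual, predicted, positive=1):
    """Return (TP, FN, FP, TN) for a binary problem."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1          # predicted positive, actually positive
        elif a == positive:
            fn += 1          # predicted negative, actually positive (a miss)
        elif p == positive:
            fp += 1          # predicted positive, actually negative (a false alarm)
        else:
            tn += 1          # predicted negative, actually negative
    return tp, fn, fp, tn

# Toy labels: 1 = positive class, 0 = negative class.
actual    = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]
tp, fn, fp, tn = confusion_counts(actual, predicted)

print("                  Pred: Pos  Pred: Neg")
print(f"Actual: Positive  {tp:^9}  {fn:^9}")
print(f"Actual: Negative  {fp:^9}  {tn:^9}")
```

Note that some libraries and textbooks order the rows and columns differently (negative class first), so it is worth stating your layout out loud before you start filling in cells.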

### Explain precision, recall, and F1 — and when you'd prioritize each.

**Precision** = TP / (TP + FP) — of everything the model flagged as positive, what fraction was actually positive. **Recall** (sensitivity) = TP / (TP + FN) — of everything that was actually positive, what fraction did the model catch. **F1 score** = 2 × (precision × recall) / (precision + recall) — the harmonic mean of the two, useful when you want one number balancing both.

Concrete worked numbers: a fraud model flags 50 transactions as fraudulent. Of those, 30 are genuinely fraudulent (TP = 30, FP = 20) — precision = 30/50 = 60%. The dataset actually contained 40 fraudulent transactions total, so the model missed 10 (FN = 10) — recall = 30/40 = 75%. F1 = 2 × (0.6 × 0.75) / (0.6 + 0.75) ≈ 0.667.
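The same numbers as a small Python sketch, in case you want to check them or swap in your own counts. The helper name is hypothetical, and the zero-denominator guards are a convenience choice rather than a standard.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Fraud example: 50 flagged, 30 truly fraudulent, 10 frauds missed.
precision, recall, f1 = precision_recall_f1(tp=30, fp=20, fn=10)
print(f"precision = {precision:.2f}, recall = {recall:.2f}, F1 = {f1:.3f}")
```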

Which one you optimize for is a business decision, not a math decision: a cancer-screening model should prioritize **recall** — missing a real case (a false negative) is far costlier than a false alarm that gets ruled out by a follow-up test. A spam filter should weight **precision** more heavily — wrongly burying a real client email in spam (a false positive) is often worse than letting one extra spam message through. Naming this tradeoff explicitly, tied to the specific cost asymmetry of the scenario you're given, is the actual signal — reciting the formulas is table stakes.

### What is the ROC curve and AUC, and how do they differ from precision/recall?

The **ROC curve** plots the true positive rate (recall) against the false positive rate at every possible classification threshold, and **AUC** (area under that curve) summarizes overall ranking quality into one number from 0.5 (no better than random) to 1.0 (perfect separation). The key thing ROC/AUC capture that a single precision/recall pair doesn't: they evaluate the model across *all* thresholds at once, which is useful when you haven't yet decided where to set the cutoff, or when you want to compare two models' ranking ability independent of any specific threshold choice.

The honest limitation interviewers want you to know: **AUC can be badly misleading on imbalanced datasets**, because the false-positive-rate axis is computed against a large true-negative pool, so it can look deceptively high even when precision (which cares about the composition of *positive predictions specifically*) is poor. For a 1%-positive-rate fraud dataset, a **precision-recall curve** is usually the more honest diagnostic than ROC/AUC — this single fact, stated unprompted, is a strong signal that you've actually worked with imbalanced real data rather than only the balanced toy datasets that show up in tutorials.
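Here is a deliberately contrived sketch of that failure mode. The scores are invented, not from a real model: 10 frauds and 990 legitimate transactions, where a small slice of legitimate transactions happens to score very high. AUC is computed directly from its ranking definition (the chance that a random positive outranks a random negative), and precision is taken at an assumed threshold of 0.5.

```python
def roc_auc(pos_scores, neg_scores):
    """AUC as the chance a random positive outranks a random negative (ties count half)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Made-up scores for a 1%-positive dataset.
pos_scores = [0.8] * 10                    # 10 fraudulent transactions
neg_scores = [0.1] * 950 + [0.9] * 40      # 990 legitimate, 40 of which score very high

auc = roc_auc(pos_scores, neg_scores)

threshold = 0.5
tp = sum(s >= threshold for s in pos_scores)
fp = sum(s >= threshold for s in neg_scores)
precision = tp / (tp + fp)

print(f"AUC = {auc:.3f}, precision at {threshold} = {precision:.2f}")
```

An AUC around 0.96 sounds excellent, yet only one in five flagged transactions is actually fraud. The 40 false positives are a tiny share of 990 negatives, so the false positive rate barely moves, while they swamp the 10 true positives in the precision calculation.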

### Why is accuracy a misleading metric for imbalanced classes?

If 99% of transactions are legitimate and 1% are fraudulent, a model that predicts "legitimate" for every single transaction scores **99% accuracy** while catching exactly zero fraud: a completely useless model that looks, by the most commonly reported metric, like a near-perfect one. This is one of the most common trap questions in data science interviews precisely because it's so easy to fall into without noticing. Accuracy weights every prediction equally, so the majority class dominates the score, and real-world classification problems (fraud, churn, disease screening, rare-defect detection) are almost never balanced. The fix isn't a different formula. It's reaching for precision, recall, F1, or a precision-recall curve instead, and, on the model-building side, techniques like class weighting, resampling (SMOTE, undersampling the majority class), or adjusting the decision threshold to reflect the real cost asymmetry between error types.
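The trap is easy to reproduce in a few lines. This sketch assumes a toy dataset of 1,000 transactions with a 1% fraud rate and a "model" that never predicts fraud.

```python
actual = [1] * 10 + [0] * 990       # 1 = fraud, 0 = legitimate (1% positive rate)
predicted = [0] * len(actual)       # always predicts "legitimate"

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
caught = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
recall = caught / sum(actual)

print(f"accuracy = {accuracy:.2%}, recall on fraud = {recall:.2%}")
```

Reporting recall next to accuracy is usually enough to expose the problem, which is why it is a good habit to quote at least two metrics for any imbalanced classifier.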

### What metrics do you use for a regression problem instead?

Regression doesn't have a confusion matrix, so the standard trio is different: **RMSE** (root mean squared error) penalizes large errors disproportionately because errors are squared before averaging, making it sensitive to outliers — appropriate when big misses are especially costly. **MAE** (mean absolute error) treats all errors linearly and is more robust to outliers — appropriate when you want a metric that isn't dominated by a few extreme cases. **R²** (coefficient of determination) reports the proportion of variance in the target explained by the model, from 0 to 1 (or negative, if the model is worse than just predicting the mean) — useful for communicating "how much better than a naive baseline" to a non-technical audience, but it can be inflated by adding irrelevant features, which is why **adjusted R²** exists for comparing models with different numbers of predictors.
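A small sketch shows how differently the three metrics react to one bad prediction. The data is invented, with a single large miss on the last point, and the helper name is an assumption for illustration.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R^2) for paired true and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)              # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_y = sum(y_true) / n
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)  # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return mae, rmse, r2

y_true = [10.0, 12.0, 14.0, 16.0, 18.0]
y_pred = [11.0, 12.0, 13.0, 16.0, 28.0]   # one large miss on the last point

mae, rmse, r2 = regression_metrics(y_true, y_pred)
print(f"MAE = {mae:.2f}, RMSE = {rmse:.2f}, R^2 = {r2:.2f}")
```

The single outlier pushes RMSE to nearly double the MAE and drags R² below zero, meaning these predictions do worse than always guessing the mean of `y_true`.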

## Feature engineering: the part interviewers say matters more than model choice

A genuinely senior signal in a data science interview is volunteering, unprompted, some version of: "honestly, for this problem I'd spend more time on feature engineering than on which model I pick." Practical experience with tabular problems tends to bear this out: the jump from a mediocre feature set to a good one often beats the jump from a mediocre model to a state-of-the-art one. Interviewers know this, so they listen for whether you reach for feature work before reaching for a fancier algorithm.

### How do you encode categorical variables?

**One-hot encoding** creates a binary column per category — clean and lossless for low-cardinality features (a handful of categories), but it explodes column count and creates sparse, mostly-zero data for high-cardinality features (thousands of unique values, like zip codes or user IDs). **Label/ordinal encoding** maps categories to integers — appropriate only when there's a genuine order (small/medium/large), and actively misleading otherwise, since it implies a numeric relationship (small < medium < large, with equal spacing) that doesn't exist for, say, city names. **Target encoding** replaces a category with a statistic of the target variable for that category (e.g., the historical churn rate for that city) — powerful for high-cardinality features, but it leaks target information into the features if you're not careful to compute it only on training folds, which is exactly the kind of leakage the case study below digs into.
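To make the leakage point concrete, here is a minimal target-encoding sketch in which the category statistics are learned from training rows only and then applied to held-out rows. The city names, churn labels, and function names are all made up, and the fallback to the global training mean for unseen categories is one reasonable choice among several (smoothing toward the global mean is another).

```python
from collections import defaultdict

def fit_target_encoding(categories, targets):
    """Learn per-category target means from TRAINING rows only."""
    sums, counts = defaultdict(float), defaultdict(int)
    for c, t in zip(categories, targets):
        sums[c] += t
        counts[c] += 1
    global_mean = sum(targets) / len(targets)
    mapping = {c: sums[c] / counts[c] for c in sums}
    return mapping, global_mean

def apply_target_encoding(categories, mapping, global_mean):
    # Categories never seen in training fall back to the global training mean.
    return [mapping.get(c, global_mean) for c in categories]

train_city = ["Pune", "Pune", "Delhi", "Delhi", "Delhi", "Goa"]
train_churn = [1, 0, 1, 1, 0, 0]
mapping, global_mean = fit_target_encoding(train_city, train_churn)

valid_city = ["Delhi", "Pune", "Mumbai"]   # "Mumbai" is unseen in training
encoded = apply_target_encoding(valid_city, mapping, global_mean)
print(encoded)
```

Inside cross-validation, the fit step would be repeated on each fold's training portion, so that no validation row ever contributes to its own encoded value.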

### How do you scale or normalize features, and when does it matter?

**Standardization** (z-score: subtract the mean, divide by standard deviation) centers data at 0 with unit variance; **min-max normalization** rescales to a fixed range, usually [0, 1]. Scaling matters for any algorithm that computes distances or uses gradient descent — k-nearest neighbors, k-means clustering, SVMs, logistic regression, and neural networks all behave badly or train slowly if one feature ranges 0–1 and another ranges 0–1,000,000, because the larger-magnitude feature dominates the distance calculation or gradient regardless of its actual predictive value. It matters far less for tree-based models (random forests, gradient boosting), which split on thresholds per feature independently and are invariant to monotonic rescaling — a detail interviewers specifically probe to check you understand *why* scaling matters rather than applying it as a reflexive ritual on every project.
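Both transforms are short enough to write by hand, as in the sketch below. The income values are invented, and the population standard deviation is used here as an assumption; libraries differ on that default. In a real pipeline the mean, standard deviation, minimum, and maximum would be learned on the training split and reused on validation and test data.

```python
import statistics

def standardize(values):
    """Z-score: subtract the mean, divide by the (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def min_max(values):
    """Rescale linearly so the smallest value maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

income = [20_000, 40_000, 60_000, 80_000, 100_000]
z_scores = standardize(income)
scaled = min_max(income)

print([round(z, 3) for z in z_scores])
print(scaled)
```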

### How do you handle missing data?

There's no single correct answer — the right approach depends on *why* the data is missing, which is the first thing to say out loud rather than jumping straight to a technique. Missing **completely at random** (a sensor glitch unrelated to anything) is the easiest case; missing **at random** conditional on other observed features (younger users skip an income field more often) is more common and trickier; missing **not at random** (people with very high or very low incomes disproportionately skip the income question) is the dangerous case, because the *fact* that it's missing carries information you'll destroy if you handle it naively.

Common techniques: **deletion** (drop rows or columns) is simplest but wastes data and can bias results if missingness isn't random. **Mean/median/mode imputation** is fast but shrinks variance and can distort relationships between features. **Model-based imputation** (predicting the missing value from other features, e.g., via k-NN or a regression) preserves more signal but is more work and can leak information if not validated carefully. **Adding a "was this missing" indicator flag** alongside an imputed value is a strong default precisely because it preserves the information that something was missing, which — especially in the "missing not at random" case — is often itself predictive.
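The indicator-flag approach can be sketched in a few lines. This example assumes missing values are represented as `None` and uses median imputation; the income figures and the function name are illustrative. As with scaling, the median would normally be computed on training data only.

```python
import statistics

def impute_with_flag(values):
    """Median-impute None entries and return a parallel 0/1 'was missing' flag."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    filled = [median if v is None else v for v in values]
    was_missing = [1 if v is None else 0 for v in values]
    return filled, was_missing

income = [52_000, None, 61_000, 48_000, None, 75_000]
filled, was_missing = impute_with_flag(income)

print(filled)
print(was_missing)
```

The model then receives two columns, the filled value and the flag, so it can learn a separate effect for "income was not reported" instead of treating those rows as typical earners.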

### What is feature selection, and how do you do it?

Feature selection narrows a feature set down to the ones that actually help, for three real reasons: faster training and inference, reduced overfitting risk (fewer parameters to spuriously fit to noise), and better interpretability. **Filter methods** (correlation with the target, chi-squared tests, mutual information) score each feature independently of any model — fast, but they miss feature *interactions*. **Wrapper methods** (recursive feature elimination, forward/backward selection) train a model repeatedly with different feature subsets and keep what improves performance — more accurate but computationally expensive. **Embedded methods** (L1/Lasso regularization, tree-based feature importance) bake selection into the model-training process itself — usually the best practical tradeoff, since you get selection "for free" as a byproduct of training a model you needed anyway.

### Why does feature engineering often matter more than model choice?

Because most of a model's predictive power comes from whether the *information it needs* is actually present and well-represented in the input, not from the specific algorithm doing the fitting. A logistic regression with a well-engineered "days since last purchase" feature will usually beat a tuned gradient-boosted tree fed only raw, unprocessed timestamps, because the tree has to work much harder to rediscover a relationship a human could hand it directly. This is also why interviewers ask case-study questions that hinge on *what features you'd create*, not which algorithm you'd pick. The modeling step is often the easy part once the feature set is right, and a textbook like James, Witten, Hastie, and Tibshirani's "An Introduction to Statistical Learning" spends as much time on this representation problem as it does on any specific algorithm.

## Overfitting, regularization, and validation

### How do you detect and handle overfitting?

**Overfitting** is when a model learns the noise and idiosyncrasies of the training set rather than the generalizable pattern — it scores great on training data and noticeably worse on anything new. You detect it by comparing training performance to validation/test performance: a large gap (high training accuracy, much lower validation accuracy) is the classic signature. You handle it through **regularization** (penalizing model complexity directly), **cross-validation** (so your performance estimate isn't fooled by a lucky single split), simplifying the model (fewer features, shallower trees, fewer parameters), gathering more training data, or early stopping (halting training once validation performance stops improving, even if training performance keeps climbing).

### Explain L1 vs L2 regularization.

Both add a penalty term to the loss function based on the size of the model's coefficients, discouraging overly large weights that fit noise. **L1 (Lasso)** adds the sum of *absolute values* of coefficients — its geometry tends to push some coefficients to exactly zero, which makes it a built-in feature-selection mechanism (it effectively decides some features aren't worth keeping). **L2 (Ridge)** adds the sum of *squared* coefficients — it shrinks all coefficients toward zero smoothly but rarely to exactly zero, which is preferable when you believe most features carry at least some signal and don't want to discard any outright. **Elastic Net** blends both, which is the practical default when you're unsure which behavior you want. The interview-grade detail to volunteer: L1's "exact zero" behavior comes from the geometry of its constraint region having corners (a diamond in 2D) where the optimum is more likely to land exactly on an axis, while L2's constraint region (a circle) has no corners to land on.
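The "exactly zero versus merely smaller" difference shows up clearly in the simplest possible setting. The sketch below assumes a special case, a single coefficient (or, equivalently, orthonormal features), where both penalized problems have closed-form solutions: soft-thresholding for L1 and proportional shrinkage for L2. The starting coefficients and the penalty strength are invented, and real solvers handle the general correlated-feature case iteratively.

```python
def lasso_shrink(b, lam):
    """Soft-thresholding: the w that minimizes 0.5 * (b - w)**2 + lam * abs(w)."""
    if b > lam:
        return b - lam
    if b < -lam:
        return b + lam
    return 0.0   # anything within [-lam, lam] is set exactly to zero

def ridge_shrink(b, lam):
    """The w that minimizes 0.5 * (b - w)**2 + 0.5 * lam * w**2."""
    return b / (1 + lam)

unpenalized = [3.0, 0.4, -0.2, -1.5]   # hypothetical least-squares coefficients
lam = 0.5

l1 = [lasso_shrink(b, lam) for b in unpenalized]
l2 = [ridge_shrink(b, lam) for b in unpenalized]

print("L1:", l1)
print("L2:", [round(w, 3) for w in l2])
```

L1 zeroes out the two small coefficients and subtracts a fixed amount from the large ones, while L2 scales every coefficient by the same factor and leaves none at zero.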

### Explain cross-validation, and write a simple k-fold example.

**Cross-validation** estimates how a model will perform on unseen data by training and evaluating it on multiple different splits of the available data, rather than trusting a single train/test split that might be lucky or unlucky. **K-fold cross-validation** splits the data into k roughly equal parts, trains on k−1 of them, validates on the held-out fold, and repeats k times so every fold serves as the validation set exactly once — then averages the k scores for a more stable performance estimate.

```python
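import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Placeholder data and model so the example runs end to end; swap in your own.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
model = LinearRegression()

kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = []

for train_idx, val_idx in kf.split(X):
    X_train, X_val = X[train_idx], X[val_idx]
    y_train, y_val = y[train_idx], y[val_idx]

    model.fit(X_train, y_train)  # fit only on this fold's training rows
    preds = model.predict(X_val)
    # Store per-fold RMSE so the average below really is a mean of RMSEs.
    scores.append(np.sqrt(mean_squared_error(y_val, preds)))

print(f"Mean CV RMSE: {np.mean(scores):.3f}")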
```

The follow-up interviewers love: "what if your data has a time dimension?" Standard k-fold shuffles randomly, which for time-series data lets the model "see the future" during training and validate on the past — a subtle leakage bug. The fix is **time-series cross-validation** (also called walk-forward validation), where each fold's training data only ever precedes its validation data chronologically.
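A walk-forward split is simple enough to sketch by hand. The generator below is a hypothetical helper, not a library function, and it assumes rows are already sorted by time with an expanding training window. scikit-learn ships a `TimeSeriesSplit` class in `sklearn.model_selection` that serves the same purpose.

```python
def walk_forward_splits(n_samples, n_splits, min_train):
    """Yield (train_indices, val_indices) with training always before validation."""
    fold_size = (n_samples - min_train) // n_splits
    for i in range(n_splits):
        train_end = min_train + i * fold_size
        val_end = train_end + fold_size
        # Training window expands each fold; validation is the next block in time.
        yield list(range(train_end)), list(range(train_end, val_end))

splits = list(walk_forward_splits(n_samples=10, n_splits=3, min_train=4))
for train_idx, val_idx in splits:
    print(f"train: {train_idx}  validate: {val_idx}")
```

Every validation index is later than every training index in the same fold, which is the property random shuffling breaks.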

### What's the difference between train, validation, and test splits — and why do you need all three?

The **training set** fits the model's parameters. The **validation set** is used to tune hyperparameters and make model-selection decisions during development. Crucially, the model is also indirectly fit to it, because you keep adjusting choices based on validation performance. The **test set** is touched exactly once, at the very end, to report an honest, unbiased estimate of real-world performance. If you ever use test-set performance to inform a decision and then go back and tweak something, it's no longer a valid test set; it's become a second validation set, and you've lost your one chance to know how the model actually performs on genuinely unseen data. This three-way split (or cross-validation in place of a single validation set, with the test set still held out separately) is the answer to give when an interviewer asks "what's wrong with just using train/test": a two-way split tempts you into using the test set for tuning, which silently inflates your reported performance.
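A three-way split can be sketched with two calls to `train_test_split`. The 60/20/20 proportions and the toy arrays below are assumptions for illustration; for time-ordered data you would split by date instead of at random.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 1,000 rows with a balanced binary label.
X = np.arange(1000).reshape(-1, 1)
y = np.arange(1000) % 2

# Step 1: carve off the test set and do not look at it again until the end.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Step 2: split the remaining development data into train and validation.
# 0.25 of the remaining 80% gives a 20% validation set overall.
X_train, X_val, y_train, y_val = train_test_split(
    X_dev, y_dev, test_size=0.25, random_state=0, stratify=y_dev
)

print(len(X_train), len(X_val), len(X_test))
```

Splitting the test set off first makes it harder to tune against it by accident, since everything downstream only ever sees the development portion.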

### What is a learning curve, and how do you use it to diagnose a model?

A **learning curve** plots model performance (often error or score) against training set size, with separate lines for training and validation performance. The shape diagnoses the problem: if both curves converge to a high error (poor performance) and are close together, the model is **underfitting** — more data won't help; you need a more expressive model or better features. If the training curve is low but the validation curve stays well above it with a persistent gap, the model is **overfitting** — more data, regularization, or a simpler model are the right levers. If both curves are converging nicely and the gap is closing as training size grows, more data is likely to keep helping. Reading a learning curve correctly, live, is a strong signal because it requires actually understanding bias and variance rather than just naming them.
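The numbers behind such a plot can be produced with scikit-learn's `learning_curve` helper. This is a sketch on synthetic data; the estimator, the five training sizes, and the accuracy metric are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(
    n_samples=600, n_features=20, n_informative=5, random_state=0
)

# Fit the model on growing subsets of the training folds and score each one.
train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000),
    X,
    y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5,
    scoring="accuracy",
)

# Average across the CV folds; the train/validation gap is the overfitting signal.
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)
for n, tr, va in zip(train_sizes, train_mean, val_mean):
    print(f"n={int(n):4d}  train={tr:.3f}  val={va:.3f}  gap={tr - va:.3f}")
```

A gap that shrinks as `n` grows suggests more data is still helping; two scores that sit close together at a poor level point toward underfitting instead.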

## SQL, Python, and cleaning messy data

Data scientist interviews rarely go as deep on pure programming as a software-engineering loop, but a baseline is non-negotiable, and "I'm more of a stats person" is not an acceptable answer to a basic SQL question in 2026.

### What SQL should you actually know cold?

`JOIN` types (inner, left, right, full) and when each changes your row count; `GROUP BY` with aggregates (`COUNT`, `SUM`, `AVG`); window functions (`ROW_NUMBER()`, `RANK()`, `DENSE_RANK()`, `LAG()`/`LEAD()`) for things like "find each customer's most recent order" or "compute a running total" without a self-join; and `HAVING` vs `WHERE` (the former filters after aggregation, the latter before). A genuinely common interview question: "find the second-highest salary per department." A clean answer uses `DENSE_RANK()` partitioned by department and ordered by salary descending. `RANK()` is the trap here: it leaves a gap after a tie for first place (1, 1, 3), so filtering on rank 2 returns nothing. A nested subquery with `LIMIT 1 OFFSET 1` also breaks silently on ties unless you add `DISTINCT`, because it returns the top salary a second time. For the full pattern set, see our dedicated [SQL interview questions](/blog/sql-interview-questions) guide.
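The second-highest-salary pattern can be tried without a database server by using Python's built-in `sqlite3` module. The sketch below assumes a SQLite build with window-function support (3.25 or later); the `employees` table and its rows are made up for illustration, with a deliberate tie at the top of one department.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        ("Asha", "eng", 200),
        ("Ben", "eng", 200),   # tie for the top salary in eng
        ("Chen", "eng", 150),
        ("Dee", "ops", 120),
        ("Eli", "ops", 100),
    ],
)

# DENSE_RANK gives tied rows the same rank without leaving a gap afterwards.
query = """
SELECT department, name, salary
FROM (
    SELECT department, name, salary,
           DENSE_RANK() OVER (
               PARTITION BY department ORDER BY salary DESC
           ) AS salary_rank
    FROM employees
) AS ranked
WHERE salary_rank = 2
ORDER BY department
"""
second_highest = conn.execute(query).fetchall()
print(second_highest)
```

Swapping `DENSE_RANK()` for `RANK()` in the same query drops the `eng` row, because the two tied top earners consume ranks 1 and 2 and the next salary is ranked 3.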

### What Python/pandas/numpy fluency do interviewers expect?

Comfortable, fluent use of `pandas` for filtering, grouping (`groupby().agg()`), merging dataframes, and handling missing values (`fillna`, `dropna`) without needing to look up syntax; `numpy` for vectorized array operations instead of Python loops, because looping over a pandas column is a real performance red flag interviewers notice immediately in a live coding round. The classic live-coding ask: "given this dataframe of transactions, find the top 3 customers by total spend in the last 30 days" — testing date filtering, grouping, sorting, and slicing in a few lines, not algorithmic cleverness. For deeper language-level questions, see our [Python interview questions](/blog/python-interview-questions) guide.
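The classic ask above fits in a few lines of pandas. This is one possible sketch; the `transactions` frame, its column names, and the fixed `as_of` date are assumptions so that the example is reproducible.

```python
import pandas as pd

transactions = pd.DataFrame({
    "customer_id": ["a", "b", "a", "c", "d", "b", "e"],
    "amount": [50.0, 20.0, 30.0, 200.0, 10.0, 70.0, 500.0],
    "date": pd.to_datetime([
        "2026-06-01", "2026-06-05", "2026-06-10", "2026-06-12",
        "2026-06-15", "2026-06-18", "2026-04-01",
    ]),
})

# Fixed reference date; in production this would usually be "today".
as_of = pd.Timestamp("2026-06-20")

# Date filter: keep only the last 30 days (customer "e" falls outside the window).
recent = transactions[transactions["date"] > as_of - pd.Timedelta(days=30)]

# Group, aggregate, and take the three largest totals in one vectorized chain.
top3 = recent.groupby("customer_id")["amount"].sum().nlargest(3)
print(top3)
```

It is worth saying out loud why customer `e` is excluded despite the largest single purchase: the filter runs before the aggregation, which is exactly the ordering the question is probing.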

### How do you clean and prepare a genuinely messy dataset?

Walk through it as a process, because that's what's being tested, not any single trick: **profile first** (check dtypes, null counts, value ranges, and cardinality per column before touching anything); **handle duplicates** (exact and near-duplicate rows, which are sneaky when a key has trailing whitespace or inconsistent casing — `"NYC"` vs `"nyc"` vs `"NYC "` look like three different cities to a naive `groupby`); **standardize formats** (dates in three different string formats in the same column is extremely common in real exports); **handle outliers** deliberately rather than reflexively — decide if an extreme value is a data-entry error to fix, a genuine rare event to keep, or a value to cap/winsorize, and say which and why; and **validate against domain logic** (a negative age, a signup date after a churn date) catches errors that purely statistical checks miss. The signal interviewers want isn't a list of pandas functions — it's evidence you've actually been burned by messy data before and built habits in response.
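A few of these habits can be sketched in pandas. The tiny `raw` frame below is invented to reproduce the `"NYC"` casing and whitespace problem from the text, and the age bounds are an assumed domain rule rather than a universal one.

```python
import pandas as pd

raw = pd.DataFrame({
    "city": ["NYC", "nyc", "NYC ", "Boston"],
    "customer_id": [1, 1, 1, 2],
    "age": [34, 34, 34, -5],
})

# Profile first: dtypes, null counts, and cardinality before changing anything.
profile = pd.DataFrame({
    "dtype": raw.dtypes.astype(str),
    "nulls": raw.isna().sum(),
    "unique": raw.nunique(),
})
print(profile)

# Standardize the key, then drop rows that only differed by casing or whitespace.
clean = raw.copy()
clean["city"] = clean["city"].str.strip().str.upper()
clean = clean.drop_duplicates()

# Validate against domain logic (assumed rule: age must fall between 0 and 120).
invalid_age = clean[(clean["age"] < 0) | (clean["age"] > 120)]
print(clean)
print(invalid_age)
```

The profile reports four distinct cities before cleaning and two after, and the domain check flags the negative age for a deliberate decision instead of silently dropping it.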

## The case-study round, worked end to end: "build a model to predict customer churn"

This is where data science interviews are genuinely won or lost, and it deserves the same step-by-step treatment the frontend "design a typeahead search" round gets — because the skeleton is reusable across almost any open-ended business prompt you'll get ("how would you measure the success of a new feature," "sales dropped 20%, investigate," "design an experiment to test X").

**The prompt:** "We want to predict which customers will churn next month so the retention team can intervene early. Walk me through how you'd build this."

**1. Clarify the problem before touching data.** What does "churn" actually mean here — cancellation, or just inactivity for N days? What's the business action once we have a prediction (a discount offer, a retention call), and what does that imply about the cost of a false positive (wasted outreach) vs. a false negative (a customer we lose without trying)? What's the prediction horizon — predicting churn 30 days out is a different, more useful problem than predicting it the day it happens. A candidate who jumps straight to "I'd use XGBoost" without asking any of this has skipped the actual hard part of the question.

**2. Define the target variable carefully — and watch for leakage.** Churn labels are deceptively easy to get wrong. If "churned" is defined using data from *after* the prediction point (e.g., a "cancellation request" field that's only populated once a customer has already decided to leave), you've built a model that "predicts" something that already happened — **target leakage**, and it's the single most common way take-home data science exercises get quietly failed. The fix is being explicit about the **prediction point in time**: only use features known *as of* that date, and construct the label from outcomes strictly *after* it.

**3. Feature engineering, not just feature listing.** Recency/frequency/monetary (RFM) style features (days since last purchase, purchase frequency over the last 90 days, average order value trend), engagement signals (support tickets opened, login frequency trend — is it declining?), and contract/plan features (time since signup, plan tier, recent downgrades). The strongest answers here generate *derived* features (a slope of declining engagement over the last 60 days, not just a single snapshot number) rather than only raw columns — this is the feature-engineering-matters-more-than-model-choice point from earlier, applied live.

**4. Handle the realistic class imbalance.** Churn is usually a minority class — maybe 5% of customers churn in a given month. State this unprompted: you'd evaluate with precision/recall/F1 and a precision-recall curve, not accuracy; you might use class weighting or resampling during training; and you'd calibrate the final decision threshold against the actual cost asymmetry the business cares about (an outreach call is cheap, so you might deliberately accept lower precision for higher recall).

**5. Model selection, justified.** Start with logistic regression as an interpretable baseline — the retention team will likely want to know *why* a customer is flagged, not just that they are, so a model that can produce "this customer is flagged primarily due to declining login frequency and a recent downgrade" has real operational value beyond raw accuracy. Then try a gradient-boosted tree to see how much accuracy you're leaving on the table, and use SHAP values or feature importances to keep some interpretability even with the more complex model.

**6. Validate honestly.** Time-based train/validation/test splits, not random shuffling — you want to simulate genuinely predicting the future from the past, the same leakage concern as before but now at the validation-strategy level, not just the feature level.

**7. Close the loop back to the business.** State how you'd measure success post-launch — not just offline AUC, but a controlled rollout (hold out a control group that gets no intervention) so you can actually attribute retained customers to the model-driven outreach rather than to the campaign you'd have run anyway. This last step is frequently the one candidates skip entirely, and it's often exactly what separates a "technically fine" answer from a "this person has actually shipped a model before" answer.
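The prediction-point discipline from steps 2, 3, and 6 can be sketched in pandas. Everything here is an illustrative assumption: the `events` table, the cutoff date, the 30-day horizon, and the choice to define churn as "no activity in the horizon."

```python
import pandas as pd

events = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2, 3],
    "event_date": pd.to_datetime([
        "2026-03-01", "2026-04-10", "2026-05-20",  # customer 1 is active after the cutoff
        "2026-02-15", "2026-04-25",                # customer 2 goes quiet
        "2026-04-01",                              # customer 3 goes quiet
    ]),
})

cutoff = pd.Timestamp("2026-05-01")           # the prediction point
horizon_end = cutoff + pd.Timedelta(days=30)  # assumed 30-day prediction horizon

# Features may only use what was known strictly before the cutoff.
past = events[events["event_date"] < cutoff]
# The label may only use outcomes on or after the cutoff.
future = events[(events["event_date"] >= cutoff) & (events["event_date"] < horizon_end)]

features = past.groupby("customer_id").agg(
    n_events=("event_date", "count"),
    last_event=("event_date", "max"),
)
features["days_since_last_event"] = (cutoff - features["last_event"]).dt.days
features = features.drop(columns="last_event")

# Assumed label definition: churned = no activity at all inside the horizon.
active_ids = future["customer_id"].unique()
features["churned"] = (~features.index.isin(active_ids)).astype(int)
print(features)
```

Keeping `past` and `future` as two separately filtered frames makes the leakage boundary visible in the code, and the same cutoff logic can be repeated at later dates to build time-based validation and test sets.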

<div class="verdict"><strong>The core truth:</strong> the case study isn't testing whether you can name the right algorithm — every candidate eventually gets to "gradient boosting" or "logistic regression." It's testing whether you ask the leakage question, the cost-asymmetry question, and the "how do we know it worked" question before anyone has to drag them out of you.</div>
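The cost-asymmetry question can be made concrete by choosing the decision threshold that minimizes total expected cost instead of defaulting to 0.5. The sketch below uses simulated scores, and both cost figures are hypothetical placeholders that would come from the business in practice.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
y_true = rng.random(n) < 0.05  # roughly 5% churners

# Hypothetical model scores: churners tend to score higher than non-churners.
scores = np.where(y_true, rng.normal(0.6, 0.2, n), rng.normal(0.2, 0.15, n))
scores = np.clip(scores, 0.0, 1.0)

COST_FALSE_POSITIVE = 5.0    # assumed cost of one unnecessary outreach call
COST_FALSE_NEGATIVE = 100.0  # assumed value lost when a churner is missed


def total_cost(threshold: float) -> float:
    """Total cost of flagging every customer whose score is >= threshold."""
    flagged = scores >= threshold
    false_pos = np.sum(flagged & ~y_true)
    false_neg = np.sum(~flagged & y_true)
    return float(false_pos * COST_FALSE_POSITIVE + false_neg * COST_FALSE_NEGATIVE)


thresholds = np.linspace(0.05, 0.95, 19)
costs = np.array([total_cost(t) for t in thresholds])
best = thresholds[np.argmin(costs)]

print(f"Lowest-cost threshold: {best:.2f}")
print(f"Cost at that threshold: {costs.min():.0f}, cost at 0.5: {total_cost(0.5):.0f}")
```

With a missed churner assumed to cost twenty times as much as a wasted call, the lowest-cost threshold lands below 0.5, which is the "accept lower precision for higher recall" trade described in step 4.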

## How candidates actually prepare — and where each method falls short

Most data science candidates prepare with some mix of four approaches, and — much like the Adobe loop in our [interview-prep guides](/blog/data-analyst-interview-questions) — each one trains a different slice of the real interview, and almost everyone over-invests in the slice that's easiest to do alone.

**Kaggle-notebook grinding.** Genuinely useful for building modeling intuition and seeing what good feature engineering looks like on real datasets. The gap: a Kaggle leaderboard rewards squeezing out the last 0.3% of accuracy on a clean, pre-packaged dataset with a fixed, unambiguous target — it does not reward stopping to ask "wait, what does churn even mean here" or "is this feature leaking the label," because Kaggle has already answered both questions for you before you opened the notebook. You can have a respectable Kaggle rank and still freeze on "walk me through how you'd define the target variable" because you've never had to define one yourself.

**GeeksforGeeks-style "data scientist interview questions" dumps.** Useful for knowing what topics show up, and several questions in this guide will look familiar if you've browsed one. The risk is the same as in any technical field: a question list gives you the prompt and sometimes a model answer, but it doesn't simulate an interviewer interrupting your p-value explanation to ask "wait, what does that 0.05 threshold actually mean in plain English" the moment you've slipped into jargon — which is precisely the failure mode from this guide's cold open.

**A friend's WhatsApp PDF of "ML interview questions."** Calibration value is real — knowing that a friend got asked about regularization or a churn case study at a similar company tells you roughly what to expect. The limitation is equally real: it's one data point from one candidate's one loop, often transcribed from memory days later, occasionally already slightly wrong by the time it reaches you third-hand.

**Generic ChatGPT prompting for case studies.** Typing "give me a data science case study" into a chat window is better than nothing for drilling the *structure* of an answer, and it's genuinely good for generating practice prompts. But it's a text exchange with no spoken delivery pressure, and the follow-ups are exactly as sharp as the prompt you wrote yourself — which means it's easy to accidentally interview yourself gently. It won't catch you mispronouncing "heteroscedasticity" under pressure, and it won't interrupt your explanation the way a live interviewer reflexively does the moment you say something vague.

The honest throughline: every one of these methods trains *what* to say. None of them train *saying it out loud, live, with someone listening for the moment your explanation gets fuzzy and pressing exactly there*. That's the specific gap [Greenroom](/)'s spoken mock-interview format is built to close — you explain a p-value, walk through a churn case study, or defend a model choice out loud, the AI interviewer asks real follow-ups the way a human panel does (a constraint change, a "why not logistic regression instead" challenge, a request to handle the imbalanced-class wrinkle), and you get feedback on the clarity of your reasoning, not just whether your final answer was technically correct. It doesn't replace knowing the material — you still need the statistics fundamentals and the case-study reps — but it's the only one of these methods that rehearses the verbal, interrupted, defend-your-choice format the real interview actually is.

![A structured interview practice session showing question flow and follow-ups](/assets/blog/pool-structured-screen.webp)

Case-study rounds reward a visible structure — clarify, define the target, engineer features, validate, close the loop — rehearsed out loud, not read silently off a page.

## Putting it together: how to actually prepare

Build the statistics fundamentals until you can produce a worked example on demand, not just recite a definition — the p-value walkthrough above is exactly the kind of thing to practice saying out loud, with real numbers, until it doesn't wobble. Get comfortable with the evaluation-metric vocabulary deeply enough that you can place numbers correctly into a confusion matrix under mild pressure, not just define precision and recall in the abstract. Treat feature engineering as seriously as model choice, because interviewers increasingly do. Internalize the leakage instinct — "what does the model know, and when does it know it" — since it's the single fastest way to look senior in a case-study round. And rehearse the case-study skeleton (clarify → define the target carefully → engineer features → handle imbalance → select and justify a model → validate honestly → close the loop on measuring real-world success) until it's a reflex, not a checklist you're reading off mid-answer.

Then practice saying all of it out loud, with follow-ups, before the real interview does it for you the hard way.

## Frequently asked questions

### What questions are asked in a data scientist interview?

Data scientist interviews test statistics and probability (p-values and hypothesis testing, the Central Limit Theorem, Type I/II errors, confidence intervals, correlation vs causation, the bias-variance tradeoff), machine learning fundamentals (supervised vs unsupervised learning, classification vs regression, model selection reasoning), model evaluation (precision, recall, F1, ROC/AUC, why accuracy is misleading on imbalanced data), feature engineering and overfitting (regularization, cross-validation, train/validation/test splits), SQL and Python basics, messy-data cleaning, and an open-ended case study like building a churn-prediction model.

### What statistics should I know for a data science interview?

Know p-values and hypothesis testing well enough to walk through a worked calculation, the Central Limit Theorem and why it underpins most statistical tests, Type I vs Type II errors with a real cost-asymmetry example, what a confidence interval actually means (and the common backwards explanation to avoid), correlation vs causation with a concrete confound example, and the bias-variance tradeoff explained with a clear analogy. Interviewers want you to apply these to a specific scenario and explain them in plain language, not just recite definitions.

### How do precision, recall, and F1 differ, and why does accuracy mislead on imbalanced data?

Precision is the fraction of predicted positives that are actually positive; recall is the fraction of actual positives the model caught; F1 is their harmonic mean. Accuracy is misleading on imbalanced data because a model that always predicts the majority class can score very high accuracy while catching zero of the minority class — for example, 99% accuracy on a 1%-fraud dataset while missing every fraudulent transaction. Precision, recall, F1, and precision-recall curves are the honest metrics for imbalanced problems.
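The 1%-fraud example can be checked directly with scikit-learn's metric functions. The second set of predictions below is a hypothetical model, invented to show that a lower-accuracy model can be the more useful one.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# 1,000 transactions, 10 of them fraudulent (1%).
y_true = [1] * 10 + [0] * 990

# Baseline that always predicts "not fraud".
always_legit = [0] * 1000
baseline_accuracy = accuracy_score(y_true, always_legit)
baseline_recall = recall_score(y_true, always_legit)

# Hypothetical model: flags 20 transactions and catches 8 of the 10 frauds.
y_pred = [1] * 8 + [0] * 2 + [1] * 12 + [0] * 978
model_accuracy = accuracy_score(y_true, y_pred)
model_precision = precision_score(y_true, y_pred)
model_recall = recall_score(y_true, y_pred)
model_f1 = f1_score(y_true, y_pred)

print(f"baseline: accuracy={baseline_accuracy:.3f} recall={baseline_recall:.3f}")
print(
    f"model:    accuracy={model_accuracy:.3f} precision={model_precision:.3f} "
    f"recall={model_recall:.3f} f1={model_f1:.3f}"
)
```

The baseline scores 0.990 accuracy with zero recall, while the hypothetical model scores a slightly lower 0.986 accuracy but catches 80% of the fraud, which is why accuracy alone is the wrong scoreboard here.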

### What is a data science case study interview, and how should I structure my answer?

A case study presents an open business problem, like "build a model to predict customer churn" or "sales dropped 20%, investigate." Structure your answer as: clarify the problem and the business action that follows from a prediction, define the target variable carefully while watching for data leakage, engineer meaningful features, handle realistic class imbalance, select and justify a model with the right tradeoff between accuracy and interpretability, validate honestly with time-aware splits, and close the loop by explaining how you'd measure real-world success after launch, not just an offline metric.

### What is data leakage and why do interviewers care about it so much?

Data leakage happens when a model has access, during training, to information it wouldn't actually have at prediction time — most commonly when the target variable is defined using data from after the prediction point, or when a feature is itself derived from the outcome. It produces models that look excellent in testing and fail in production because the leaked information won't be available when the model actually needs to make a prediction. Interviewers probe for it because catching a leakage risk unprompted is one of the clearest signals that a candidate has actually built and shipped models, not just trained them on clean benchmark datasets.

### How should I prepare for a data scientist interview?

Build statistics fundamentals to the point where you can produce a worked example on demand, not just a definition; get fluent with evaluation metrics and feature engineering, since interviewers weight these as heavily as algorithm choice; internalize the data-leakage instinct for case studies; and rehearse explaining your reasoning out loud, since data science interviews are as much a communication test as a technical one. Practicing with a voice-based mock interview that asks realistic follow-up questions closes the gap that silent prep (reading question dumps or practicing alone) leaves open.

### Do I need to know deep learning for a data scientist interview, or is classical ML enough?

For most data scientist roles (as opposed to dedicated machine learning engineer or applied scientist roles), classical ML — regression, tree-based models, gradient boosting, clustering — and strong statistics cover the large majority of what's asked, because most business data science work is tabular data, not images or text requiring deep learning. Knowing when *not* to reach for a neural network, because a simpler model is more interpretable and just as accurate on structured data, is itself a stronger signal than knowing the latest architecture. If a role specifically involves NLP or computer vision, expect deep-learning fundamentals to be tested too — check our [machine learning engineer interview guide](/blog/machine-learning-engineer-interview-questions) for that track.

Data science interviews reward producing a clear, worked explanation under follow-up questions — not reciting a definition you've read a hundred times. Greenroom runs spoken mock interviews that push on your statistical reasoning and case-study communication, with feedback on every answer. Free to start.