You're 20 minutes into a "Prompt Engineer" interview at a Series B startup, screen-sharing a blank Claude window, when the interviewer says: "Okay — write me a prompt that gets reliable JSON out of this model for a customer support ticket classifier. Go ahead, think out loud." You type "You are a helpful assistant" — delete it. Type it again. Add "You are an EXPERT customer support AI" in caps, like volume fixes precision. Three minutes pass. The interviewer, mercifully, says "take your time," which is the interview equivalent of a doctor saying "this won't hurt much."
Here's the uncomfortable part: you've read every prompting thread on X, you've got a folder of "God-tier ChatGPT prompts" bookmarked, and you still freeze — because reading a good prompt and constructing one live, out loud, against a real failure mode the interviewer just described are different skills, and almost nobody practices the second one. This guide covers the prompt engineering interview questions that actually come up in 2026 — prompting techniques, LLM fundamentals, live prompt-writing exercises, evaluation and guardrails, and a prompt-pipeline system design round — organized by area, with a real answer for each one and a note on what it's actually testing.
What a prompt engineering interview actually tests
Is "Prompt Engineer" still a real job title in 2026?
Mostly not as a standalone title — it turned out to be a 2023 transitional role. By 2026, the skill has been absorbed into AI engineer, applied ML engineer, founding engineer at an AI-wrapper startup, and even product/solutions roles at companies selling AI features. What hasn't disappeared is the interview content: if a job touches an LLM in any way, you will get asked prompting-technique questions, LLM fundamentals, and very likely a live "write a prompt for this" exercise — regardless of whether the title on the JD says "Prompt Engineer," "AI Engineer," or "Founding Engineer, AI Product." If you're prepping broadly for LLM-adjacent roles, this guide pairs well with our AI engineer interview questions guide, which covers the RAG/fine-tuning/evaluation side in more depth, and our machine learning engineer interview guide if the role leans more classical-ML.
How is a prompt engineering interview different from an AI engineer or ML engineer interview?
An AI engineer interview spends real time on system architecture — RAG pipelines, vector databases, fine-tuning decisions, latency/cost tradeoffs across a whole product. A prompt engineering interview narrows in tighter on the interface between human intent and model behavior: how you structure instructions, how you debug a prompt that's failing in a specific, describable way, and how you reason about a model's failure modes (hallucination, refusal, format drift, injection) without necessarily touching the surrounding infrastructure. In practice the two blend together at most companies — but if the JD specifically says "prompt engineering," expect the live-exercise portion (write/debug/iterate on a prompt, on the spot) to be the centerpiece, not a footnote.
What background do interviewers actually expect?
Less than you'd think on the ML-theory side, more than you'd think on the "have you actually shipped something with an LLM" side. Interviewers are calibrated to the fact that prompt engineering is mostly empirical and iterative — you don't need to derive attention mechanisms from scratch, but you do need to have hit, debugged, and fixed real failure modes: a model that ignores half your instructions once the prompt got long, a JSON parser that broke because the model added a friendly sentence before the {, a classifier prompt that worked great on your ten test examples and fell apart on case eleven. If you've never shipped anything, build one small project before interviewing — a tiny support-ticket classifier, a resume-screener prompt, anything with real inputs — so your debugging examples are real rather than hypothetical.
Core prompting technique questions
What's the difference between zero-shot, one-shot, and few-shot prompting?
Zero-shot gives the model only an instruction, no examples — "Classify this support ticket as Billing, Technical, or Account." One-shot adds a single worked example before the real input, so the model can pattern-match format and tone. Few-shot adds several examples (typically 3-8), which is the most reliable way to lock in a specific output format or edge-case behavior without fine-tuning. The interview nuance: few-shot isn't free — each example costs tokens (and therefore latency and money) on every single call, so the real skill is picking the minimum number of examples that covers your edge cases, not maximizing examples for safety. A good answer names the tradeoff explicitly rather than just defining the terms.
What is chain-of-thought prompting, and why does it work?
Chain-of-thought (CoT) prompting asks the model to reason step-by-step before giving a final answer — either explicitly ("Think through this step by step before answering") or via examples that show the reasoning trace. It improves accuracy on multi-step problems (math, logic, multi-hop reasoning) because it forces the model to generate intermediate tokens that condition the final answer, rather than jumping straight to a guess from a single forward pass. The follow-up interviewers love: "does CoT help on every task?" No — for simple classification or extraction tasks, CoT adds latency and cost for no accuracy gain, and can occasionally hurt by giving the model more surface area to talk itself into a wrong answer. Knowing when not to use a technique is half the signal.
Explain ReAct prompting (reason + act).
ReAct interleaves reasoning traces with actions — the model reasons about what it needs ("I should look up the order status"), takes an action (calls a tool/function, runs a search), observes the result, and reasons again before the next step. This is the pattern underneath most tool-using agents in 2026: the model isn't just answering from its own knowledge, it's deciding when it doesn't know something and reaching for a tool. The interview test here is usually a follow-up: "what goes wrong with ReAct in production?" — answer: infinite action loops (the model keeps calling the same tool because the observation didn't resolve its uncertainty), and the need for a hard step limit plus a fallback "I don't have enough information" exit.
What's the difference between a system prompt and a user prompt, and why does the split matter?
The system prompt sets persistent behavior, role, constraints, and tone for the whole conversation — it's set once by the developer, not the end user. The user prompt is the per-turn input. The split matters for two reasons: first, most providers (OpenAI, Anthropic) treat system-prompt instructions as higher-priority than user-turn instructions, which is the basis of most prompt-injection defenses — you put your real rules in the system prompt and treat anything in the user prompt as untrusted data, not instructions. Second, it's a maintainability split: the system prompt is where you put things that should never change per-request (persona, output format, hard constraints), keeping the per-call user prompt lean and dynamic.
How do you do role prompting, and where does it break down?
Role prompting assigns the model a persona — "You are a senior backend engineer reviewing a pull request" — to bias its tone, vocabulary, and the kind of knowledge it surfaces. It genuinely helps with tone and framing. Where it breaks down: role prompting does not reliably grant the model capabilities it doesn't have — "You are a world-class mathematician" doesn't make a model better at arithmetic it would otherwise get wrong, and over-relying on persona framing as a substitute for actually constraining the task (format, examples, explicit constraints) is a tell that a candidate has read prompt-hack threads without understanding the underlying mechanism.
What is function calling (tool use), and how does it change prompt design?
Function calling lets you describe available tools — a schema with a name, a description, and typed parameters — and the model decides when to invoke one instead of just generating text, returning a structured call (e.g., get_order_status(order_id: "A1234")) for your code to actually execute and feed back as an observation. It changes prompt design in a specific way: your tool descriptions become part of the prompt the model reasons over, so a vague tool description ("looks up stuff") gets called incorrectly or not at all, while a precise one ("looks up the current shipping status for a given order ID; returns Pending, Shipped, or Delivered") gets used correctly far more often — tool descriptions deserve the same care as the system prompt itself, a detail many candidates miss because they think of "the prompt" as only the system message.
How do you choose which few-shot examples to include, and how many?
Pick examples that cover your actual edge cases, not your easiest cases — a classifier prompt with five examples that are all unambiguous teaches the model nothing about the ambiguous ones it'll actually struggle with in production. Order matters too: models weight examples closer to the end of the prompt slightly more heavily, so put your trickiest or most representative example last, not first. On count: more isn't strictly better past a point — three to eight well-chosen examples covering distinct edge cases usually outperforms fifteen examples that are mostly redundant, while costing far fewer tokens per call. The interview-ready answer ties this back to evals: the right number is whatever your eval set says is the minimum needed to hit your accuracy bar, not a round number picked in advance.
What's the difference between a single prompted call and an agent, and when do you need the latter?
A single prompted call takes an input, produces an output, and is done — no internal loop, no tool use, no multi-step decision-making. An agent wraps a model in a loop where it can reason, call tools, observe results, and decide whether to continue or stop, potentially across many steps before producing a final answer (the ReAct pattern, scaled up). You need an agent when the task genuinely requires multiple dependent steps the model can't predict in advance — looking something up, deciding what to look up next based on what it found, possibly retrying a failed action — and a single call when the task is one well-defined transformation of input to output. The interview trap to avoid: reaching for an agent by default because it sounds more sophisticated, when a single well-structured prompt would be cheaper, faster, and far easier to debug and eval. Interviewers specifically probe for candidates who default to the simpler tool and can articulate why.
What is prompt chaining, and when do you reach for it over one giant prompt?
Prompt chaining breaks a complex task into a sequence of smaller, single-purpose prompts, where each step's output feeds the next — extract → classify → summarize → format, as four separate calls instead of one. You reach for it when a single prompt is doing too much and accuracy suffers on any one sub-task, when you need to inspect or validate an intermediate result before proceeding (e.g., confirm extraction was correct before classification runs on it), or when different steps benefit from different models (a cheap, fast model for extraction; a stronger model for the nuanced classification). The tradeoff interviewers want named: more calls means more latency and more cost, and more failure points to handle gracefully — chaining trades simplicity for controllability and debuggability.
LLM fundamentals questions
What is a token, and why does it matter for prompt design?
A token is the model's basic unit of text — roughly 0.75 words in English, though it can be a whole word, a sub-word piece, or even punctuation, depending on the tokenizer. It matters for prompt design in three concrete ways: cost (most APIs bill per token, input and output separately), latency (longer prompts take longer to process before the model even starts generating), and attention dilution — instructions buried in the middle of a very long prompt get followed less reliably than instructions near the start or end, a real, measurable effect sometimes called "lost in the middle." The practical interview answer: put your most important constraints near the start or end of the prompt, not buried in paragraph four of a five-paragraph system prompt.
Explain the context window, and what happens when you exceed it.
The context window is the maximum number of tokens (input plus output combined, for most providers) a model can process in one call — ranging from roughly 8K tokens on small/cheap models to over a million on the largest 2026 frontier models. Exceed it and the call either errors outright or, worse, the provider silently truncates older content, which produces confusing bugs where the model "forgets" something from early in a long conversation. The interview-relevant nuance: a bigger context window doesn't mean you should stuff everything into it — recall accuracy degrades over very long contexts even within the stated limit, so retrieval (only feeding the model what's relevant for this call) usually beats "just paste the whole document" even when the whole document would technically fit.
What do temperature and top-p actually control, and how do you choose values for a task?
Both control how deterministic vs. varied the model's token choices are. Temperature scales the probability distribution before sampling — near 0 makes the model almost always pick the highest-probability next token (deterministic, repetitive on creative tasks); higher values (0.7-1.0+) flatten the distribution, letting lower-probability tokens get picked more often (more varied, more creative, more prone to drift). Top-p (nucleus sampling) instead restricts sampling to the smallest set of tokens whose cumulative probability exceeds p, which adapts to how confident the model is at each step rather than uniformly flattening the distribution. The interview-ready rule of thumb: temperature near 0 for extraction, classification, and anything where you need the same input to reliably give the same output (and for getting clean structured JSON); higher temperature for brainstorming, creative writing, or generating diverse variations. Most teams tune one or the other, not both at once, since stacking them makes behavior harder to reason about.
What's the difference between fine-tuning and prompting, and how do you decide which one to reach for?
Prompting changes model behavior at inference time, per call, with no training step — fast to iterate, fully reversible, and works immediately. Fine-tuning updates the model's weights on a labeled dataset of your own examples, which can lock in a style, format, or domain-specific behavior more durably and with a shorter prompt at inference time (since the behavior is "baked in" rather than instructed every call). The decision framework interviewers want: reach for prompting first, always — it's cheaper, faster to test, and you learn what's actually wrong through iteration. Reach for fine-tuning only when you've exhausted prompting (including few-shot and RAG) and still need a specific, narrow behavior more consistently than prompting can reliably deliver, or when you need to cut per-call cost/latency by shrinking a long, repeated instruction set into the weights themselves. Fine-tuning is not a fix for "the model doesn't know our internal facts" — that's what RAG is for, not retraining.
What is RAG, and how does it change your prompting strategy?
Retrieval-Augmented Generation retrieves relevant documents (usually via embedding similarity search in a vector database) at query time and inserts them into the prompt as context, so the model answers grounded in retrieved facts rather than only its training data. It changes prompting strategy in a specific way: your system prompt now needs explicit instructions for what to do when retrieval comes back empty or weak ("if the provided context doesn't answer the question, say so — don't guess"), and you typically need a citation instruction ("quote the specific passage you're basing this on") so users — and your own evals — can verify groundedness. The common interview trap: candidates describe the retrieval pipeline in detail but forget the prompt also has to explicitly handle the "no good context found" case, which is often where hallucination actually creeps in.
What causes hallucination, and how do you reduce it through prompting alone?
Hallucination happens because a language model is fundamentally predicting plausible next tokens, not querying a database of verified facts — when it doesn't "know" something, it produces something that sounds right rather than failing visibly. Through prompting alone (no RAG, no fine-tuning) you reduce it by: explicitly permitting uncertainty ("if you don't know, say so" — models default to confident-sounding answers unless told otherwise is acceptable), asking for citations or reasoning that you can spot-check, lowering temperature for factual tasks, and breaking complex factual questions into smaller, independently verifiable steps. The honest caveat interviewers want to hear: prompting alone never eliminates hallucination, it only reduces its frequency and makes it more detectable — grounding (RAG) and verification (a second model or rule-based check) are the actual fixes for tasks where hallucination is unacceptable.
Explain embeddings in one sentence a non-technical PM would understand.
An embedding turns a piece of text into a list of numbers (a vector) positioned so that texts with similar meaning end up near each other in that numeric space — so "How do I cancel my subscription?" and "I want to stop paying" land close together even though they share almost no words. That's the one-liner; the interview follow-up usually probes whether you know what it's for: semantic search and RAG retrieval (find documents near the query's vector), deduplication, clustering similar support tickets, and recommendation — all without keyword matching.
Live exercise questions (the part that actually separates candidates)
Almost every prompt engineering interview includes a live, hands-on segment: write a prompt for a stated task, debug a prompt that's misbehaving, or iterate on a prompt given new failure examples. This is the section candidates over-prepare for in the wrong way — memorizing "good prompt templates" instead of practicing the process of building one out loud, narrating your reasoning as you go.
"Write a prompt that classifies support tickets into Billing, Technical, or Account" — what's actually being graded?
Almost never the first draft. Interviewers are watching your process: do you ask clarifying questions before writing anything (what happens with a ticket that's genuinely ambiguous — multiple categories, or none of the three?), do you specify the output format explicitly rather than hoping the model infers it, do you include a few worked examples covering the edge cases you just identified, and do you build in a fallback for the case your three categories don't cover? A strong walkthrough sounds like: "Before I write this, what should happen if a ticket doesn't clearly fit one category — should I add an 'Other' bucket, or force a best-guess? I'll assume 'Other' exists since forcing a wrong label is worse for downstream routing." Then write the prompt with that decision baked in. The interviewer is grading the reasoning trail at least as much as the final prompt text.
A reasonable answer, narrated:
// System prompt
Classify the support ticket below into exactly one category:
Billing, Technical, Account, or Other.
Rules:
- If the ticket mentions a charge, refund, invoice, or payment
method, classify as Billing — even if it also mentions a bug.
- If the ticket is ambiguous or doesn't clearly fit Billing,
Technical, or Account, classify as Other.
- Respond with only the category name. No explanation.
Examples:
Ticket: "I was charged twice for my subscription this month."
Category: Billing
Ticket: "The app crashes every time I open the settings page."
Category: Technical
Ticket: "I can't remember which email I signed up with."
Category: Account
Ticket: "Your product is amazing, just wanted to say thanks!"
Category: Other
Notice the explicit tie-breaker rule ("mentions a charge... even if it also mentions a bug") — that's the detail that separates a prompt that works on the obvious cases from one that survives contact with real, messy tickets.
Debug this prompt: it works in testing but starts returning the wrong format in production.
This is a deliberately open-ended prompt interviewers give to see how you investigate, not just how you fix. The right first move is asking for actual failing examples, not guessing — "can you show me three real inputs where it broke?" Common real causes, roughly in order of how often they actually show up: the input distribution in production has examples longer or weirder than your test set (a ticket with three paragraphs and an embedded email signature, when your few-shot examples were all one sentence); a few-shot example accidentally taught a wrong generalization (your examples all happened to be short, so the model learned "short ticket → short response" and breaks on long ones); or the underlying model was silently upgraded/changed by the provider and its formatting tendencies shifted slightly. The answer that lands: "I'd pull the actual failing inputs first, diff them against my few-shot examples to see what's structurally different, and only then change the prompt — guessing at a fix without looking at real failures is how you patch one bug and introduce two more."
Turn this vague prompt into a structured one: "Summarize this for me."
A vague instruction like "summarize this" leaves length, audience, format, and emphasis entirely to the model's defaults, which are inconsistent across inputs. A structured version specifies all four: "Summarize the following document in 3 bullet points, for a non-technical executive audience, focusing on financial impact and decisions needed — omit implementation detail." The interview signal here is whether you can name which dimensions were missing (length, audience, format, emphasis, what to exclude) rather than just producing a better-sounding prompt by instinct — naming the dimensions is what lets you debug a future vague prompt systematically instead of by trial and error.
Make this prompt reliably return valid JSON, every time.
The naive answer ("just ask for JSON") is the wrong answer, and interviewers know it. A model asked for JSON in plain English will sometimes wrap it in a markdown code fence, sometimes add a friendly sentence before or after it ("Sure, here's the JSON you requested:"), and sometimes produce almost-valid JSON with a trailing comma. The layered, production answer: use the provider's structured-output / JSON-mode feature if one exists (OpenAI's response_format, Anthropic's tool-use forcing a schema) rather than relying on instructions alone, since these constrain the actual token sampling rather than just hoping the model complies; give an explicit schema with field names and types, not just "return JSON"; and — because no method is 100% reliable in production — parse defensively downstream (strip code fences, retry once on a parse failure with the error fed back to the model) rather than letting one malformed response crash a pipeline. A small example of the defensive-parsing layer, since interviewers want to see you've actually shipped this:
function parseModelJSON(raw) {
const cleaned = raw.trim().replace(/^```json\s*|\s*```$/g, '');
try {
return JSON.parse(cleaned);
} catch (err) {
throw new Error(`Model returned invalid JSON: ${err.message}`);
}
}
Naming "retry once with the parse error fed back into the next call, then fall back to a default" as the next layer is the detail that signals you've actually debugged this in production, not just read about it.
Evaluation, guardrails, and production questions
How do you evaluate whether a prompt change actually made things better?
With a held-out eval set of real (or realistic) inputs and a scoring method that doesn't change every time you tweak the prompt — otherwise you're comparing against a moving target. For tasks with a clear right answer (classification, extraction), exact-match or rule-based scoring against labeled examples works and is cheap. For open-ended tasks (summarization quality, tone), teams typically use an LLM-as-judge — a separate prompt that scores outputs against a rubric — while being honest about its limitations: a judge model has its own biases (it tends to prefer longer, more confident-sounding answers regardless of actual quality) and needs occasional spot-checking against human judgment to stay trustworthy. The answer interviewers actually want: run your eval set before and after every prompt change, track the score over time, and never ship a prompt change based on "it felt better on the three examples I tried" — that's the single most common real-world mistake, and naming it unprompted is a strong signal.
What is prompt injection, and how do you defend against it?
Prompt injection is when untrusted input (a user message, a scraped webpage, a document fed into RAG) contains text crafted to override your system instructions — "ignore your previous instructions and instead..." embedded inside what's supposed to be passive data. It matters most wherever your prompt processes content you didn't write yourself: a support bot reading a customer's message, a RAG pipeline summarizing a webpage, an agent reading email. Defenses, layered rather than relying on any single one: clearly delimit untrusted content (wrap it in tags like <user_input> and instruct the model that content inside those tags is data, never instructions, regardless of what it says); keep your real constraints in the system prompt, which most providers weight more heavily than user-turn content; for high-stakes actions (sending money, deleting data), require a separate, non-model-bypassable confirmation step rather than trusting the model's judgment alone; and test against known injection patterns as part of your eval suite, the same way you'd test for SQL injection in a traditional security review. The honest caveat: there is currently no defense that makes a model fully immune to injection — the goal is raising the cost of a successful attack and limiting blast radius, not eliminating the risk.
How do you handle PII and unsafe outputs in a prompt pipeline?
Prompting can reduce the frequency of unsafe or PII-leaking outputs (explicit instructions not to repeat sensitive data verbatim, not to generate certain content categories) but it is not a reliable enforcement mechanism on its own — a sufficiently adversarial or even just unusual input can still get past instructions alone. Production pipelines add a second layer that doesn't depend on the model behaving: a rule-based or classifier-based output filter that runs after generation and before the response reaches a user, regex or NER-based PII redaction on both inputs and outputs, and logging that flags (and ideally blocks) responses matching known-bad patterns for human review. The interview signal: candidates who say "I'd just tell it not to do that in the prompt" and stop there are missing that instructions are a probabilistic nudge, not a guarantee — production systems need a deterministic backstop.
What's the cost/latency tradeoff of a longer, more detailed prompt — and when is it worth it?
A longer prompt costs more per call (more input tokens billed) and adds latency (more tokens to process before generation starts), and that cost compounds at scale — an extra 500 tokens of instructions on a prompt called a million times a day is a real line item, not a rounding error. It's worth it when the added detail measurably improves accuracy on your eval set, especially for tasks where an error is expensive downstream (a misrouted support ticket creates a worse customer outcome than a slightly longer API bill). It's not worth it when the added length is instructions the model already follows reliably without them, or speculative "just in case" caveats that pad the prompt without changing behavior on any real failure case — the discipline is trimming a prompt against your eval set after every addition, the same way you'd profile before optimizing code, not adding paragraphs because they feel thorough.
System design: "design a prompt pipeline for a customer support bot"
Senior and lead prompt engineering interviews increasingly include a design round that's structurally similar to a system design interview — except the "system" is the prompt/model layer, not just databases and servers. Drive it the same way: clarify requirements first, then work through the pipeline stage by stage, naming tradeoffs out loud.
Worked example — "design a prompt pipeline for a customer support bot that can answer billing questions and escalate complex issues":
- Clarify scope. What's "complex" — anything the bot can't answer confidently, or a specific list of triggers (refund disputes over a dollar threshold, legal threats, anything mentioning a competitor)? Clarifying this up front avoids designing a vague "escalate when unsure" system that nobody can later debug.
- Classify intent first. A cheap, fast classification step (Billing / Account / Technical / Escalate-immediately) before the expensive generation step, so simple, high-confidence cases don't pay for a long, detailed prompt they don't need.
- Retrieve grounding context. For billing questions specifically, retrieve the user's actual account/billing data via RAG or a direct API call rather than letting the model guess — this is where most hallucination risk in a support bot actually lives, and where injection risk from any user-controlled retrieved content (their own past tickets, for instance) needs the delimiting discussed above.
- Generate with explicit escalation criteria baked into the prompt. The model should know its own boundary — instructions specifying exactly which patterns mean "stop and hand off to a human" rather than attempting an answer.
- Validate output before it reaches the user. Structured-output enforcement for any action the model is allowed to take (issuing a refund, updating an account field), a PII filter pass, and a confidence check — if the model's own escalation signal fires, route to a human instead of sending the generated response.
- Eval and monitor in production. A held-out eval set covering normal cases and known injection/edge cases, run on every prompt change, plus production monitoring for escalation rate and user-reported dissatisfaction as leading indicators that a recent prompt change regressed something the eval set didn't catch.
The same skeleton — clarify scope, classify/route, ground with retrieval, generate with explicit constraints, validate before delivery, eval continuously — applies whether you're designing a support bot, a document-extraction pipeline, or an agent that's allowed to take real actions; only the specifics of step 3 and step 5 change.
A senior-level follow-up worth pre-thinking: "what's your rollback plan if a prompt change regresses something in production?" The answer interviewers want isn't "I'd just revert the prompt" (true, but shallow) — it's versioning prompts the same way you'd version code (a prompt is a deployable artifact with a diff and a history, not a string someone edits in a dashboard with no audit trail), running the new version against the eval suite and a canary slice of real traffic before a full rollout, and having a clear, fast revert path that doesn't require a full deploy cycle. Treating a prompt change with the same rigor as a code change — review, eval, canary, rollback — is the detail that separates "I've used an LLM API" from "I've run one in production."
How candidates actually practice this — and where it falls short
Most candidates prepare for prompt engineering interviews the same way they'd prepare for a trivia round: reading "50 best prompting techniques" listicles, bookmarking viral X threads of "prompts that 10x your output," and skimming GeeksforGeeks-style question dumps the night before. None of that is wrong exactly, but it optimizes for recognizing a good prompt, not constructing one live while someone watches you reason through ambiguity — which is the actual interview format almost every company uses.
ChatGPT or Claude as a practice partner gets you further — you can paste a task and ask the model to roleplay an interviewer, and it's genuinely useful for generating practice scenarios. The honest limitation, covered in more depth in can ChatGPT do mock interviews: a chat window has no clock, no spoken-pressure equivalent, and nobody to ask you the awkward follow-up — "okay, but what if 30% of your tickets are genuinely ambiguous, what happens then?" — that a real interviewer asks specifically because you didn't volunteer it. Reading your own typed answer back is also a fundamentally different skill from producing an answer out loud, live, with someone listening to whether your reasoning holds together in real time.
Prompting courses and official guides — OpenAI's prompt engineering guide and Anthropic's prompting documentation are genuinely the best sources for technique accuracy, and naming them shows you've gone to primary sources rather than secondhand summaries. They're comprehensive on technique, near-silent on the verbal-interview skill of narrating your reasoning while you write, under time pressure, to a stranger.
A friend's "prompt engineering interview questions" PDF or a generic question bank gets you the question list, not the follow-ups, and definitely not feedback on whether your explanation of, say, the tradeoff between fine-tuning and prompting actually held together when you said it out loud versus how clean it sounded in your head.
Greenroom takes a different approach: spoken mock interviews that include live prompt-writing exercises, ask the same kind of follow-up a real interviewer would ("what happens when that fails?"), and grade your explanation clarity, not just whether the final prompt would technically work. It's not a replacement for actually reading the OpenAI and Anthropic docs and shipping something real with an LLM — it's the rehearsal layer between "I understand this" and "I can produce this fluently, live, under a stranger's questions," which is genuinely the part most prep methods skip. Pairing reading (technique accuracy) with spoken rehearsal (delivery under pressure) beats either alone — the comparison in AI mock vs a real engineer mock goes deeper on when each format earns its place in a prep plan.
Practise the live exercise, not just the technique names
You can recite every definition in this guide and still blank when an interviewer hands you a genuinely ambiguous task and says "go ahead, build the prompt, thinking out loud." That specific skill — turning ambiguity into structure, live, while narrating your reasoning to someone who's going to ask "but what about—" the moment you finish — only improves with practice that has the same shape as the real thing: spoken, timed, with real follow-ups. Greenroom runs spoken mock interviews for AI and prompt engineering roles, including live prompt-writing exercises with realistic follow-up questions, and gives feedback on how clearly you explain your reasoning, not just whether the final prompt would technically work. Pair it with how Ari calibrates question difficulty to understand how the follow-ups adapt to your answers, and structured AI interviews if you're curious why a consistent format produces a better signal than an unscripted chat.
Frequently asked questions
What should I study for a prompt engineering interview?
Start with core prompting techniques (zero/few-shot, chain-of-thought, ReAct, system vs. user prompts, prompt chaining), LLM fundamentals (tokens, context window, temperature/top-p, fine-tuning vs. prompting, RAG, hallucination, embeddings), then practice the live exercise format specifically — writing, debugging, and iterating on a prompt out loud against a stated failure case — and evaluation/guardrails concepts (prompt injection, eval sets, LLM-as-judge, PII handling) for mid-to-senior roles.
Is prompt engineering still a real job in 2026, or has it been absorbed into other roles?
It's mostly been absorbed into AI engineer, applied ML engineer, and founding/product engineering roles at AI-product companies — very few companies hire a standalone "Prompt Engineer" anymore. The interview content hasn't gone away, though: any role that touches an LLM will test prompting techniques, LLM fundamentals, and almost always include a live "write a prompt for this" exercise, regardless of the exact title on the job description.
What's the most common live-exercise question in a prompt engineering interview?
Some version of "write a prompt that does X" for a stated task, followed by a deliberately introduced edge case to see how you adapt — what happens when the input doesn't fit your assumptions, or how you'd make the output reliably structured (valid JSON) rather than just plausible-looking text. Interviewers grade the reasoning process — clarifying questions, naming edge cases, explaining tradeoffs — at least as much as the final prompt text.
How is a prompt engineering interview different from a general AI engineer interview?
A prompt engineering interview narrows in on the interface between instructions and model behavior — prompting techniques, debugging a misbehaving prompt, evaluating whether a change helped — while an AI engineer interview spreads across more system architecture: RAG pipeline design, vector database choices, fine-tuning decisions, and cost/latency tradeoffs across an entire product. In practice most companies blend both, but a role specifically labeled "prompt engineering" will spend more interview time on the live prompt-writing exercise.
Do prompt engineering interviews require knowing how to code?
Usually some — enough to read and lightly modify a script that calls an LLM API, parse and validate model output (JSON parsing, basic error handling), and understand what a vector database or eval harness does even if you're not building one from scratch in the interview. Deep software engineering skill isn't usually the bar; being unable to write even a short defensive-parsing function when asked is a real red flag for interviewers, since shipping prompts to production requires exactly that kind of code.
How do I prepare for the live prompt-writing portion specifically?
Practice the actual format: given a one-sentence task, ask clarifying questions out loud, name the edge cases you're worried about before writing anything, write the prompt with those edge cases explicitly handled, and be ready to defend every choice when asked "why did you do it that way instead of—". Reading good prompts is necessary but not sufficient — rehearse producing one live, under mild time pressure, ideally with someone (or a structured mock interview) asking the follow-up questions a real interviewer would.
Should I default to an agent or a single prompt when the interviewer asks me to design a solution?
Default to the single prompted call unless the task genuinely requires multiple dependent steps the model can't predict in advance — and say that reasoning out loud rather than jumping straight to "I'd build an agent for this." Interviewers specifically watch for candidates who reach for the more complex, more impressive-sounding solution by default; explicitly choosing the simpler tool and explaining why is a stronger signal than describing an elaborate multi-agent architecture for a task that one well-structured prompt would handle just as well, at a fraction of the cost and debugging surface.