Site Reliability Engineering blends software engineering with operations, and its interviews test reliability concepts (SLIs/SLOs), incident response, monitoring, system design, and — the signature SRE skill — calm troubleshooting under failure. Here are the SRE interview questions that actually get asked. (See also our DevOps guide.)
Reliability concepts
- What are SLIs, SLOs, and SLAs?
- What is an error budget and how does it guide decisions?
- What does "reliability" actually mean, and how does availability math (the nines) work?
- Toil — what it is and why SREs automate it away.
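The availability math in the list above is worth being able to do on the spot. As a minimal sketch (the function name `allowed_downtime_minutes` and the 30-day default window are illustrative assumptions, not a standard), each extra nine cuts the permitted downtime by a factor of ten:

```python
MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(availability_pct: float, window_days: int = 30) -> float:
    """Minutes of downtime permitted in the window at a given availability target.

    The 30-day default window is an assumption for illustration; teams also
    commonly quote quarterly or yearly figures.
    """
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    unavailable_fraction = 1 - availability_pct / 100
    return unavailable_fraction * window_days * MINUTES_PER_DAY


if __name__ == "__main__":
    # Two nines through five nines, over a 30-day window.
    for target in (99.0, 99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% availability allows {minutes:.2f} minutes per 30 days")
```

For a yearly figure, pass `window_days=365`; interviewers often expect you to know the rough magnitudes rather than exact values.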
Monitoring & incidents
- The three pillars of observability — logs, metrics, traces.
- How do you handle an incident? On-call best practices.
- What goes into a good blameless post-mortem?
- Alerting — how do you avoid alert fatigue?
Systems & troubleshooting
- Linux and networking fundamentals (see our Linux guide).
- System design with a reliability focus (see our system design guide).
- "A service is down — walk me through how you debug it."
- Coding/scripting for automation.
How to prepare
The troubleshooting and incident rounds are conversational and high-pressure. Practise reasoning through failures calmly out loud. Greenroom runs spoken technical interviews that follow up on your reasoning. Pair it with our DevOps and Linux guides.
Frequently asked questions
What questions are asked in an SRE interview?
SRE interviews cover reliability concepts (SLIs, SLOs, SLAs, error budgets, availability math, toil), monitoring and observability (logs, metrics, traces), incident response and on-call, blameless post-mortems, alerting and alert fatigue, Linux and networking fundamentals, reliability-focused system design, troubleshooting scenarios like 'a service is down, debug it', and scripting for automation.
What is the difference between SLI, SLO and SLA?
An SLI (Service Level Indicator) is a measured metric of service behavior, like request latency or error rate. An SLO (Service Level Objective) is the target value for an SLI, such as 99.9% of requests completing in under 200 ms. An SLA (Service Level Agreement) is a contractual commitment to customers, usually with penalties, and is typically looser than the internal SLO. The error budget is the gap between the SLO and 100% reliability, that is, the amount of unreliability the SLO permits.
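To make the SLI/SLO distinction concrete, here is a small sketch that measures a latency SLI from a list of request latencies and compares it with the SLO target. The helper names (`latency_sli`, `meets_slo`) and the sample data are hypothetical; a real system would compute this from its metrics backend rather than an in-memory list.

```python
from typing import Sequence


def latency_sli(latencies_ms: Sequence[float], threshold_ms: float = 200.0) -> float:
    """SLI: the measured fraction of requests that finished under the threshold."""
    if not latencies_ms:
        raise ValueError("need at least one request to compute an SLI")
    fast = sum(1 for latency in latencies_ms if latency < threshold_ms)
    return fast / len(latencies_ms)


def meets_slo(sli: float, slo_target: float = 0.999) -> bool:
    """SLO: the target the measured SLI is expected to meet or exceed."""
    return sli >= slo_target


if __name__ == "__main__":
    # Made-up sample: 998 fast requests and 2 slow ones.
    sample = [120.0] * 998 + [450.0, 900.0]
    sli = latency_sli(sample)
    print(f"SLI = {sli:.4f}, meets 99.9% SLO: {meets_slo(sli)}")
```

The SLI is what you measure, the SLO is the number you compare it with, and the SLA would be a separate, usually looser, figure written into a customer contract.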
What is an error budget in SRE?
An error budget is the acceptable amount of unreliability derived from an SLO. For example, a 99.9% availability SLO leaves a budget of 0.1%, which is about 43 minutes of downtime over a 30-day window. It balances reliability and velocity: as long as the budget isn't exhausted, teams can ship features quickly; once it is burned, they pause feature work to focus on reliability. It turns reliability into a measurable, shared decision-making tool.
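The budget can also be tracked in request counts rather than minutes. The sketch below is one way to express that under simple assumptions: the function names and the ship-or-pause rule are illustrative, and real policies usually add burn-rate alerts and review steps on top.

```python
def error_budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = (1 - slo_target) * total_requests
    return 1 - failed_requests / allowed_failures


def release_decision(remaining: float) -> str:
    """Hypothetical policy: keep shipping while budget remains, otherwise pause."""
    return "ship features" if remaining > 0 else "pause feature work, fix reliability"


if __name__ == "__main__":
    # A 99.9% SLO over 1,000,000 requests allows roughly 1,000 failures.
    for failed in (400, 1200):
        remaining = error_budget_remaining(0.999, 1_000_000, failed)
        print(f"{failed} failures: {remaining:.0%} budget left -> {release_decision(remaining)}")
```

With 400 failures, about 60% of the budget is left and the team keeps shipping; with 1,200 the budget is overspent and the sketch's policy says to pause.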
How should I prepare for an SRE interview?
Study reliability concepts (SLIs/SLOs, error budgets), observability, incident response and post-mortems, plus Linux, networking and reliability-focused system design. Most importantly, practise the troubleshooting scenario ('the service is down, what now?') by reasoning through failures methodically and calmly out loud. A voice-based mock interview that asks follow-up questions is a good way to rehearse this, since composure under pressure is the signature SRE signal.