SLOs, SLIs, and Error Budgets: The Core of SRE

In modern software operations, "100% uptime" is a myth. Every system eventually fails. Site Reliability Engineering (SRE) accepts this reality and provides a mathematical framework for managing it.

This guide explores the three pillars of SRE reliability: SLIs, SLOs, and Error Budgets.

For a comparison of how SRE differs from DevOps and Platform Engineering, see DevOps, SRE, and Platform Engineering: A Comparative Guide.

Service Level Indicators (SLI)

An SLI is a quantitative measure of some aspect of the level of service that is provided. It answers the question: "How is the service performing right now?"

Common SLIs include:

Service Availability: The percentage of time the service is usable, typically measured by successful HTTP responses (e.g., 2xx/3xx status codes).

Request Latency: The time required to service a request, usually tracked via percentiles such as P50 (median) or P99 (the "worst-case" experience).

System Throughput: The total volume of requests or data units processed per second.

Error Rate: The percentage of valid requests that result in a failure or exception.

Data Freshness: For pipelines and asynchronous systems, this measures how recently the underlying data was successfully updated.

The SLI Equation

Most SLIs are expressed as a ratio:

\[\text{SLI} = \frac{\text{Good Events}}{\text{Valid Events}} \times 100\]
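The ratio above is straightforward to compute from event counters. A minimal Python sketch (the function name and the "no traffic" convention are illustrative, not from any particular library):

```python
def compute_sli(good_events: int, valid_events: int) -> float:
    """Return the SLI as a percentage: good events over valid events."""
    if valid_events == 0:
        # No valid traffic in the window; conventionally treated as meeting the SLI.
        return 100.0
    return good_events / valid_events * 100

# 999,543 successful responses out of 1,000,000 valid requests:
print(compute_sli(999_543, 1_000_000))  # → 99.9543
```

Note that the denominator is *valid* events: requests that were malformed or otherwise the client's fault are usually excluded so they don't count against the service.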

Service Level Objectives (SLO)

An SLO is a target value or range of values for a service level that is measured by an SLI. It answers the question: "How good do we want the service to be?" Instead of the vague "I want the service to be fast," an SLO states a measurable target, such as "99% of requests should complete in under 300ms."

Error Budgets and Availability: Most SLOs are defined in terms of availability over a rolling compliance window; the gap between the target and 100% is what becomes the error budget.

Examples

Availability SLO: 99.9% of requests over a rolling 30-day window should return a successful (2xx/3xx) status.

Latency SLO: 90% of requests over a rolling 30-day window should complete in less than 200ms.
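Checking a latency SLO like the one above amounts to computing a percentile over a window of measured latencies and comparing it to the threshold. A small Python sketch using the nearest-rank percentile method (function and variable names are illustrative):

```python
import math

def latency_slo_met(latencies_ms: list[float], percentile: float,
                    threshold_ms: float) -> bool:
    """True if the given percentile of latencies is under the threshold.

    Uses the nearest-rank method: the smallest sample such that at least
    `percentile` percent of all samples are at or below it.
    """
    ranked = sorted(latencies_ms)
    rank = math.ceil(percentile / 100 * len(ranked))  # 1-based nearest rank
    return ranked[rank - 1] < threshold_ms

samples = [120, 85, 95, 150, 110, 90, 300, 130, 105, 140]
print(latency_slo_met(samples, 90, 200))  # → True  (P90 here is 150 ms)
print(latency_slo_met(samples, 99, 200))  # → False (P99 lands on the 300 ms outlier)
```

In production you would compute this from a histogram in your metrics backend rather than raw samples, but the comparison against the SLO threshold is the same.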

Why not 100%?

Setting an SLO of 100% is usually a mistake because:

User Perceived Experience: Users typically do not notice the difference between 99.9% and 100% availability due to the inherent unreliability of their own internet connections.

Exponential Cost Scaling: The financial and engineering effort required to move from "three nines" to "four nines" is often 10x higher, providing diminishing returns for most products.

Innovation and Velocity: A requirement for zero failure creates a culture of risk-aversion, effectively freezing feature development and slowing time-to-market.

The Error Budget

The Error Budget is the most powerful concept in SRE. It is derived directly from the SLO and represents the amount of "unreliability" you are allowed to have.

\[\text{Error Budget} = 100\% - \text{SLO}\]

If your SLO is 99.9%, your Error Budget is 0.1%.
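Translating that 0.1% into concrete downtime makes the budget tangible. A quick Python sketch (assuming, for simplicity, that the whole budget is spent as full outages):

```python
def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of total downtime the error budget permits over the window."""
    error_budget = (100 - slo_percent) / 100
    return error_budget * window_days * 24 * 60  # days → minutes

print(allowed_downtime_minutes(99.9))   # → 43.2  minutes per 30 days
print(allowed_downtime_minutes(99.99))  # → 4.32  minutes per 30 days
```

The jump from "three nines" to "four nines" shrinks the budget tenfold, which is one concrete way to see the exponential cost scaling discussed above.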

The Error Budget acts as a neutral arbiter between Development (who want speed) and Operations (who want stability).

Saturated Budget: When the budget is full, the team can move fast, deploy risky features, and experiment with high-impact architectural changes.

Depleting Budget: If the budget is nearly empty, the team must prioritise stability fixes and improve automated testing.

Exhausted Budget: Once the budget is spent, all non-emergency changes are frozen until the compliance window recovers.

Connecting the Frameworks

Reliability management doesn't happen in a vacuum. It interacts with all other OpsAtScale modules:

DORA Metrics Correlation: Your SLOs directly influence key benchmarks such as Change Failure Rate and Mean Time to Recovery. See DORA and SPACE Metrics.

Observability Foundation: Accurate SLI measurement requires a mature observability stack. See Observability Quality Metrics.

Incident Management Lifecycle: SLO breaches trigger formal incidents and Root Cause Analysis / Post-Mortems to ensure systemic improvements are made.

Practical Implementation Checklist

Phase 1: Exploration

  • Identify your critical user journeys (e.g., "User can checkout", "User can search").
  • Choose SLIs that accurately reflect the health of those journeys.
  • Define what a "Good Event" and a "Valid Event" looks like.

Phase 2: Definition

  • Set attainable SLOs based on historical data.
  • Define the compliance window (e.g., rolling 7 days or 30 days).
  • Document these in an SLO Agreement between Product and Engineering.

Phase 3: Automation

  • Create SLO Dashboards in your observability tool (e.g., Prometheus, Grafana).
  • Set up Error Budget Alerts (alerting when the budget is burning too fast).
  • Automate the reporting of these metrics into your OKRs.
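A common way to implement error budget alerts is the burn rate: how fast the budget is being consumed relative to a steady spend over the whole compliance window. A minimal Python sketch (the 14.4x fast-burn threshold in the example is the widely cited figure from Google's SRE Workbook, corresponding to 2% of a 30-day budget spent in one hour):

```python
def burn_rate(error_rate: float, slo_percent: float) -> float:
    """How many times faster than 'sustainable' the budget is being spent.

    A burn rate of 1.0 consumes exactly the whole error budget over the
    compliance window; anything higher exhausts it early.
    """
    budget_fraction = (100 - slo_percent) / 100
    return error_rate / budget_fraction

# With a 99.9% SLO, a sustained 1.44% error rate burns the budget 14.4x
# too fast -- a typical threshold for a fast-burn page.
print(round(burn_rate(0.0144, 99.9), 1))  # → 14.4
```

Production setups usually evaluate this over multiple windows (e.g. paging on a fast burn over 1 hour, ticketing on a slow burn over 3 days) to balance alert speed against noise.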

Phase 4: SRE Maturity

  • Implement Error Budget Policy: Agree on what happens when the budget is zero.
  • SLOs-as-Code: Store your SLO definitions in Git using tools like OpenSLO.
  • Conduct SRE Toil Reviews to periodically automate away manual work discovered via SLO breaches.