Root Cause & Postmortem

"Root Cause Analysis" and "Postmortem" are two strategies used in problem-solving, usually in the fields of business, engineering, or software development. Both strategies are centred around identifying the cause of a problem or an issue, but they are used in different contexts and have slight differences in approach. Both are core practices in DevOps and SRE, and form the final "Learning" phase of the Incident Management Lifecycle.

Root Cause Analysis (RCA)

This process is often used to identify the fundamental cause of a problem or fault. The goal is to find the 'root' cause and then address that issue, to prevent the problem from recurring in the future. RCA typically involves a systematic approach where you look beyond the immediate causes and dig deeper into any underlying issues that led to the problem. It's often used proactively to improve business processes and reduce errors or issues. The RCA process can utilise several techniques, such as the "5 Whys" method, "Fishbone" or Ishikawa diagrams, and fault-tree analysis.

People Involved in RCA

The people involved in the RCA process would typically include those who are knowledgeable about the problem or the process in which the problem occurs. This could include team members involved in the day-to-day work, supervisors, process owners, or quality assurance team members. It is also helpful to include a diverse group of people who can bring different perspectives to the process.

Methods for Conducting RCA

Several methods and tools can be used to conduct a Root Cause Analysis, and the choice often depends on the complexity and nature of the problem. Here are a few examples:

The 5 Whys: A simple yet powerful technique involving asking "Why?" repeatedly until the root cause is isolated. Each layer of questioning peels back the symptoms to reveal the underlying systemic failure.

Fishbone Diagram (Ishikawa): A visual mapping tool used to display the potential causes of a problem. Causes are categorised into branches (e.g., People, Process, Technology) to identify the "spine" of the issue.

Fault Tree Analysis (FTA): A top-down, deductive failure analysis where an undesired state of a system is broken down using Boolean logic to understand high-level failure combinations.

Pareto Analysis: A statistical technique used to identify the "vital few" causes that account for the majority of problems, ensuring that remediation efforts are focused on the highest-impact issues.
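As an illustration of the Pareto technique described above, the following is a minimal sketch rather than a prescribed implementation. It assumes each past incident has already been tagged with a single cause label; the labels, the `vital_few` helper, and the 80% threshold are all hypothetical choices made for this example.

```python
from collections import Counter


def vital_few(causes, threshold=0.8):
    """Return the top causes whose cumulative share first reaches `threshold`.

    `causes` is an iterable of cause labels, one per incident.
    The 0.8 default mirrors the common 80/20 rule of thumb (an assumption).
    """
    counts = Counter(causes)
    total = sum(counts.values())
    selected = []
    cumulative = 0
    # most_common() yields causes from most to least frequent.
    for cause, count in counts.most_common():
        selected.append((cause, count))
        cumulative += count
        if cumulative / total >= threshold:
            break
    return selected


# Hypothetical cause tags taken from 20 past incidents.
incident_causes = (
    ["config change"] * 12
    + ["expired certificate"] * 5
    + ["disk full"] * 2
    + ["dependency outage"] * 1
)

for cause, count in vital_few(incident_causes):
    print(f"{cause}: {count} incidents")
```

In this made-up data set, two of the four causes cover at least 80% of incidents, so remediation effort would start there.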

How to Start a Root Cause Analysis

The process typically begins when a problem is identified. From there, a systematic set of steps is followed to uncover the root cause of that problem.

Problem Definition: Clearly define the failure in measurable, technical terms. Understand the full scope (who, what, where, when) and document every observable detail related to the issue.

Data Collection: Gather all relevant logs, metrics, and event traces. This quantitative data ensures that the analysis is based on evidence rather than assumptions.

Causal Factor Identification: Brainstorm all potential factors that could have contributed to the issue. Diversifying the group involved ensures a 360-degree view of the process.

Root Cause Isolation: Analyse the causal factors to pinpoint the fundamental underlying trigger. The goal is to identify a single point of failure where a specific change would have prevented the incident.

Corrective Action Implementation: Develop and execute a rigorous plan to eliminate the root cause. This often involves architectural changes or process automation.

Effectiveness Verification: Monitor the system post-fix to ensure the corrective actions have permanently resolved the issue without introducing regression.
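The final step, effectiveness verification, can be sketched as a comparison of error rates before and after the fix. This is only one possible check, under the assumption that failed and total request counts are available for both periods; the function names and the 0.1% acceptance threshold are hypothetical.

```python
def error_rate(failed, total):
    """Fraction of failed requests; 0.0 when there was no traffic."""
    return failed / total if total else 0.0


def fix_is_effective(before, after, max_after_rate=0.001):
    """Check that the error rate dropped and is within an accepted ceiling.

    `before` and `after` are (failed, total) tuples for comparable
    time windows. `max_after_rate` is an assumed acceptance threshold.
    """
    rate_before = error_rate(*before)
    rate_after = error_rate(*after)
    return rate_after < rate_before and rate_after <= max_after_rate


# Hypothetical request counts for the week before and the week after the fix.
week_before = (500, 100_000)
week_after = (20, 100_000)
print(fix_is_effective(week_before, week_after))
```

A check like this only covers the recurrence of the original symptom; regressions elsewhere in the system still need their own monitoring.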

The goal of RCA is not to assign blame but to understand what happened and prevent recurrence. It is an iterative process that serves as the foundation for continuous improvement. One best practice is to maintain a Lessons Learned repository.

Postmortem

A postmortem is typically conducted after a project or event has concluded, or after a major problem has occurred, to analyse what happened, why it happened, and how to prevent similar issues in the future. It's often used in software development following a major issue such as a service outage or a project that did not meet its goals. In this process, all aspects of the situation are examined in order to learn lessons and improve future performance. While a postmortem looks for root causes, it also includes a wider evaluation of what was done well, what could have been better, and how to improve next time. Effective observability tooling is a prerequisite for good postmortems; see the Observability Quality Metrics guide.

Postmortem Culture

Google's SREs view incidents and outages as inevitable given the scale and velocity of change in their systems. When an incident occurs, they fix the underlying issue, and services return to their normal operating conditions. However, unless there's a formalised process of learning from these incidents, they may recur indefinitely. Therefore, postmortems, which are written records of an incident, its impact, the actions taken to mitigate or resolve it, the root cause(s), and the follow-up actions to prevent the incident from recurring, are an essential tool for SRE. Postmortems are expected after any significant undesirable event, and writing a postmortem is not a punishment; it is a learning opportunity for the entire company. Furthermore, postmortems are blameless, focusing on identifying the contributing causes of the incident without blaming any individual or team for bad or inappropriate behaviour.

What is a Postmortem Procedure?

A Postmortem, sometimes loosely referred to as a Root Cause Analysis (RCA) although it is broader in scope, is a process for understanding and documenting the root cause of a failure in a system. The goal of the Postmortem process is to learn from the failure and prevent it from happening again in the future. It should not be a process for assigning blame. Instead, it should focus on what went wrong, why it went wrong, how to fix it, and how to prevent the same problem from happening again.

People Involved in a Postmortem

Several roles are typically involved in a postmortem.

Incident Author: The individual responsible for drafting the postmortem, typically the engineer who was on-call or primarily involved in the resolution.

Accuracy Reviewers: Subject matter experts (SMEs) who review the document for technical precision and completeness, ensuring the remediation steps are viable.

Strategic Approver: A team lead or engineering manager who provides the final sign-off, ensuring that the identified actions are prioritised in the backlog.

Affected Stakeholders: Internal and external parties notified of the incident impact. Their feedback ensures that client communication and service expectations are managed.

Methods for Conducting a Postmortem

Google's SRE book suggests a structured method for conducting a postmortem. Here are the main steps:

Data Gathering: Accessing all logs, metrics, error traces, and user reports related to the failure window.

Chronological Timeline Mapping: Creating a precise list of events leading up to and following the incident to identify the exact sequence of failure and recovery.

Root Cause Isolation: Performing a blameless analysis of the timeline to identify the technical or process trigger for the event.

Remediation Action Items: Proposing specific, trackable tasks (e.g., code changes, alerting thresholds) to prevent the incident from recurring.

Blameless Documentation: Publishing the final report in a public forum to share learnings and foster a high-transparency engineering culture.
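The chronological timeline step above can be sketched in a few lines. This is a minimal illustration, assuming events have been collected as ISO 8601 timestamps with a short label; the event data, the labels (`started`, `detected`, and so on), and the helper names are all hypothetical.

```python
from datetime import datetime

# Hypothetical incident events, deliberately listed out of order,
# as they might arrive from different logs and chat exports.
events = [
    ("2024-03-01T10:42:00", "mitigated", "Rolled back deployment"),
    ("2024-03-01T10:05:00", "started", "Error rate begins to climb"),
    ("2024-03-01T10:17:00", "detected", "Alert pages on-call engineer"),
    ("2024-03-01T11:30:00", "resolved", "Fix deployed and verified"),
]


def build_timeline(raw_events):
    """Parse timestamps and return the events in chronological order."""
    parsed = [
        (datetime.fromisoformat(ts), kind, note)
        for ts, kind, note in raw_events
    ]
    return sorted(parsed, key=lambda event: event[0])


def minutes_between(timeline, start_kind, end_kind):
    """Minutes elapsed between two labelled events in the timeline."""
    times = {kind: ts for ts, kind, _ in timeline}
    return (times[end_kind] - times[start_kind]).total_seconds() / 60


timeline = build_timeline(events)
for ts, kind, note in timeline:
    print(f"{ts:%H:%M} [{kind}] {note}")

time_to_detect = minutes_between(timeline, "started", "detected")
time_to_mitigate = minutes_between(timeline, "started", "mitigated")
print(f"Time to detect: {time_to_detect:.0f} min")
print(f"Time to mitigate: {time_to_mitigate:.0f} min")
```

Deriving detection and mitigation times from the same ordered timeline keeps the figures in the report consistent with the sequence of events it describes.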

How to Start a Postmortem

Starting a postmortem usually involves the following steps:

Threshold Identification: Postmortems are triggered by specific events, such as data loss, prolonged downtime, or critical security vulnerabilities.

Role Distribution: Assigning the author, reviewers, and approver to ensure accountability for the completion of the report.

Evidence Aggregation: Centralising all incident data into a collaborative workspace for team analysis.

Timeline Architecting: Drafting the initial sequence of events based on automated logs and responder Slack/Teams conversations.

Blameless Investigation: Launching the deep-dive analysis to understand the "systemic why" without focusing on human error.
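The threshold identification step can be made explicit in code so that the decision to hold a postmortem does not depend on individual judgement. The sketch below is one way to express it; the `Incident` fields and the 30-minute downtime limit are assumptions for illustration, and real criteria would be agreed by each team.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    """Hypothetical summary of an incident used for triage."""
    downtime_minutes: float
    data_loss: bool = False
    security_vulnerability: bool = False


def needs_postmortem(incident, max_downtime_minutes=30):
    """Return True if any trigger condition is met.

    Triggers follow the examples in the text: data loss, prolonged
    downtime, or a critical security vulnerability. The downtime
    limit is an assumed value.
    """
    return (
        incident.data_loss
        or incident.security_vulnerability
        or incident.downtime_minutes > max_downtime_minutes
    )


print(needs_postmortem(Incident(downtime_minutes=5)))
print(needs_postmortem(Incident(downtime_minutes=45)))
print(needs_postmortem(Incident(downtime_minutes=2, data_loss=True)))
```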

Remember, the goal of a postmortem is not to blame individuals, but to learn from mistakes and improve systems and processes. It's about fostering a culture of learning and continuous improvement.

Embracing Risk

Google doesn't try to build 100% reliable services, because extreme reliability comes with costs. Maximising stability can limit the speed at which new features can be developed and delivered to users and can dramatically increase their cost. Users typically don't notice the difference between high reliability and extreme reliability in a service because the user experience is dominated by less reliable components like the cellular network or the device they are using.

Managing Risk

In Site Reliability Engineering (SRE), risk is managed by balancing the risk of unavailability with the goals of rapid innovation and efficient service operations. The cost of increasing reliability doesn't increase linearly; an incremental improvement in reliability may cost significantly more than the previous increment. This costliness has two dimensions: the cost of redundant machine/compute resources, and the opportunity cost, which is the cost borne by an organisation when it allocates engineering resources to build systems or features that diminish risk instead of features that are directly visible to or usable by end users.

Measuring Service Risk

At Google, the most straightforward way of representing risk tolerance is in terms of the acceptable level of unplanned downtime, expressed as the number of "nines" of availability the service should provide: 99.9%, 99.99%, or 99.999%. Each additional nine corresponds to an order of magnitude improvement toward 100% availability. However, for globally distributed services, instead of using metrics around uptime, Google defines availability in terms of the request success rate.
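To make the "nines" concrete, the sketch below converts an availability target into the downtime it permits per year, and computes availability as a request success rate. It is an illustrative calculation only; the function names, the 365-day year, and the request counts are assumptions.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a 365-day year


def allowed_downtime_minutes(availability_percent, period_minutes=MINUTES_PER_YEAR):
    """Downtime budget implied by a time-based availability target."""
    return period_minutes * (1 - availability_percent / 100)


def request_availability(successful_requests, total_requests):
    """Availability defined as the fraction of requests that succeeded."""
    if total_requests == 0:
        raise ValueError("total_requests must be greater than zero")
    return successful_requests / total_requests


# Each additional nine cuts the yearly downtime budget by a factor of ten.
for target in (99.9, 99.99, 99.999):
    budget = allowed_downtime_minutes(target)
    print(f"{target}% availability allows about {budget:.1f} minutes of downtime per year")

# Hypothetical traffic: 1,000,000 requests, of which 999,500 succeeded.
availability = request_availability(999_500, 1_000_000)
print(f"Request success rate: {availability:.2%}")
```

The request-based definition has the practical advantage that it still yields a meaningful number when a service is partially degraded in some regions rather than fully up or down.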

In summary, while both methods aim to uncover the causes of problems, Root Cause Analysis is usually more narrowly focused on identifying and addressing the core cause of an individual problem, while a Postmortem is a broader review of performance, problems, and success factors after a project or major incident.