Intro¶
Chaos Engineering intentionally introduces failures, disruptions, or stress into a system to test its resilience and identify weaknesses. The goal is to improve the system's reliability and performance by discovering and addressing potential issues before they manifest in real-world situations, such as outages or degraded user experiences.
Why Chaos Engineering¶
There are several reasons why you may need Chaos Engineering:
Complex Systems: Modern software systems are increasingly distributed and complex, making it difficult to predict every failure scenario. Chaos Engineering helps you uncover these invisible dependencies before they fail in production.
Improve System Reliability: By proactively introducing failures and observing their impact, you can identify and address vulnerabilities before they lead to outages or performance degradation, significantly improving overall availability.
Continuous Learning: Chaos Engineering encourages a culture of constant learning. As you run experiments, you gain deep insights into your system's actual behaviour under stress, informing better architectural and operational decisions.
Mitigate Risk: Proactively fixing potential issues reduces the risk of unexpected outages that lead to customer dissatisfaction, loss of revenue, or reputational damage.
Optimise Resource Usage: Chaos experiments can highlight inefficiencies, such as underutilised servers or suboptimal configurations, providing clear data for cost savings and performance tuning.
Enhance Team Collaboration: Chaos Engineering fosters collaboration between development, operations, and security teams. This shared understanding ensures everyone is aligned on the requirements for a truly resilient system.
In summary, Chaos Engineering is a practical approach for testing and improving the resilience of your software systems. By proactively identifying and addressing potential issues, you can enhance system reliability, optimise resources, and foster a culture of continuous learning and collaboration within your organisation.
Planning Your First Chaos Experiment¶
Define the Scope: Begin by identifying the specific system, service, or component you want to test. Always start with a smaller, less critical part of your infrastructure before moving on to complex, core systems.
Identify Potential Weaknesses: Gather your team to brainstorm potential failure points. Consider internal and external dependencies, stateful data stores, and third-party API reliability.
Prioritise Scenarios: Rank identified weaknesses according to their likelihood and potential business impact. Focus your initial experiments on the most critical and probable scenarios first.
Formulate Hypotheses: For each scenario, develop a clear hypothesis about the expected outcome when the failure is introduced. Consider the impact on customers, dependencies, and service-level objectives (SLOs).
Define Key Performance Metrics: Establish metrics that correlate to business success (e.g., orders per minute, login success rate). Monitor these closely during the experiment to ensure the blast radius is controlled.
Design the Experiment: Create a detailed plan specifying the failure to be injected, the injection method (e.g., network latency, node termination), and the duration of the test.
Prepare Rollback Plans: Develop an immediate rollback plan to revert the experiment's impact. Ensure the team is ready to abort the experiment and return the system to its normal state if thresholds are breached.
Conduct the Experiment: Run the chaos experiment while closely monitoring KPIs and system behaviour. Be prepared to hit the "kill switch" if you observe unintended side effects.
Analyse Results: After completion, determine whether your hypothesis was correct. Identify any unexpected findings or "cascading failures" that were uncovered during the test.
Implement Improvements: Based on the data, identify the root causes of any failures and implement the necessary architectural or configuration changes to enhance resilience.
Repeat and Iterate: Chaos Engineering is a continuous process. Regularly re-run experiments after changes to ensure previously fixed weaknesses do not re-emerge.
Remember, Chaos Engineering is a proactive approach to improving system reliability. By regularly conducting chaos experiments, you can identify and address potential issues before they result in significant downtime or customer impact. So, embrace the process and have fun while improving your systems.
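The planning steps above can be sketched in code. The following is a minimal, illustrative Python sketch, not a production tool: the class, metric names, and thresholds are all invented for the example. It shows how an experiment plan can pair a hypothesis with abort thresholds, giving the team an automatic "kill switch" check:

```python
from dataclasses import dataclass, field


@dataclass
class ChaosExperiment:
    """A minimal experiment plan: scope, hypothesis, and abort thresholds."""
    scope: str
    hypothesis: str
    # KPI name -> maximum value tolerated while the experiment runs
    abort_thresholds: dict = field(default_factory=dict)

    def should_abort(self, observed_metrics: dict) -> bool:
        """Kill switch: abort if any observed KPI breaches its threshold."""
        return any(
            observed_metrics.get(kpi, 0) > limit
            for kpi, limit in self.abort_thresholds.items()
        )


experiment = ChaosExperiment(
    scope="checkout-service",
    hypothesis="Terminating one instance keeps the error rate below 1%",
    abort_thresholds={"error_rate_pct": 1.0, "p99_latency_ms": 500},
)

print(experiment.should_abort({"error_rate_pct": 0.2, "p99_latency_ms": 240}))  # healthy
print(experiment.should_abort({"error_rate_pct": 3.5, "p99_latency_ms": 240}))  # breach
```

In a real experiment, `observed_metrics` would be fed from your monitoring system on a tight polling loop, and a breach would trigger the rollback plan from step 7.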
Example of Chaos planning¶
| Scope | Potential Weaknesses | Hypotheses | Key Performance Metrics | Implementation Method | Metrics to Observe & Analyse |
|---|---|---|---|---|---|
| AWS EC2 Instances | Instance failure | An instance failure will not impact the overall service due to auto-scaling and load balancing | Latency, Request success rate | Terminate a random EC2 instance using AWS CLI or SDK, and observe the system's response | EC2 instance count, Auto Scaling events, Load balancer request distribution, CPU & memory usage |
| AWS RDS Database | Database connection failure | Connection failures will trigger auto-retry and fallback to read replicas without affecting users | DB connection errors, Latency | Introduce network issues between app and RDS using AWS Security Groups or VPC NACLs | Database connection errors, RDS replica lag, Query execution time, Application error rates |
| Kubernetes Nodes | Node failure | Kubernetes will reschedule affected pods to other healthy nodes without impacting user experience | Pod restart count, Node CPU usage | Drain a Kubernetes node using kubectl drain, then observe the pod rescheduling process | Node status, Pod status, Pod restarts, Node resource usage (CPU, memory) |
| Kubernetes Pod | Pod crash | Crashing pods will be automatically restarted, ensuring minimal user impact | Pod restart count, Latency | Introduce a fault within a pod, e.g., by using kubectl exec to kill a critical process | Pod status, Pod restarts, Container logs, Application error rates |
| AWS S3 | S3 latency increase | Increased S3 latency will cause delays but not service outages, as retries and timeouts will be handled | S3 latency, Request success rate | Use a tool like AWS Fault Injection Simulator (FIS) to simulate latency increase in S3 API calls | S3 request latency, S3 error rates, Application response time |
| Kubernetes Service | Network latency between services | Increased network latency will cause delays but not service outages due to built-in retries and timeouts | Service-to-service latency | Inject latency between services using a service mesh like Istio, or using tools like tc or iptables | Service-to-service latency, Request success rate, Application response time |
| AWS DynamoDB | Throttling errors | Throttling errors will be handled by retries with exponential backoff, ensuring limited user impact | Throttling errors, Latency | Temporarily decrease DynamoDB provisioned capacity or use AWS FIS to simulate throttling errors | Throttling errors, Read/write capacity utilisation, Latency, Application error rates |
| Kubernetes Ingress Controller | Ingress controller failure | A failure in the ingress controller will cause temporary service disruption, which will be resolved quickly | Ingress error rate, Latency | Disable or introduce faults to the ingress controller, e.g., by modifying its configuration or scaling | Ingress error rates, Ingress latency, Ingress controller logs, Pod status |
| AWS Lambda | Lambda function timeout | Lambda timeouts will be handled by retries and fallbacks, ensuring limited user impact | Lambda timeouts, Latency | Modify the Lambda function to include an intentional delay, or reduce the function's timeout setting | Lambda invocation count, Lambda duration, Lambda error rates, Cold start count, Application error rates |
| AWS Kinesis Streams | Stream processing delays | Delays in processing Kinesis streams will cause temporary lag in data processing, but not service outages | Stream processing delays | | Kinesis processing latency |
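Several of the hypotheses above (the RDS and DynamoDB rows in particular) assume that clients retry failed calls with exponential backoff. As a hedged illustration of that pattern, here is a self-contained Python sketch; `ThrottlingError` and `flaky_write` are stand-ins for a real provider error and a real dependency call:

```python
import random
import time


class ThrottlingError(Exception):
    """Stand-in for a provider throttling error (e.g., from DynamoDB)."""


def call_with_backoff(fn, max_attempts=5, base_delay=0.05):
    """Retry fn on throttling, doubling the delay each attempt, with jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except ThrottlingError:
            if attempt == max_attempts - 1:
                raise  # out of retries: surface the error to the caller
            delay = base_delay * (2 ** attempt) * random.uniform(0.5, 1.0)
            time.sleep(delay)


# Simulated dependency that throttles the first two calls, then succeeds.
attempts = {"count": 0}

def flaky_write():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ThrottlingError
    return "ok"


print(call_with_backoff(flaky_write))  # succeeds on the third attempt
```

A chaos experiment that injects throttling (for example via AWS FIS, or by lowering provisioned capacity) is what verifies this client-side assumption actually holds.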
You can find example code for running your first experiment in the AWS Fault Injection Simulator (FIS) documentation.
Chaos Engineering and Observability¶
Chaos Engineering and observability are closely linked concepts that complement each other in building and maintaining resilient and high-performing systems. While Chaos Engineering intentionally injects failures into a system to test its resilience, observability focuses on gathering, analysing, and visualising system data to understand its behaviour and performance.
Here's how Chaos Engineering and observability are linked and what you should know about their relationship:
Observability Is Crucial During Chaos Experiments: When conducting Chaos Engineering experiments, it's vital to have good observability in place. This allows you to monitor the system's behaviour and performance during the experiments, helping you understand how the system reacts to injected failures and stressors. You can then use this information to identify and address any uncovered weaknesses or issues.
Validate Hypotheses: Observability enables you to validate the hypotheses formed during experiments. By collecting and analysing real-time data, you can determine if your architectural assumptions hold true under pressure.
Detect Unintended Side Effects: Chaos experiments often reveal unexpected side effects or "ghost dependencies". Robust observability helps you detect these anomalies to understand and address their root causes.
Measure the Impact: Use observability to measure the exact impact of experiments on your KPIs, such as latency, error rates, and resource usage. This allows for a quantitative assessment of system resilience.
Improve System Understanding: Combining chaos experiments with observability provides a 360-degree view of your system. It links theoretical failure modes to observable data, making it easier to build and maintain high-performing infrastructure.
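Validating a hypothesis quantitatively means comparing metrics observed during the experiment against a steady-state baseline. The sketch below shows one simple way to express that comparison in Python; the metric names and the 10% tolerance are invented for illustration, and a real check would use your own SLO definitions:

```python
def hypothesis_holds(baseline: dict, during_experiment: dict,
                     max_relative_increase: float = 0.10) -> bool:
    """The hypothesis holds if no KPI degraded beyond the allowed
    relative increase compared to the steady-state baseline."""
    for kpi, base_value in baseline.items():
        observed = during_experiment.get(kpi, base_value)
        if observed > base_value * (1 + max_relative_increase):
            return False
    return True


baseline = {"p99_latency_ms": 220.0, "error_rate_pct": 0.1}
steady = {"p99_latency_ms": 230.0, "error_rate_pct": 0.1}    # within tolerance
degraded = {"p99_latency_ms": 480.0, "error_rate_pct": 2.4}  # clear regression

print(hypothesis_holds(baseline, steady))    # hypothesis validated
print(hypothesis_holds(baseline, degraded))  # hypothesis refuted
```

The important design choice is that the pass/fail criterion is defined before the experiment runs, so the observability data settles the question rather than post-hoc judgment.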