Intro¶
Chaos Engineering intentionally introduces failures, disruptions, or stress into a system to test its resilience and identify weaknesses. The goal is to improve the system's reliability and performance by discovering and addressing potential issues before they manifest in real-world situations, such as outages or degraded user experiences.
Why Chaos Engineering¶
There are several reasons why you may need Chaos Engineering:
Complex Systems: Modern software systems are increasingly distributed and complex, making it difficult to predict every failure scenario. Chaos Engineering helps you uncover these invisible dependencies before they fail in production.
Improve System Reliability: By proactively introducing failures and observing their impact, you can identify and address vulnerabilities before they lead to outages or performance degradation, significantly improving overall availability.
Continuous Learning: Chaos Engineering encourages a culture of constant learning. As you run experiments, you gain deep insights into your system's actual behaviour under stress, informing better architectural and operational decisions.
Mitigate Risk: Proactively fixing potential issues reduces the risk of unexpected outages that lead to customer dissatisfaction, loss of revenue, or reputational damage.
Optimise Resource Usage: Chaos experiments can highlight inefficiencies, such as underutilised servers or suboptimal configurations, providing clear data for cost savings and performance tuning.
Enhance Team Collaboration: Chaos Engineering fosters collaboration between development, operations, and security teams. This shared understanding ensures everyone is aligned on the requirements for a truly resilient system.
In summary, Chaos Engineering is a practical approach for testing and improving the resilience of your software systems. By proactively identifying and addressing potential issues, you can enhance system reliability, optimise resources, and foster a culture of continuous learning and collaboration within your organisation.
Planning Your First Chaos Experiment¶
Define the Scope: Begin by identifying the specific system, service, or component you want to test. Always start with a smaller, less critical part of your infrastructure before moving on to complex, core systems.
Identify Potential Weaknesses: Gather your team to brainstorm potential failure points. Consider internal and external dependencies, stateful data stores, and third-party API reliability.
Prioritise Scenarios: Rank identified weaknesses according to their likelihood and potential business impact. Focus your initial experiments on the most critical and probable scenarios first.
Formulate Hypotheses: For each scenario, develop a clear hypothesis about the expected outcome when the failure is introduced. Consider the impact on customers, dependencies, and service-level objectives (SLOs).
Define Key Performance Metrics: Establish metrics that correlate to business success (e.g., orders per minute, login success rate). Monitor these closely during the experiment to ensure the blast radius is controlled.
Design the Experiment: Create a detailed plan specifying the failure to be injected, the injection method (e.g., network latency, node termination), and the duration of the test.
Prepare Rollback Plans: Develop an immediate rollback plan to revert the experiment's impact. Ensure the team is ready to abort the experiment and return the system to its normal state if thresholds are breached.
Conduct the Experiment: Run the chaos experiment while closely monitoring KPIs and system behaviour. Be prepared to hit the "kill switch" if you observe unintended side effects.
Analyse Results: After completion, determine whether your hypothesis was correct. Identify any unexpected findings or "cascading failures" that were uncovered during the test.
Implement Improvements: Based on the data, identify the root causes of any failures and implement the necessary architectural or configuration changes to enhance resilience.
Repeat and Iterate: Chaos Engineering is a continuous process. Regularly re-run experiments after changes to ensure previously fixed weaknesses do not re-emerge.
Remember, Chaos Engineering is a proactive approach to improving system reliability. By regularly conducting chaos experiments, you can identify and address potential issues before they result in significant downtime or customer impact. So, embrace the process and have fun while improving your systems.
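The planning steps above can be sketched in code. The following is a minimal, illustrative Python sketch, not a production tool: the class, metric names, and thresholds are all invented for the example. It shows how an experiment plan can pair a hypothesis with abort thresholds, giving the team an automatic "kill switch" check:

```python
from dataclasses import dataclass, field


@dataclass
class ChaosExperiment:
    """A minimal experiment plan: scope, hypothesis, and abort thresholds."""
    scope: str
    hypothesis: str
    # KPI name -> maximum value tolerated while the experiment runs
    abort_thresholds: dict = field(default_factory=dict)

    def should_abort(self, observed_metrics: dict) -> bool:
        """Kill switch: abort if any observed KPI breaches its threshold."""
        return any(
            observed_metrics.get(kpi, 0) > limit
            for kpi, limit in self.abort_thresholds.items()
        )


experiment = ChaosExperiment(
    scope="checkout-service",
    hypothesis="Terminating one instance keeps the error rate below 1%",
    abort_thresholds={"error_rate_pct": 1.0, "p99_latency_ms": 500},
)

print(experiment.should_abort({"error_rate_pct": 0.2, "p99_latency_ms": 240}))  # healthy
print(experiment.should_abort({"error_rate_pct": 3.5, "p99_latency_ms": 240}))  # breach
```

In a real experiment, `observed_metrics` would be fed from your monitoring system on a tight polling loop, and a breach would trigger the rollback plan from step 7.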
Example of Chaos planning¶
| Scope | Potential Weaknesses | Hypotheses | Key Performance Metrics | Implementation Method | Metrics to Observe & Analyse |
|---|---|---|---|---|---|
| AWS EC2 Instances | Instance failure | An instance failure will not impact the overall service due to auto-scaling and load balancing | Latency, Request success rate | Terminate a random EC2 instance using AWS CLI or SDK, and observe the system's response | EC2 instance count, Auto Scaling events, Load balancer request distribution, CPU & memory usage |
| AWS RDS Database | Database connection failure | Connection failures will trigger auto-retry and fallback to read replicas without affecting users | DB connection errors, Latency | Introduce network issues between app and RDS using AWS Security Groups or VPC NACLs | Database connection errors, RDS replica lag, Query execution time, Application error rates |
| Kubernetes Nodes | Node failure | Kubernetes will reschedule affected pods to other healthy nodes without impacting user experience | Pod restart count, Node CPU usage | Drain a Kubernetes node using kubectl drain, then observe the pod rescheduling process | Node status, Pod status, Pod restarts, Node resource usage (CPU, memory) |
| Kubernetes Pod | Pod crash | Crashing pods will be automatically restarted, ensuring minimal user impact | Pod restart count, Latency | Introduce a fault within a pod, e.g., by using kubectl exec to kill a critical process | Pod status, Pod restarts, Container logs, Application error rates |
| AWS S3 | S3 latency increase | Increased S3 latency will cause delays but not service outages, as retries and timeouts will be handled | S3 latency, Request success rate | Use a tool like AWS Fault Injection Simulator (FIS) to simulate latency increase in S3 API calls | S3 request latency, S3 error rates, Application response time |
| Kubernetes Service | Network latency between services | Increased network latency will cause delays but not service outages due to built-in retries and timeouts | Service-to-service latency | Inject latency between services using a service mesh like Istio, or using tools like tc or iptables | Service-to-service latency, Request success rate, Application response time |
| AWS DynamoDB | Throttling errors | Throttling errors will be handled by retries with exponential backoff, ensuring limited user impact | Throttling errors, Latency | Temporarily decrease DynamoDB provisioned capacity or use AWS FIS to simulate throttling errors | Throttling errors, Read/write capacity utilisation, Latency, Application error rates |
| Kubernetes Ingress Controller | Ingress controller failure | A failure in the ingress controller will cause temporary service disruption, which will be resolved quickly | Ingress error rate, Latency | Disable or introduce faults to the ingress controller, e.g., by modifying its configuration or scaling | Ingress error rates, Ingress latency, Ingress controller logs, Pod status |
| AWS Lambda | Lambda function timeout | Lambda timeouts will be handled by retries and fallbacks, ensuring limited user impact | Lambda timeouts, Latency | Modify the Lambda function to include an intentional delay, or reduce the function's timeout setting | Lambda invocation count, Lambda duration, Lambda error rates, Cold start count, Application error rates |
| AWS Kinesis Streams | Stream processing delays | Delays in processing Kinesis streams will cause temporary lag in data processing, but not service outages | Stream processing delays | | Kinesis processing latency |
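Several of the hypotheses above (the RDS and DynamoDB rows in particular) assume that clients retry failed calls with exponential backoff. As a hedged illustration of that pattern, here is a self-contained Python sketch; `ThrottlingError` and `flaky_write` are stand-ins for a real provider error and a real dependency call:

```python
import random
import time


class ThrottlingError(Exception):
    """Stand-in for a provider throttling error (e.g., from DynamoDB)."""


def call_with_backoff(fn, max_attempts=5, base_delay=0.05):
    """Retry fn on throttling, doubling the delay each attempt, with jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except ThrottlingError:
            if attempt == max_attempts - 1:
                raise  # out of retries: surface the error to the caller
            delay = base_delay * (2 ** attempt) * random.uniform(0.5, 1.0)
            time.sleep(delay)


# Simulated dependency that throttles the first two calls, then succeeds.
attempts = {"count": 0}

def flaky_write():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ThrottlingError
    return "ok"


print(call_with_backoff(flaky_write))  # succeeds on the third attempt
```

A chaos experiment that injects throttling (for example via AWS FIS, or by lowering provisioned capacity) is what verifies this client-side assumption actually holds.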
You can find example code for running your first experiment in the AWS Fault Injection Simulator (FIS) documentation.
Chaos Engineering and Observability¶
Chaos Engineering and observability are closely linked concepts that complement each other in building and maintaining resilient and high-performing systems. While Chaos Engineering intentionally injects failures into a system to test its resilience, observability focuses on gathering, analysing, and visualising system data to understand its behaviour and performance.
Here's how Chaos Engineering and observability are linked and what you should know about their relationship:
Observability Is Crucial During Chaos Experiments: When conducting Chaos Engineering experiments, it's vital to have good observability in place. This allows you to monitor the system's behaviour and performance during the experiments, helping you understand how the system reacts to injected failures and stressors. You can then use this information to identify and address any uncovered weaknesses or issues.
Validate Hypotheses: Observability enables you to validate the hypotheses formed during experiments. By collecting and analysing real-time data, you can determine if your architectural assumptions hold true under pressure.
Detect Unintended Side Effects: Chaos experiments often reveal unexpected side effects or "ghost dependencies". Robust observability helps you detect these anomalies to understand and address their root causes.
Measure the Impact: Use observability to measure the exact impact of experiments on your KPIs, such as latency, error rates, and resource usage. This allows for a quantitative assessment of system resilience.
Improve System Understanding: Combining chaos experiments with observability provides a 360-degree view of your system. It links theoretical failure modes to observable data, making it easier to build and maintain high-performing infrastructure.
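Validating a hypothesis quantitatively means comparing metrics observed during the experiment against a steady-state baseline. The sketch below shows one simple way to express that comparison in Python; the metric names and the 10% tolerance are invented for illustration, and a real check would use your own SLO definitions:

```python
def hypothesis_holds(baseline: dict, during_experiment: dict,
                     max_relative_increase: float = 0.10) -> bool:
    """The hypothesis holds if no KPI degraded beyond the allowed
    relative increase compared to the steady-state baseline."""
    for kpi, base_value in baseline.items():
        observed = during_experiment.get(kpi, base_value)
        if observed > base_value * (1 + max_relative_increase):
            return False
    return True


baseline = {"p99_latency_ms": 220.0, "error_rate_pct": 0.1}
steady = {"p99_latency_ms": 230.0, "error_rate_pct": 0.1}    # within tolerance
degraded = {"p99_latency_ms": 480.0, "error_rate_pct": 2.4}  # clear regression

print(hypothesis_holds(baseline, steady))    # hypothesis validated
print(hypothesis_holds(baseline, degraded))  # hypothesis refuted
```

The important design choice is that the pass/fail criterion is defined before the experiment runs, so the observability data settles the question rather than post-hoc judgment.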