In today’s digital age, systems are becoming increasingly complex, and as a result, it’s becoming more difficult to predict and prevent failures. Chaos engineering is a discipline that aims to improve system resilience by intentionally injecting chaos into the system and measuring its response.
What is Chaos Engineering?
Chaos Engineering is the intentional act of injecting unexpected events, such as errors, failures, misconfigurations, or attacks, into a system to measure how resilient, fault-tolerant, and reliable it is.
It involves simulating unexpected events or failures in the system to identify weaknesses and improve fault tolerance.
FROM “PRINCIPLES OF CHAOS ENGINEERING”
The difference between Chaos Engineering and Testing
I’ve briefly explained what chaos engineering is, but you may be wondering why to use it in the first place. What does the practice of Chaos Engineering do that unit tests, performance tests, integration tests, and end-to-end tests do not?
With unit testing, we write a test case that checks the expected behavior of a single component in isolation from all external dependencies, whereas integration testing checks the interaction of individual, inter-dependent components. In both cases, the results verify known behaviors; they do not reveal new information about the application’s behavior, performance, or properties under unforeseen conditions.
The chaos engineering approach, by contrast, tests the application’s behavior under a wide range of unexpected conditions and unpredictable failures and faults.
One more way to look at the difference between these two approaches is from the outcome perspective: We can state that testing seeks to verify results against an expected outcome, whereas chaos engineering defines the desired outcome but then tries to disprove it.
Benefits of conducting Chaos Engineering
Let’s explore the important benefits of chaos engineering and see what we can gain.
- Increases Reliability and Resilience
By proactively injecting chaos into the application through unexpected events, errors, crashes, and attacks, we can design and implement countermeasures and safeguards beforehand, improving fault tolerance, error handling, and recovery, and eventually making the application more resilient and robust.
- Can increase end-user and stakeholder satisfaction
As the application is being tested through various scenarios against chaos, and by implementing countermeasures to these scenarios, we make the production environment more stable and performant.
Fewer system crashes and errors, together with better stability and performance, will raise confidence in the design and increase user satisfaction.
- Can improve response times to incidents
Troubleshooting, repairs, and incident reactions can be performed more quickly now that the technical crew has been trained and brought up to speed from earlier chaos experiments.
As a result, insights gained through chaos testing can lead to shorter reaction times for future production issues.
- Improve application performance and monitoring
Doing Chaos experiments on a regular basis can help you build more accurate monitoring reports, metrics, and tools. By detecting weak areas in the application that require performance improvement or better fault tolerance, for instance, we can update our monitoring dashboard to reflect the application’s behavior in these areas.
Chaos Engineering has recently evolved into an excellent tool that can help organizations improve not only overall system resilience, flexibility, and velocity, but also the operation of distributed systems.
Along with these benefits that I’ve just discussed, it can help detect and address problems before they can negatively affect the application in production.
Chaos Engineering in Practice
Netflix originally coined the term ‘Chaos Engineering’ and put together a set of core principles to help design Chaos Engineering experiments. These principles are available online at Principles of Chaos Engineering (principlesofchaos.org).
Core Principles:
- Define the “normal and steady” state of the system.
Focus on what is considered a steady state: what metrics define this state, and how does the system behave in it?
- Create a hypothesis that the steady state will not change in either the control group or the experimental group.
In other words, we hypothesize that if something bad happens (storage failure, network connectivity loss, etc.), it will not have any impact on the application or system, and the system will remain available.
- Apply real-world events
Create and apply real-world events that might happen in production. In this step, we want to test the hypothesis from step #2, so we intentionally create events that we assume should have an effect on the system: shut down servers randomly, disconnect storage, add latency to network traffic, overload the CPUs, and so on.
- Observe and analyze the results of the previous step
Compare the steady-state metrics and system behavior from step 1 with step 3. You can use various tools to collect and analyze the metrics (CloudWatch, Kibana, and the like).
If there is a difference between how we assumed the application would behave and how it actually behaved, this is a candidate for improving the application to protect against the event that caused the behavior.
For example, if a 5-second network disconnect in service X causes 95% of incoming requests to fail with HTTP 500s, service X is effectively unavailable and unusable for that period.
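The four core principles above can be sketched as a minimal experiment loop. This is a hypothetical outline, not a real tool: `get_error_rate`, `inject_fault`, and `revert_fault` stand in for your monitoring stack and chaos tooling.

```python
def run_experiment(get_error_rate, inject_fault, revert_fault, tolerance=0.05):
    """Minimal chaos-experiment loop: measure the steady state, inject a
    fault, measure again, and check whether the hypothesis held."""
    # Step 1: capture the steady-state metric (here: error rate).
    baseline = get_error_rate()
    # Steps 2-3: hypothesize "no impact", then apply a real-world event.
    inject_fault()
    degraded = get_error_rate()  # observe the system under the fault
    revert_fault()
    # Step 4: compare the observed behavior against the hypothesis.
    hypothesis_held = abs(degraded - baseline) <= tolerance
    return baseline, degraded, hypothesis_held
```

In the service X example, `baseline` might be a 1% error rate and `degraded` 95%, clearly disproving the "no impact" hypothesis.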
Advanced Principles:
- Build a Hypothesis around Steady State Behavior
When creating the hypothesis, focus on measurable metrics: error rate, network latency, responsiveness, and so on.
- Vary Real-world Events.
Make the events as close to real life as possible, and do not “stick” to the same events over and over again. This principle encourages varying the unknown variables so they resemble real-life events.
- Run Experiments in Production.
Yes, production.
Not some dev environment, not a test, but a real-life production environment serving customers.
Why? You want to experiment in production because that is the “real” system. If you perform chaos experiments only in staging or integration, you cannot get a true picture of how the production system behaves. Of course, you should take this step only after having done enough rounds of chaos in dev/test environments and you are ready to move to the next phase.
- Automate Experiments to Run Continuously
You want to automate your experiments to run continuously or as part of the CD pipelines. This could mean every few hours, every day, every week, or every time some event happens in the system.
You should also run experiments every time you deploy a new release.
- Minimize Blast Radius
Keep the experiment’s blast radius small. Starting small and expanding as you gain confidence in the system is the best approach when conducting chaos experiments. Eventually, you should conduct experiments on the entire system.
Chaos Engineering Tools
Chaos engineering is becoming an increasingly popular approach to proactively identifying and addressing potential issues in complex systems, and the right tools can make all the difference in implementing this approach effectively.
Let’s take a look at some of the popular Chaos Engineering tools, commercial and open-source.
Chaos-Mesh
Platform: Kubernetes
License: Open-source
Chaos Mesh is an open-source Chaos Engineering tool designed to help users simulate and identify potential failures in distributed systems. Developed by the team at PingCAP, Chaos Mesh offers a comprehensive set of features that allow users to orchestrate and control chaos experiments across a wide range of system components, including network connections, disk I/O, and CPU performance.
Chaos Mesh offers a broad range of features that can address various faults that may arise in a distributed system.
Some of the features it offers are:
- Network partition: to simulate network failures and partition scenarios.
- Pod and container killing: to test the system’s resilience against unexpected pod and container failures.
- CPU and memory stress: to observe how the system behaves under resource pressure.
- Kernel chaos: to simulate kernel-level faults, such as kernel panics and I/O errors.
- DNS chaos: to simulate DNS failures and test the system’s fault tolerance.
- Filesystem chaos: to test how the system handles file system errors and data corruption.
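As an illustration, Chaos Mesh experiments are declared as Kubernetes custom resources. The sketch below is a hypothetical pod-kill experiment; the namespace and the `app: my-service` label are placeholders for your own deployment, not a definitive recipe.

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill-demo
  namespace: chaos-mesh
spec:
  action: pod-kill     # kill the selected pod(s)
  mode: one            # target a single pod picked from the selection
  selector:
    namespaces:
      - default
    labelSelectors:
      app: my-service  # placeholder label; adjust to your deployment
```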
More information on Chaos Mesh is available at the Chaos Mesh website (chaos-mesh.org).
Litmus Chaos
Platform: Kubernetes
License: Open-source, Apache 2.0
Litmus offers a wide range of fault libraries to test containers, hosts, and various platforms such as Amazon EC2, Apache Kafka, and Azure. The tool includes a user-friendly web interface called ChaosCenter, a publicly available repository of experiments called ChaosHub, and can be easily installed via the official Helm Chart. Although Litmus is open-source, the project is owned by Harness.
Feature Highlights (Part of Chaos Center):
- Chaos Scenario Creation: providing a rich feature set on the creation of scenarios, templating, scheduled tests, and more.
- Users and Teams: User creation and management, permissions, and authentication.
- Monitoring & Observability: Monitor, analyze, and visualize chaos scenarios with some analytics capabilities like queries and dashboards.
More information on Litmus Chaos is available at the LitmusChaos website (litmuschaos.io).
Chaos Toolkit
Platform: Docker, Kubernetes, Cloud Platforms, Bare-Metal
License: Open-source, Apache 2.0
Chaos Toolkit is an open-source, CLI-based Chaos Engineering tool that you can use to run experiments on various platforms and environments. It makes Chaos Engineering experiments easier by providing a framework for defining and executing experiments, as well as a set of pre-built plugins for interacting with different systems and services. Its plugin ecosystem includes drivers for platforms like AWS, Azure, GCP, and Kubernetes, application and network plugins (such as WireMock), and plugins for load testing, notifications, observability, and reliability.
Each experiment consists of Actions and Probes. Actions execute commands on the target system, and Probes observe the system and compare measurements against expected values.
Feature Highlights:
- Experiment Definition: Chaos Toolkit allows you to define experiments in a declarative way using YAML or JSON files. You can define the experiment steps, the target system, and the experiment configuration in these files. This makes it easy to share and version experiments, and automate their execution.
- Extensibility: Chaos Toolkit comes with a set of pre-built plugins for interacting with different systems and services, such as AWS, Kubernetes, and Docker. You can also create your own plugins to interact with custom systems or services. This makes it easy to extend the functionality of Chaos Toolkit and integrate it with your existing systems.
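A declarative experiment file of the kind described above might look like the following sketch. This is illustrative, not a definitive recipe: the health URL and the kubectl arguments are placeholders, and the probe’s tolerance of 200 means the HTTP call must return status 200 for the steady state to hold.

```json
{
  "title": "Service survives the loss of one pod",
  "description": "Hypothesis: the HTTP endpoint stays available while a pod is killed.",
  "steady-state-hypothesis": {
    "title": "The service responds with HTTP 200",
    "probes": [
      {
        "type": "probe",
        "name": "service-is-available",
        "tolerance": 200,
        "provider": {
          "type": "http",
          "url": "http://my-service.example.com/health"
        }
      }
    ]
  },
  "method": [
    {
      "type": "action",
      "name": "kill-one-pod",
      "provider": {
        "type": "process",
        "path": "kubectl",
        "arguments": "delete pod -l app=my-service --namespace default --wait=false"
      }
    }
  ]
}
```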
More information on Chaos Toolkit is available at https://chaostoolkit.org
Gremlin
Platform: Cloud, Bare Metal, Docker, Kubernetes
License: Commercial
Gremlin is a Software-as-a-Service tool that offers a rich set of features and capabilities to help you conduct Chaos Engineering experiments. With its wide range of possible failure scenarios – including CPU attacks, traffic loads, and other potential problems that large, complex digital infrastructures might experience – Gremlin enables teams to proactively identify and address weaknesses in their systems.
One of the key advantages of Gremlin is its strong automation support. It provides multiple options for automation, including CLI, UI, and API, so that teams can easily automate their attacks as part of their CI/CD pipelines.
Another benefit of Gremlin is its ability to automatically detect infrastructure components and recommend experiments to identify common failure modes. In addition, the platform can automatically cancel experiments if systems become unstable, helping to minimize the risk of damage or downtime.
A Few More (Notable) Chaos Engineering Tools to Check Out
- AWS Fault Injection Simulator
Stress Testing Tools – AWS Fault Injection Simulator – Amazon Web Services
- Steadybit
Steadybit – Chaos Engineering and Resilience platform
- KubeMonkey
asobti/kube-monkey: An implementation of Netflix’s Chaos Monkey for Kubernetes clusters (github.com)
- Pumba: chaos testing tool for Docker
alexei-led/pumba: Chaos testing, network emulation, and stress testing tool for containers (github.com)
- Toxiproxy
Shopify/toxiproxy: A TCP proxy to simulate network and system conditions for chaos and resiliency testing (github.com)
Sample test plans
A basic sample test plan for an experiment simulating a pod failure in a Kubernetes cluster.
- Objective: To test the overall application and cluster resilience by simulating a Pod failure.
- Scope: The test will focus on a specific Kubernetes deployment, on specific application service pods, and will involve intentionally killing or shutting down a pod of the deployment.
- Steps:
- Identify the deployment to be targeted for the test and the pod to be killed.
- Use Kubernetes CLI or API to identify the pod’s name, namespace, and label selector.
- Define a Chaos Experiment that will target the pod for deletion. For example, the experiment could use a Kubernetes Pod Killer to delete the targeted pod.
- Configure the Chaos Experiment to run during a period of low traffic or usage to minimize the impact of the failure.
- Run the Chaos Experiment and monitor the system to observe how it responds to pod failure.
- Once the experiment is complete, use Kubernetes CLI or API to confirm the pod has been deleted and the deployment is still running properly.
- Collect and analyze the results of the experiment to determine the resiliency of the Kubernetes cluster to this type of failure.
- Expected Outcome: The experiment should demonstrate the ability of the Kubernetes cluster to handle the failure of a single pod without causing a system-wide outage or degradation in performance.
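For the “collect and analyze the results” step, pass/fail can be decided by comparing the observed error rate against an acceptance threshold. A minimal sketch, assuming the acceptance criterion is “at most 1% HTTP 5xx responses during the experiment” (the threshold is an assumption, not part of the plan above):

```python
def analyze_results(status_codes, max_error_rate=0.01):
    """Compute the fraction of HTTP 5xx responses observed during the
    experiment and check it against the acceptance threshold."""
    if not status_codes:
        raise ValueError("no observations collected during the experiment")
    errors = sum(1 for code in status_codes if code >= 500)
    error_rate = errors / len(status_codes)
    return error_rate, error_rate <= max_error_rate
```

For instance, 98 successful responses and 2 server errors give a 2% error rate, which fails the assumed 1% threshold.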
Basic test plan sample – database connection loss
- Objective: To test the resiliency of an application by simulating a database connectivity loss.
- Scope: The test will focus on a specific application and will involve intentionally causing a loss of connectivity to the application’s database.
- Steps
- Identify the application to be targeted for the test and the database to be disconnected.
- Define a Chaos Experiment that will disconnect the database. For example, the experiment could use a tool like Gremlin to stop the database service or block traffic to the database.
Note: this depends on which service or tool you use to conduct the experiment; refer to the tool’s documentation on how to implement it.
- Configure the Chaos Experiment to run during a period of low traffic or usage to minimize the impact of the failure.
- Run the Chaos Experiment and monitor the system to observe how it responds to the loss of database connectivity.
- Once the experiment is complete, use database management tools to confirm the database is still running properly and data integrity has been maintained.
- Collect and analyze the results of the experiment to determine the resiliency of the application to this type of failure.
- Expected Outcome: The experiment should demonstrate the ability of the application to gracefully handle a loss of database connectivity without causing a system-wide outage or degradation in performance.
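On the application side, “gracefully handling” the simulated connectivity loss typically means retrying with backoff rather than failing immediately. A minimal sketch; the `connect` callable stands in for your actual database driver:

```python
import time

def connect_with_retry(connect, attempts=5, base_delay=0.1, sleep=time.sleep):
    """Try to (re)connect to the database, backing off exponentially
    between attempts; re-raise the last error if all attempts fail."""
    last_error = None
    for attempt in range(attempts):
        try:
            return connect()
        except ConnectionError as exc:
            last_error = exc
            sleep(base_delay * (2 ** attempt))  # 0.1s, 0.2s, 0.4s, ...
    raise last_error
```

A chaos experiment like the one above is exactly what reveals whether this kind of retry path exists and works.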
Detailed test plan – Authentication Service Outage & Database Network Latency
- Objective: To ensure system resilience and stability when the authentication service is down and there is network latency while accessing the database.
- Introduction
The purpose of this test plan is to outline the procedures and strategies for testing the system’s behavior under adverse conditions, specifically when the authentication service is down and there is network latency while accessing the database. This will help identify potential weak points, improve system performance, and ensure that the system can recover gracefully from these issues.
- Scope
The test will cover the following components:
- Authentication service
- Database access and performance
- User experience during the service outage and latency
- System alerts and monitoring
- Test Approach
The test will be conducted in a controlled environment that closely resembles the production environment. We will use chaos engineering tools and techniques to simulate the failure of the authentication service and induce network latency while accessing the database.
- Test Scenarios
- 4.1 Authentication Service Outage
- Simulate an authentication service outage by either stopping the service or blocking network access to it.
- Monitor system alerts and notifications for any service degradation.
- Test user login attempts during the outage and verify error messages and system behavior.
- Check if the system can recover gracefully once the authentication service is back online.
- Verify monitoring data, service logs, and user behavior.
- 4.2 Network Latency while Accessing the Database
- Use a network simulation tool or a chaos engineering tool to induce network latency while accessing the database.
- Monitor the performance of database queries and the system’s ability to handle the latency.
- Test user interactions that involve database access and verify response times and error handling.
- Check the system’s ability to recover from the latency issue once the network conditions are back to normal.
- 4.3 Combined Scenario – Authentication Service Outage and Database Network Latency
- Simulate both the authentication service outage and network latency while accessing the database concurrently.
- Monitor the system behavior, alerts, and notifications for any service degradation.
- Test user interactions during this combined scenario and verify error handling and response times.
- Check if the system can recover gracefully once the authentication service is back online and network latency is resolved.
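Outside of a dedicated network tool, the latency in scenarios 4.2 and 4.3 can be approximated at the client level by delaying each database call. A hypothetical sketch; the decorator and delay value are illustrative and not a substitute for real network-level fault injection:

```python
import functools
import time

def with_latency(delay_seconds, sleep=time.sleep):
    """Decorator that injects a fixed delay before the wrapped call,
    simulating network latency on the path to the database."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            sleep(delay_seconds)  # injected latency
            return func(*args, **kwargs)
        return wrapper
    return decorator
```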
- Test Metrics and Acceptance Criteria
- Response times for user interactions during the test scenarios
- Error rates and error handling during the test scenarios
- System recovery time after resolving the test scenarios
- Alert and notification accuracy and timeliness
Conclusion
Chaos Engineering is an essential approach to proactively identifying and addressing system vulnerabilities, ensuring robust performance and reliability in today’s complex and ever-evolving technology landscape. By embracing Chaos Engineering, organizations can uncover potential weak points, optimize system resilience, and maintain a positive user experience, even under adverse conditions.
The numerous benefits of adopting Chaos Engineering range from improved system stability to reduced downtime, ultimately leading to increased customer satisfaction and trust.
With a variety of popular tools available, implementing Chaos Engineering is now more accessible than ever. The sample test plans provided in this blog post offer a starting point and a basic reference of how test plans can be structured and executed.