With the rise of distributed cloud architectures, the web has grown increasingly complex, yet failures have become much harder to predict.
These failures cause costly outages that hurt both customers and a company's ability to get work done.
Waiting for the next costly outage is not an option. To meet this challenge head on, more and more companies are turning to Chaos Engineering.
What is chaos engineering?
Chaos engineering was introduced by Netflix, one of the largest media subscription services worldwide.
Chaos Engineering is designed to reveal weaknesses in our systems. A metaphor we often use is vaccination, where a potentially harmful agent is injected into the body for the purpose of preventing future infections. In Chaos Engineering, we inject failure into our systems to test their resilience, stability, and capability of surviving unstable and unexpected conditions.
Some of the types of failure that can be injected include: shutting down hosts or containers, adding CPU load or memory pressure, and adding network latency or packet loss. There are others as well, but this gives you an idea of what kinds of faults can be introduced.
Doing this manually is time-consuming, and without realizing it you may be unconsciously sparing resources that you know are application-critical. To make it a fair test, the process must be automated.
This kind of automated, random interference with production services is sometimes known as Chaos Monkey, a tool invented by Netflix to test the resilience of its IT infrastructure. It works by intentionally disabling computers in Netflix's production network to test how the remaining systems respond to the outage. Chaos Monkey is now part of a larger suite of tools called the Simian Army, designed to simulate and test responses to various system failures and edge cases.
Tools for chaos engineering with Kubernetes
There are several tools you can use to automate chaos engineering in your cluster. Here are a few options:
kube-monkey is an implementation of Netflix's Chaos Monkey for Kubernetes clusters. It randomly deletes Kubernetes pods in the cluster, encouraging and validating the development of failure-resilient services.
kube-monkey runs at a pre-configured hour on weekdays, and builds a schedule of deployments that will face a random Pod death sometime during the same day. It works on an opt-in model and will only schedule terminations for Kubernetes apps that have explicitly agreed to have their pods terminated by kube-monkey. Opting in is done by setting the following labels on a k8s app:
- kube-monkey/enabled: Opt this app in to having its pods terminated by kube-monkey.
- kube-monkey/mtbf: Mean time between failures (in days).
- kube-monkey/identifier: A unique identifier for the app. Since Pods inherit labels from their k8s app, kube-monkey uses this label to identify which pods belong to an enrolled app. So, if kube-monkey detects that app foo has enrolled to be a victim, it will look for all pods that have this label to determine which pods are candidates for killing.
- kube-monkey/kill-mode: The default behavior is for kube-monkey to kill only ONE pod of your app. You can override this behavior by setting the value to:
  - kill-all if you want kube-monkey to kill ALL of your pods regardless of status. Does not require kill-value.
  - fixed if you want to kill a specific number of running pods, given by kill-value.
  - random-max-percent to specify a maximum percentage, given by kill-value, that can be killed.
  - fixed-percent to specify a fixed percentage, given by kill-value, that can be killed.
- kube-monkey/kill-value: The value for kill-mode:
  - fixed: provide an integer number of pods to kill.
  - random-max-percent: provide a number from 0-100 to specify the max % of pods kube-monkey can kill.
  - fixed-percent: provide a number from 0-100 to specify the % of pods to kill.
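Putting these labels together, a minimal sketch of an opted-in Deployment might look like the following (the app name foo, the label values, and the nginx image are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: foo
  namespace: default
  labels:
    kube-monkey/enabled: enabled    # opt in to kube-monkey
    kube-monkey/identifier: foo     # unique identifier for this app
    kube-monkey/mtbf: "2"           # expect a kill roughly every 2 weekdays
    kube-monkey/kill-mode: "fixed"  # kill a fixed number of pods
    kube-monkey/kill-value: "1"     # ...namely, one pod
spec:
  replicas: 3
  selector:
    matchLabels:
      app: foo
  template:
    metadata:
      labels:
        app: foo
        # pods must carry the labels too, so kube-monkey can find them
        kube-monkey/enabled: enabled
        kube-monkey/identifier: foo
    spec:
      containers:
        - name: foo
          image: nginx   # illustrative workload
```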
For the installation, you have two choices:
Manually: First, create the ConfigMap in the namespace you intend to run kube-monkey in. Make sure to define the key name as config.toml. Next, run kube-monkey as a Kubernetes app within the Kubernetes cluster, in a namespace that has permissions to kill Pods in other namespaces. You should be able to see debug logs with: kubectl logs -f deployment/kube-monkey --namespace=kube-system
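A sketch of such a ConfigMap might look like this (the schedule values are illustrative; see the kube-monkey documentation for the full list of options):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-monkey-config-map
  namespace: kube-system
data:
  config.toml: |
    [kubemonkey]
    dry_run = true                 # terminations are only logged, not executed
    run_hour = 8                   # schedule terminations at 8am on weekdays
    start_hour = 10                # no terminations before 10am
    end_hour = 16                  # no terminations after 4pm
    blacklisted_namespaces = ["kube-system"]
    time_zone = "America/New_York"
```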
Here in my example, you can see the logs of my kube-monkey instance. The scheduling time is set to 04/11/2019 17:00, at which point kube-monkey generates a list of eligible k8s apps and determines whether a pod of each app should be killed today:
After the scheduling time, you can see the list of eligible apps to be killed, and when:
When the termination completes successfully, your logs will be updated:
During the termination window, you can run a load testing tool such as Locust to ensure that shutting down hosts and containers does not affect the high availability of your apps.
Chaoskube is another tool that periodically kills random pods in your Kubernetes cluster. To install it manually, you can use the following manifest as an inspiration.
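As a starting point, a minimal sketch based on chaoskube's upstream example could look like this (the image tag and flag values are illustrative; chaoskube also needs a ServiceAccount with RBAC permissions to list and delete pods):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: chaoskube
  labels:
    app: chaoskube
spec:
  replicas: 1
  selector:
    matchLabels:
      app: chaoskube
  template:
    metadata:
      labels:
        app: chaoskube
    spec:
      serviceAccountName: chaoskube
      containers:
        - name: chaoskube
          image: quay.io/linki/chaoskube:v0.21.0
          args:
            - --interval=4m              # attempt one kill every 4 minutes
            - --labels=chaoskube=true    # only target explicitly labeled pods
            - --namespaces=!kube-system  # never touch kube-system
            - --no-dry-run               # actually kill pods
```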
By default, chaoskube is friendly and won't kill anything. Once you have validated your target cluster, you may disable dry-run mode (--no-dry-run).
As you can see in the above manifest, I set --no-dry-run, so it will really kill target pods that have the label chaoskube=true and are not in the kube-system namespace (--namespaces=!kube-system). You can also exclude specific days or times. You can find all accepted flags here.
Unlike kube-monkey, which can kill more than one pod per attempt, chaoskube kills only one pod per attempt; to increase the amount of chaos, you can increase the number of replicas of your chaoskube deployment.
In this example, I show the flags included in my chaoskube instance: I set the interval to 4 minutes, so every 4 minutes chaoskube will kill a target pod that does not have the label app=chaoskube and does not belong to the kube-system namespace.
After executing the command kubectl logs deploy/chaoskube -f, you can see the logs of my chaoskube instance after it kills some target pods:
PowerfulSeal adds chaos to your Kubernetes clusters so that you can detect problems in your systems as early as possible. It kills targeted pods and takes VMs up and down. I think it is the most faithful implementation of Netflix's Chaos Monkey for Kubernetes clusters.
PowerfulSeal works in several modes:
- Interactive mode allows you to discover your cluster's components and manually break things to see what happens.
- Autonomous mode reads a policy file, which can contain any number of pod and node scenarios, and executes them in a loop.
- Label mode allows you to specify which pods to kill, with a small number of options, by adding seal labels to pods.
- Demo mode allows you to point the Seal at a cluster and a metrics-server and let it try to figure out what to kill based on resource utilization.
Running inside of the cluster:
The setup involves:
- Creating RBAC rules to allow the Seal to list, get, and delete pods.
- Creating a PowerfulSeal ConfigMap and Deployment; your scenarios will live in the ConfigMap.
- If you'd like to use the UI, you'll probably also need a Service and Ingress.
The Seal will self-discover the way to connect to Kubernetes and start executing your policy.
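The RBAC part of that setup can be sketched as follows (the names and namespace are illustrative; the actual manifests ship with the PowerfulSeal repository):

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: powerfulseal
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: powerfulseal
rules:
  # the Seal needs to enumerate and delete pods across namespaces
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["list", "get", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: powerfulseal
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: powerfulseal
subjects:
  - kind: ServiceAccount
    name: powerfulseal
    namespace: default
```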
Running outside of the cluster:
If you're running outside of your cluster, the setup will involve:
- Pointing PowerfulSeal at your Kubernetes cluster by giving it a Kubernetes config file
- Pointing PowerfulSeal at your cloud by specifying the cloud driver to use and providing credentials (for example --gcp --gcp-config-file /path-to-config)
- Making sure the Seal can SSH into the nodes in order to execute the docker kill command
- Writing a set of policies
The setup should look something like this:
So an example command will look like this:
powerfulseal interactive --kubeconfig ~/.kube/config --gcp --inventory-kubernetes --ssh-allow-missing-host-keys --ssh-path-to-private-key ~/.ssh/google_compute_engine
In this example, I show the pod and node scenarios in my PowerfulSeal instance. The pod scenario will kill pods that are in the default namespace and have the label app=hello; the node scenario will stop the node with the indicated name, wait 30 seconds, and then restart it:
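A policy file for that example might be sketched roughly like this, following the podScenarios/nodeScenarios layout of PowerfulSeal's autonomous mode (the exact schema has changed between releases, and the node name here is illustrative, so check the documentation for your version):

```yaml
config:
  loopsNumber: 1              # run the scenarios once
podScenarios:
  - name: "kill hello pods"
    match:
      - labels:
          namespace: default
          selector: app=hello # target pods labeled app=hello
    filters:
      - randomSample:
          size: 1             # pick one matching pod at random
    actions:
      - kill:
          probability: 1      # always kill the selected pod
          force: true
nodeScenarios:
  - name: "restart a node"
    match:
      - property:
          name: "name"
          value: "my-node-1"  # illustrative node name
    actions:
      - stop: {}              # stop the VM backing the node
      - wait:
          seconds: 30         # wait 30 seconds
      - start: {}             # then start it again
```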
When PowerfulSeal starts running scenarios, your logs will be updated:
Compared to other tools, PowerfulSeal has the advantage of a web user interface, so you can use it to describe scenarios instead of using a YAML file:
You can combine PowerfulSeal with other chaos engineering tools to extend its functionality. Below are some of the applications that can be used alongside PowerfulSeal to broaden the test:
- Goldpinger : A debugging tool for Kubernetes which tests and displays connectivity between nodes in the cluster.
- Locust : A distributed load testing tool which enables users to run load tests on distributed deployments. Locust supports a distributed mode (one master and multiple slave nodes).
Building the most effective system requires a lot of experience and knowledge. Chaos engineering allows us to run precise scenarios that could happen at any time while a product or service is in use. Everyone will know what to look for in the future, and which systems might be vulnerable.
So chaos proves the quote :
Failure is a success if we learn from it.
If you have any feedback please leave a message in the comment section or feel free to get in touch with me 👍