Chaos engineering best practice

Question

I studied the Principles of Chaos and looked for open-source projects, such as ChaosBlade, which is open-sourced by Alibaba, and Mangle, by VMware.

Both are fault injection tools; they do nothing to analyze the system under test.

According to the Principles of Chaos, we should:

1. Start by defining ‘steady state’ as some measurable output of a system that indicates normal behavior.

2. Hypothesize that this steady state will continue in both the control group and the experimental group.

3. Introduce variables that reflect real world events like servers that crash, hard drives that malfunction, network connections that are severed, etc.

4. Try to disprove the hypothesis by looking for a difference in steady state between the control group and the experimental group.

So how do we do step 4? Should we use a monitoring system to watch some major metrics and check the status of the system after fault injection?

Are there any good suggestions or best practices?

score 2 · Answer 1 · answered Mar 19 '20 at 05:45

> So how do we do step 4? Should we use a monitoring system to watch some major metrics and check the status of the system after fault injection?

As always, the answer is: it depends. It depends on how you want to measure your hypothesis, on the hypothesis itself, and on the system. But it normally makes sense to introduce metrics to improve observability.

If your hypothesis is something like "Our service can process 120 requests per second, even if one node fails", then you could verify it with metrics, but you could also measure it from the requests you send and the responses you receive. It is up to you.
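As a minimal sketch of the second option (measuring from the client side), the snippet below counts successful responses per second after the fault and compares the rate with the 120 requests per second from the example hypothesis. The `Response` record, the function names, and the synthetic data are assumptions made for illustration; a real load generator would supply the timestamps and outcomes.

```python
from dataclasses import dataclass


@dataclass
class Response:
    sent_at: float  # seconds since the experiment started (hypothetical field)
    ok: bool        # True if the request received a successful response


def throughput_per_second(responses, window_start, window_end):
    """Successful responses per second for requests sent in [window_start, window_end)."""
    duration = window_end - window_start
    if duration <= 0:
        raise ValueError("window must have a positive duration")
    ok_count = sum(
        1 for r in responses if r.ok and window_start <= r.sent_at < window_end
    )
    return ok_count / duration


def hypothesis_holds(responses, fault_at, window_end, min_rps=120.0):
    """Check the hypothesis for the period after the fault was injected."""
    return throughput_per_second(responses, fault_at, window_end) >= min_rps


# Synthetic data: 150 requests per second for 10 s; one node "fails" at t = 5 s,
# after which every tenth request is lost.
FAULT_AT = 5.0
responses = [
    Response(sent_at=i / 150, ok=not (i / 150 >= FAULT_AT and i % 10 == 0))
    for i in range(1500)
]

print(hypothesis_holds(responses, FAULT_AT, 10.0))
```

The same function can be applied to the window before the fault to get a baseline, which is one way to approximate the control group versus experimental group comparison on a single environment.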

But if your hypothesis is "I get a response for a request that was sent before a node went down", then it makes more sense to verify this directly with the requests and responses.

In our project we use, for example, chaostoolkit, which lets you specify the hypothesis in JSON or YAML together with the related actions to test it.

So you can say: I have a steady state X, and if I do Y, then steady state X should still hold. The toolkit is also able to verify metrics if you want.
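To give a rough idea of what such an experiment file can look like, here is a sketch that builds one as a Python dictionary and writes it out as JSON. The top-level keys (`steady-state-hypothesis`, `method`, `rollbacks`) and the `http` and `process` providers follow the chaostoolkit experiment format as I understand it; the health URL, the container name, and the use of `docker stop` as the fault are purely hypothetical placeholders.

```python
import json


def build_experiment(health_url, container):
    """Build a chaostoolkit-style experiment: steady state X, action Y, rollback."""
    return {
        "version": "1.0.0",
        "title": "Service keeps responding when one node is stopped",
        "description": "Steady state X should still hold after action Y.",
        # Steady state X: the health endpoint answers with HTTP 200.
        "steady-state-hypothesis": {
            "title": "Service is healthy",
            "probes": [
                {
                    "type": "probe",
                    "name": "health-endpoint-responds",
                    "tolerance": 200,
                    "provider": {"type": "http", "url": health_url, "timeout": 3},
                }
            ],
        },
        # Action Y: stop one node (hypothetical container name).
        "method": [
            {
                "type": "action",
                "name": "stop-one-node",
                "provider": {
                    "type": "process",
                    "path": "docker",
                    "arguments": f"stop {container}",
                },
                "pauses": {"after": 5},
            }
        ],
        # Bring the node back once the experiment is over.
        "rollbacks": [
            {
                "type": "action",
                "name": "start-node-again",
                "provider": {
                    "type": "process",
                    "path": "docker",
                    "arguments": f"start {container}",
                },
            }
        ],
    }


experiment = build_experiment("http://localhost:8080/health", "node-2")
experiment_json = json.dumps(experiment, indent=2)
print(experiment_json)
```

Saved to a file such as `experiment.json`, this would be run with `chaos run experiment.json`; the toolkit checks the hypothesis before and after the method.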

I think if the steady state is simple, for example a single metric, it can be obtained by calling a RESTful interface. But if you want to see the effect of the fault injection, I think it's better to use a monitoring system to see which metric(s) change during the process. You might use techniques such as anomaly detection from the AIOps domain. – NingLee, Mar 24 '20 at 08:26
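One very simple form of the anomaly detection mentioned in this comment is a z-score comparison: for every metric, compare its mean during the fault with its baseline mean, measured in baseline standard deviations. The sketch below assumes the monitoring system can export samples as plain lists per metric name; the metric names, the sample values, and the threshold of 3 are illustrative assumptions, and production AIOps tooling would normally use more robust methods.

```python
from statistics import mean, stdev


def changed_metrics(baseline, during_fault, threshold=3.0):
    """Return {metric: z_score} for metrics that shifted during the fault.

    baseline and during_fault map a metric name to a list of samples.
    A metric is flagged when its mean during the fault is more than
    `threshold` baseline standard deviations away from the baseline mean.
    """
    flagged = {}
    for name, base in baseline.items():
        samples = during_fault.get(name)
        if not samples or len(base) < 2:
            continue  # not enough data to judge this metric
        sigma = stdev(base)
        shift = abs(mean(samples) - mean(base))
        if sigma == 0:
            # Constant baseline: any shift at all counts as a change.
            if shift > 0:
                flagged[name] = float("inf")
            continue
        z_score = shift / sigma
        if z_score > threshold:
            flagged[name] = z_score
    return flagged


baseline = {
    "latency_ms": [100, 102, 98, 101, 99],
    "cpu_load": [0.50, 0.52, 0.48, 0.51, 0.49],
}
during_fault = {
    "latency_ms": [180, 175, 190],
    "cpu_load": [0.50, 0.51, 0.49],
}

print(sorted(changed_metrics(baseline, during_fault)))
```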

Evgeny · Answer 2 · 2020-08-29T15:28:19.123

The Principles of Chaos sit a bit above actual testing. They reflect the philosophy of designed vs. actual system and system under injection vs. baseline, but they are too abstract to apply in everyday testing: they are a way of reasoning, not a work process methodology.

I think the control group vs. experimental group wording is one especially doubtful part: you stage a test (injection) in a controlled environment and try to catch whether there is a user-facing incident, an SLA breach of any kind, or a degradation. I do not see where the control group is if you test on a test stand or in a dedicated environment.

We use a very linear variety of chaos methodology which is:

1. Find failure points in the system (based on architecture, critical user scenarios, and history of incidents).
2. Design chaos test scenarios (each may be a single attack or a more elaborate sequence).
3. Run the tests, register the results, and reuse the green tests for new releases.
4. Start tasks to fix the red tests, and verify the solutions when they are available.
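The run-and-register part of the process above could be sketched roughly as follows. This is only an illustration under assumed names: `Scenario`, `run_suite`, and the three callables are hypothetical, and in practice the injection would call a fault injection tool and the check would query monitoring or replay a user scenario.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Scenario:
    name: str
    inject: Callable[[], None]   # start the attack
    check: Callable[[], bool]    # True if no user-facing degradation was seen
    restore: Callable[[], None]  # undo the attack


def run_suite(scenarios):
    """Run every scenario and register the result as 'green' or 'red'."""
    results = {}
    for scenario in scenarios:
        scenario.inject()
        try:
            results[scenario.name] = "green" if scenario.check() else "red"
        finally:
            # Always restore the system, even if the check raised an error.
            scenario.restore()
    return results


def red_tests(results):
    """Names of scenarios that need a fix task."""
    return [name for name, status in results.items() if status == "red"]
```

Green scenarios stay in the suite and are rerun for each new release; the names returned by `red_tests` are the ones that would get a fix task.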

One may say we are actually using the Principles of Chaos in steps 1 and 2, but we tend to think of chaos testing as a quite linear and simple process.

score 0 · Answer 3 · answered Mar 11 '21 at 09:35

Mangle 3.0 was released with an option for analysis using a resiliency score. Detailed documentation is available at https://github.com/vmware/mangle/blob/master/docs/sre-developers-and-users/resiliency-score.md

Hemanth Kumar Kilari

A link to a solution is welcome, but please ensure your answer is useful without it: [add context around the link](//meta.stackexchange.com/a/8259) so your fellow users will have some idea what it is and why it’s there, then quote the most relevant part of the page you're linking to in case the target page is unavailable. [Answers that are little more than a link may be deleted.](/help/deleted-answers) – STA Mar 11 '21 at 09:36
