Security Chaos Engineering 101: The Mind Map & Feedback Loop
This is the third article in the Security Chaos Engineering 101 series. The first article laid the foundation for considering Security Chaos Engineering (SCE) as a practicable security engineering discipline. The second article provided technical details for implementing SCE experiments based on an AWS S3 public bucket use case. This third instalment discusses two important SCE concepts: the SCE mind map and the SCE feedback loop. The SCE mind map succinctly presents key fundamental SCE information, colloquially known as the what, why, who, how and where of SCE. The SCE feedback loop, in turn, leverages the power of systems thinking to enable an objective approach to understanding and addressing cyber security and resilience challenges.
The SCE Mind Map
The Security Chaos Engineering (SCE) mind map emerged from discussions with several people about SCE. Since SCE is relatively new and still evolving, understanding of it varies widely and several misconceptions exist. Some of these misconceptions were discussed in the last article. The SCE mind map succinctly presents foundational knowledge, which is explained in the next sections. You can access the Miro board used to construct this mind map on this link.
What Is Security Chaos Engineering
Security Chaos Engineering (SCE) applies chaos engineering principles to cyber security. It is a novel cyber security approach that allows for continuous, tangible and empirical evaluation of a system's security state. SCE's core value proposition is the discovery of security blind spots, which helps defenders avoid a ‘false sense of security’. SCE allows tangible evidence to be collected via security experimentation across various cloud-native abstraction layers.
Why Should Security Chaos Engineering Be Practiced
The intent behind conducting an SCE experiment is to learn from failures and to be proactive. You want to gather evidence about specific assumptions before drawing conclusions. Security decisions should not hinge on gut feelings or vendor promises but on experimental evidence. There is room for knowledge gained from experience; however, this has to be balanced with empirical analysis. This approach builds confidence through a deep, practical understanding of a system's behaviour under unfavourable conditions. Ultimately, doing this repeatedly will enable defenders to move from cyber security to cyber resilience, which is a huge achievement and the goal of most cloud security efforts.
Where Should Security Chaos Engineering Be Practiced
Flexibility is paramount when deciding where to run SCE experiments. It's important to understand your maturity level around SCE and to practise in a manner that does not produce unwanted results. It is safest to start running experiments in non-production environments and to move to production only later, as maturity grows. If the two environments were mapped one-to-one, experimenting in non-production alone would suffice. However, it is difficult to keep non-production environments identical to production environments, and this might reduce the gains of running SCE, since much of the value lies in applying the lessons gained during experiments to production environments.
Who Should Practice Security Chaos Engineering
SCE is beneficial across several technical and non-technical domains. Due to the overlapping nature of modern organisational setups and the need to remove unnecessary silos, organisations adopting SCE should aim to involve several security teams in the SCE processes. These processes are designed around the SCE feedback loop, which will be discussed in the following sections. Incident response teams, detection engineers, (cloud) security engineers and others can leverage SCE for several tasks, including compliance/security control validation, runbook verification, and detection algorithm implementation/improvement. It is also important to adopt the right mindset and the right intent; more about these can be found in the previous blog post.
How Should Security Chaos Engineering Be Practiced
The how question addresses the misconception that SCE experiments are necessarily designed to collapse a system's security. This notion is not only wrong but completely defeats the aim of SCE. A learning-from-failure mindset is imperative to make the best of SCE. Hence, the aim should be to run small experiments, each with a clearly defined aim and objective. The gains are in the observations, which are often overlooked by defenders but intentionally sought after by attackers. Team maturity should be the yardstick for running more complex experiments, and these should graduate to gamedays. We recently organised a webinar on planning for security gamedays; the recording is available on YouTube.
Security Chaos Engineering Feedback Loop
Feedback loops are a core aspect of systems thinking. They provide powerful ways of understanding a system, quickly detecting its limitations and advantages and, based on these, making objective improvements. Feedback loops are very important for improving the security of our systems. However, most systems either lack a feedback loop or take a very long time to complete one. Long feedback loops are ineffective in cyber security, since the feedback might arrive when a system is already compromised, making it far less useful. SCE mitigates this by allowing defenders to iterate quickly between security mechanisms and make fact-based decisions. This is valuable, as it helps prevent attacks and allows defenders to move, in practice, from security to cyber-resilience. The SCE feedback loop has five stages, which are explained below:
Plan
SCE is an iterative process; hence, planning for the initial and subsequent experiments is important. Plans should ideally be based on the outcome of previous experiments and could include the experiment's aim, hypotheses, and scope. These could be derived from the backlogs of several security teams, including vulnerability management, SecOps, SOC and DevSecOps. Risks identified via other security activities, e.g. threat modelling and tabletop exercises, could also be considered. In general, several factors based on the principles of chaos engineering have to be considered, such as how to limit the blast radius.
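The plan elements named above (aim, hypothesis, scope and blast radius) can be captured in a small, reviewable record. The following is a minimal Python sketch rather than a prescribed format: the class name `ExperimentPlan`, its fields, the validation rules and the example values are all hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ExperimentPlan:
    """Hypothetical record of an SCE experiment plan (illustrative only)."""
    aim: str
    hypothesis: str
    scope: list[str]                 # resources the experiment may touch
    max_blast_radius: int = 1        # upper bound on resources affected
    source: str = "backlog"          # e.g. backlog item, threat model, tabletop
    stop_conditions: list[str] = field(default_factory=list)

    def validate(self) -> list[str]:
        """Return a list of problems; an empty list means the plan is usable."""
        problems = []
        if not self.hypothesis.strip():
            problems.append("hypothesis is empty")
        if not self.scope:
            problems.append("scope is empty")
        if len(self.scope) > self.max_blast_radius:
            problems.append("scope exceeds the agreed blast radius")
        if not self.stop_conditions:
            problems.append("no stop conditions defined")
        return problems


plan = ExperimentPlan(
    aim="Verify detection of a publicly exposed storage bucket",
    hypothesis="A public-access change is detected and alerted on within 5 minutes",
    scope=["test-bucket-1"],
    stop_conditions=["production traffic affected"],
)
print(plan.validate())  # prints [] because no problems were found
```

Keeping the plan as data makes it easy to review before execution and to compare against the outcome of the next iteration of the loop.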
Execute
The plans described in the previous section are to be executed as closely as possible to the schedule. The experiment's aim, hypothesis and scope guide what is executed. One important factor to consider is the safety measures, including stopping an experiment prematurely. This is normal and should be part of the plan. It is equally necessary to have a plan for recovering the test systems to their steady state. Given the above, it should be obvious how crucial effective communication strategies are. All involved teams must be kept in the loop using pre-arranged communication methods, such as Slack and other communication/collaboration systems.
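The safety measures above (premature stopping, recovery to the steady state and keeping teams informed) can be sketched as a wrapper around the experiment. This is a minimal illustration under stated assumptions: `controlled_experiment`, `ExperimentAborted` and the in-memory `state` dictionary are hypothetical stand-ins, and `notify` would in practice post to whichever communication channel the teams agreed on.

```python
from contextlib import contextmanager


class ExperimentAborted(Exception):
    """Raised when a safety check requests a premature stop."""


@contextmanager
def controlled_experiment(inject, rollback, notify=print):
    """Apply an injected change and always restore the steady state afterwards."""
    notify("experiment starting")
    try:
        inject()
        yield
    except ExperimentAborted as reason:
        # A premature stop is an expected outcome, not an error.
        notify(f"experiment stopped early: {reason}")
    finally:
        rollback()
        notify("steady state restored")


# Hypothetical in-memory stand-in for a real test system.
state = {"bucket_public": False}
messages = []


def make_public():
    state["bucket_public"] = True


def restore():
    state["bucket_public"] = False


with controlled_experiment(make_public, restore, notify=messages.append):
    # A safety check trips, so the experiment is stopped prematurely.
    raise ExperimentAborted("unexpected impact outside the agreed scope")

print(state)
print(messages)
```

Placing the rollback in a `finally` clause means the steady state is restored whether the experiment completes, is stopped early, or fails unexpectedly.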
Monitor
High-quality monitoring approaches can be leveraged to track the progress of the experiment and to detect its outcomes. Most teams will already have various approaches for achieving this, including logs, observability tools and tracing systems. These approaches should not be limited to typical security tooling, as several other options could be leveraged for deeper insights. Failure/success metrics could be predetermined to allow objectivity. Similarly, signals that give early warning of anomalies requiring the termination of an experiment should be carefully monitored.
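Predetermined success/failure metrics and early-warning signals can be expressed as thresholds agreed before the experiment starts. The sketch below is illustrative only: the `Thresholds` fields, the sample keys (`elapsed_seconds`, `error_rate`, `detected_after_seconds`) and the numbers are assumptions, and real samples would come from the team's existing logs or observability tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Hypothetical, pre-agreed limits for one experiment."""
    max_detection_seconds: float   # hypothesis holds if detection is at least this fast
    max_error_rate: float          # abort if the observed error rate exceeds this


def evaluate(sample: dict, limits: Thresholds) -> str:
    """Classify one monitoring sample as 'abort', 'pass', 'fail' or 'pending'."""
    if sample.get("error_rate", 0.0) > limits.max_error_rate:
        return "abort"                       # early-warning signal: stop the experiment
    detected_after = sample.get("detected_after_seconds")
    if detected_after is None:
        if sample["elapsed_seconds"] > limits.max_detection_seconds:
            return "fail"                    # no detection within the agreed window
        return "pending"
    return "pass" if detected_after <= limits.max_detection_seconds else "fail"


limits = Thresholds(max_detection_seconds=300, max_error_rate=0.05)
samples = [
    {"elapsed_seconds": 60, "error_rate": 0.01},
    {"elapsed_seconds": 120, "error_rate": 0.01, "detected_after_seconds": 95},
    {"elapsed_seconds": 30, "error_rate": 0.20},
]
results = [evaluate(s, limits) for s in samples]
print(results)  # prints ['pending', 'pass', 'abort']
```

Checking the abort condition first reflects the priority given to safety: an anomaly ends the experiment regardless of whether the hypothesis would otherwise have been confirmed.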
Analyse
Experimental outcomes are to be analysed to determine success or failure. However, the outcomes are not binary, and a lot more can be derived from them, such as a deeper understanding of how security controls perform in the face of adversity, how teams can coordinate successfully, communication gaps and the limitations of processes. The following questions are also interesting during the analysis: Was your hypothesis proven? Did you notice any unexpected behaviour or security events? Did the experiment disrupt any operational or business activities? Did you derive new knowledge, e.g. better detection techniques? Did the experiment fail or succeed? If it failed, what was the cause? How can the failure be prevented in the future?
A key aspect of the analysis is to determine the effect of the experimental outcome on organisational risk and cyber resilience. Analysing this might require existing tools, such as the OWASP Risk Rating Methodology, which considers several factors, including threat agents, attack vectors, security weaknesses, security controls, and technical and business impacts.
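As an illustration of how an experimental finding could be rated, the sketch below follows the general shape of the OWASP Risk Rating Methodology: factors are scored from 0 to 9, averaged into a likelihood and an impact score, mapped to LOW/MEDIUM/HIGH levels, and combined through a severity matrix. It is a simplification, not a full implementation: the function names are hypothetical, a single list of impact factors is used, and the factor values are invented for the example.

```python
from statistics import mean


def level(score: float) -> str:
    """Map a 0-9 score to a level: below 3 is LOW, below 6 is MEDIUM, else HIGH."""
    if score < 3:
        return "LOW"
    if score < 6:
        return "MEDIUM"
    return "HIGH"


# Overall severity matrix, keyed by (likelihood level, impact level).
SEVERITY = {
    ("LOW", "LOW"): "Note",
    ("MEDIUM", "LOW"): "Low",
    ("HIGH", "LOW"): "Medium",
    ("LOW", "MEDIUM"): "Low",
    ("MEDIUM", "MEDIUM"): "Medium",
    ("HIGH", "MEDIUM"): "High",
    ("LOW", "HIGH"): "Medium",
    ("MEDIUM", "HIGH"): "High",
    ("HIGH", "HIGH"): "Critical",
}


def risk_severity(likelihood_factors, impact_factors):
    """Return (likelihood score, impact score, overall severity)."""
    likelihood = mean(likelihood_factors)
    impact = mean(impact_factors)
    return likelihood, impact, SEVERITY[(level(likelihood), level(impact))]


# Invented scores for a finding from an experiment (0-9 per factor).
likelihood_factors = [5, 4, 7, 6, 3, 5, 4, 8]   # threat agent and vulnerability factors
impact_factors = [6, 7, 5, 7]                   # impact factors

likelihood, impact, severity = risk_severity(likelihood_factors, impact_factors)
print(likelihood, impact, severity)  # prints 5.25 6.25 High
```

Rating findings the same way after every experiment makes it easier to compare outcomes across iterations of the feedback loop.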
Acquire Insights
Security experimentation yields high-quality insights into the true state of your cyber security posture. These insights could be leveraged in several ways, for example to support better decision-making. Hence, it makes sense to retain the knowledge gained in an insight knowledge base. Possible use cases for the acquired insights include security automation, e.g. AWS Lambda-driven security automation. The insights can also feed into other security activities, including tabletop and threat modelling exercises, and could be leveraged to train machine learning models that enhance threat detection systems.
Cloud Cyber Resilience With the Mitigant SCE Platform
The Mitigant SCE platform aims to make cyber-resiliency a first-class citizen in cloud-native infrastructure. The platform consists of several attack actions and scenarios mapped to the MITRE ATT&CK framework, thus mirroring real-world attacks. Mitigant SCE is suitable for companies of all sizes and allows quick and safe adoption of SCE without the associated cost and resource overhead. The cost of implementing an SCE strategy could be daunting for most enterprises; Mitigant addresses this challenge by providing a SaaS platform.
The attacks are pluggable and can be leveraged as building blocks for constructing complex attack scenarios against AWS infrastructure. The platform enables safe and controlled SCE experiments: attacks can be started and stopped with button clicks, and all changes made to the cloud infrastructure are rolled back and restored seamlessly. Additionally, all attacks are documented, including the constructed hypotheses and observations.
Sign up today for a free trial of the Mitigant SCE platform to help build cyber-resiliency for cloud infrastructure at https://mitigant.io/sign-up.