Demystifying Security Chaos Engineering - Part I

22.10.2022
Kennedy Torkura
Contributors
Kennedy Torkura
Co-Founder & CTO

We are witnessing an upsurge of high-profile attacks in recent times, and the attacks that impacted prominent companies are the most appalling! One of the most chilling facts about these attacks is the success rate against security controls considered robust, e.g., Multi-Factor Authentication. Clearly, cybercriminals are outpacing modern cybersecurity mechanisms, and novel approaches are imperative to address these concerns.

Consequently, two schools of thought have emerged: one believes the industry needs to evolve more security approaches, and the other argues for resilient cyber mechanisms. This blog post supports the latter, given the potential of leveraging Security Chaos Engineering to enable cyber resilience.

Note: This is the first part of a two-part series. Please subscribe to be informed when Part II is published. Also, this blog is based on a talk presented at the Cyber Security & Cloud Expo (Europe) 2022.

Chaos Engineering - The Origins

The origins of chaos engineering can be traced to Netflix's migration from an on-premises data center to cloud infrastructure on Amazon Web Services (AWS). During Netflix's early days in the cloud, workloads were primarily deployed on EC2 instances (the state of the art in cloud computing back then). Strangely though, EC2 instances would shut down without warning. As you can imagine, the impact was unacceptable, as this behaviour introduced serious availability issues. Netflix customers could be cut off for a while and probably reconnected afterwards, and the immediate effect was a bad customer experience.

Netflix Chaos Monkey

Unfortunately for Netflix, AWS had no solution to these failures; hence innovating a solution was imperative! Enter chaos engineering: the basic idea was to evolve systems that could tolerate the menace of unpredictably dying EC2 instances. Consequently, Netflix implemented Chaos Monkey, which automatically and intentionally injects availability failures. The main job of Chaos Monkey was to kill EC2 instances and other services at random. In effect, it deliberately caused the very failures that had previously occurred unpredictably, so that teams were forced to build systems that could withstand them.
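To make the idea concrete, here is a minimal sketch of the Chaos Monkey principle. It is not Netflix's implementation and makes no AWS calls: the `Instance` class, `unleash_monkey`, and `service_is_available` are hypothetical names that only simulate a fleet in memory, terminate one running instance at random, and then check an assumed steady-state condition.

```python
import random
from dataclasses import dataclass


@dataclass
class Instance:
    """Stand-in for an EC2 instance; purely in-memory, no AWS calls."""
    instance_id: str
    running: bool = True


def unleash_monkey(fleet, rng, kill_probability=1.0):
    """Terminate one randomly chosen running instance.

    Returns the terminated instance, or None if nothing was terminated.
    """
    candidates = [inst for inst in fleet if inst.running]
    if not candidates or rng.random() >= kill_probability:
        return None
    victim = rng.choice(candidates)
    victim.running = False  # simulated termination
    return victim


def service_is_available(fleet, min_healthy=1):
    """Assumed steady-state check: enough instances are still running."""
    return sum(inst.running for inst in fleet) >= min_healthy


# A three-instance fleet that must keep at least two instances healthy.
fleet = [Instance(f"i-{n:04d}") for n in range(3)]
rng = random.Random(42)  # seeded so the experiment is repeatable

victim = unleash_monkey(fleet, rng)
print(f"terminated: {victim.instance_id}")
print(f"still available: {service_is_available(fleet, min_healthy=2)}")
```

The interesting part is not the termination itself but the check that follows it: a system that passes the steady-state check after a random kill has demonstrated, rather than assumed, its tolerance to that failure.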

Figure 1. Netflix Simian Army

Crazy as it sounds, Chaos Monkey performed remarkably well; the engineering teams evolved by implementing systems that survived dying EC2 instances. Due to the success of Chaos Monkey, Netflix developed other tools based on the Principles of Chaos Engineering. This set of tools became known as the Netflix Simian Army. By continually deploying these tools and adopting a learning-from-failure mindset, Netflix has survived outages that took out entire AWS regions, e.g., the US-East-1 outage.

Growing Adoption of Chaos Engineering

The success of Netflix's Simian Army popularized chaos engineering and has encouraged its adoption. Today, several open-source projects and commercial products offer relatively easy-to-use chaos engineering capabilities. Similarly, most cloud service providers provide chaos engineering services: AWS Fault Injection Simulator, AWS Resilience Hub, and Azure Chaos Studio. These tools and services focus on leveraging chaos engineering to prevent availability failures. Unfortunately, the security industry is yet to jump on this bandwagon despite the unique benefits of applying chaos engineering principles to cyber security.

Figure 2. An Example of Running Chaos Engineering Experiments using the AWS Fault Injection Simulator (Source: AWS FIS Blog Post)

Cloud Native Security Landscape

Primarily, Chaos Monkey enabled availability resilience for Netflix, i.e., their infrastructure became resilient against availability failures. Interestingly, availability is one of the three critical attributes of a secure system, which together are known as the CIA triad. The other two attributes are confidentiality and integrity. Essentially, every security control aims to prevent violations of one or more attributes of the CIA triad. However, current cloud-native security mechanisms struggle to achieve this aim, and the reasons are not far-fetched. Here are some of our thoughts on why attacks are still successful regardless of the evolving cloud-native security mechanisms:

Complexity: The Enemy of Security

Cloud-native infrastructure enables several advantages, including scalability, elasticity, and (perceived) cost savings; however, complexity is inherited alongside these advantages. The complexity results from the multiple abstraction layers that underpin cloud-native infrastructure. Bruce Schneier asserted that "complexity is the greatest enemy of security," and that is precisely what we are experiencing: the effect of complex systems on security objectives. Complex systems are hard to understand, especially through the lens of cyber security, and the efficiency of any security architecture depends on how deeply defenders understand the system. Furthermore, better insights into the workings of any system ultimately facilitate creative tooling support and innovative deployment of security controls when standard approaches are limited.

Dynamic Security Posture

Cloud infrastructure allows for agility, empowering teams to continuously deploy infrastructure to meet market demands while gaining an advantage over competitors. This directly increases productivity and paves the path for practising modern techniques, e.g., DevOps and GitOps. However, each cloud infrastructure change potentially introduces security issues, e.g., misconfiguration. Hence, these changes make it harder to maintain a consistent security posture, and this is a challenge! CISOs and other security leaders want an informed view of their infrastructure's security posture. Unfortunately, this is hard to achieve because cloud-native infrastructure is ephemeral.

Misconfigurations - Root Cause of Cloud Attacks

Misconfigured cloud assets remain one of the most prevalent causes of cloud breaches. Gartner predicted that misconfigurations would cause 99% of cloud attacks through 2025. It is important to note that this prediction covers all cloud resources, including cloud security resources. So regardless of how efficient a cloud security mechanism might be, its effectiveness is eroded if it is not well configured. Furthermore, misconfigurations are introduced from various sources, including deployments and routine maintenance.

Figure 3. Cloud-Native Attack Surface Showing Multiple Attack Paths that Transpire Across Several Abstraction Layers (4Cs of Cloud-Native Security)

Security Silos - Introduce Blindspots

The cloud operating model builds on multiple layers of abstraction. Accordingly, security mechanisms are designed to align with these abstraction layers to achieve a `Defense-in-Depth` model. This model, also referred to as the 4Cs of cloud-native security, proposes positioning security systems at the four abstraction layers: code, container, cluster, and cloud. While this model has multiple advantages and protects to a large extent, it fails to address multi-layered attacks. This failure results from a siloed security architecture, particularly when the cloud-native security systems deployed at the various abstraction layers operate independently, i.e., without sharing context. The impact is that attacks spanning two or more abstraction layers are not easily detected. Ultimately, end users risk having a false sense of security, a situation where all seems normal and secure until an attack succeeds, aka security theatre.
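The blind spot can be illustrated with a toy example. The sketch below is an assumption-laden simplification, not a description of any real product: the event tuples, the suspicion scores, and the `ALERT_THRESHOLD` value are all hypothetical. It contrasts tools that each judge only their own layer with a view that correlates events across layers for the same workload.

```python
from collections import defaultdict

# Hypothetical events: (layer, workload, suspicion score between 0 and 1).
# No single event is alarming enough on its own.
EVENTS = [
    ("code", "payments", 0.4),       # e.g. vulnerable dependency exercised
    ("container", "payments", 0.4),  # e.g. unexpected shell spawned
    ("cloud", "payments", 0.4),      # e.g. unusual storage access
    ("cluster", "search", 0.3),      # unrelated low-level noise
]

ALERT_THRESHOLD = 0.8  # assumed cut-off used by every tool


def siloed_alerts(events):
    """Each layer's tool sees only its own events, one at a time."""
    return [(layer, workload)
            for layer, workload, score in events
            if score >= ALERT_THRESHOLD]


def correlated_alerts(events):
    """Combine scores per workload across layers before deciding."""
    totals = defaultdict(float)
    layers = defaultdict(set)
    for layer, workload, score in events:
        totals[workload] += score
        layers[workload].add(layer)
    return sorted(workload for workload in totals
                  if len(layers[workload]) >= 2
                  and totals[workload] >= ALERT_THRESHOLD)


print("siloed:", siloed_alerts(EVENTS))
print("correlated:", correlated_alerts(EVENTS))
```

In this toy data set, the siloed tools raise nothing because every individual event stays below the threshold, while the correlated view flags the `payments` workload because weak signals from three layers add up.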

Security Chaos Engineering

It is becoming increasingly clear that security in cloud-native infrastructure is more about resilience than "just" security. Despite the vast number of cloud-native security products appearing daily, breaches still occur!

Firefighting Versus Fire Resilience

The immense breadth of the cloud-native attack surface and the many possibilities for attack require a shift of mindset from `firefighting` to becoming resilient against fires (fire resilience), as rightly asserted by Dino Dai Zovi. Cloud-native security should balance keeping attackers out with resisting attacks that get through. This calls for a shift from attack prevention to an `Assume Breach` mindset. Werner Vogels, CTO of Amazon, declared: "Failures are a given, and everything will eventually fail over time." Similarly, security failures are inevitable in the cloud. It is, therefore, imperative to shift focus to attack detection, recovery, and resistance.

Figure 4. Tweet by Dino A. Dai Zovi About Adopting A Fire Resilience Mindset (Source: Twitter)

Assume Breach Mindset

Security Chaos Engineering enables organizations to adopt an assume breach mindset with confidence. Similar to how Chaos Engineering has enabled resilience against availability failures, Security Chaos Engineering enables resilience against integrity and confidentiality failures, as well as availability failures. The same principles of chaos engineering are applicable, though adapted to fit the desired security objectives. By injecting security failures into cloud-native infrastructure, the actual behaviour of security controls becomes apparent through observation. These observations result in empirical and tangible knowledge, which can be leveraged for proactive and iterative security hardening.
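The inject-observe-harden loop described above can be sketched in a few lines. This is an illustrative simulation only and does not represent Mitigant's platform or any cloud provider API: `Bucket`, `Detector`, and `run_experiment` are hypothetical names, and the "cloud" is a list of in-memory objects. The experiment injects one security failure (a bucket made public), observes whether the control notices, and always rolls the change back.

```python
from dataclasses import dataclass, field


@dataclass
class Bucket:
    """In-memory stand-in for a cloud storage bucket (hypothetical model)."""
    name: str
    public: bool = False


@dataclass
class Detector:
    """Hypothetical security control that should flag public buckets."""
    enabled: bool = True
    alerts: list = field(default_factory=list)

    def scan(self, buckets):
        if not self.enabled:  # a silently disabled control detects nothing
            return
        for bucket in buckets:
            if bucket.public:
                self.alerts.append(f"public bucket: {bucket.name}")


def run_experiment(buckets, detector, target_name):
    """Inject one security failure, observe the control, then roll back."""
    target = next(b for b in buckets if b.name == target_name)
    original_state = target.public
    alerts_before = len(detector.alerts)
    target.public = True  # inject the failure
    try:
        detector.scan(buckets)  # observe the control's real behaviour
        detected = len(detector.alerts) > alerts_before
    finally:
        target.public = original_state  # always restore the original state
    return {
        "hypothesis": "a public bucket is detected by the control",
        "hypothesis_holds": detected,
    }


buckets = [Bucket("invoices"), Bucket("logs")]
healthy = run_experiment(buckets, Detector(enabled=True), "invoices")
broken = run_experiment(buckets, Detector(enabled=False), "invoices")
print("control enabled  ->", healthy["hypothesis_holds"])
print("control disabled ->", broken["hypothesis_holds"])
```

The second run is the valuable one: a control that everyone assumed was working is shown, by evidence rather than by documentation, to miss the injected failure, which is exactly the kind of finding that feeds iterative hardening.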

Figure 5. Mitigant's Security Chaos Engineering Platform

Mitigant's Security Chaos Engineering Platform

Implementing Security Chaos Engineering from the ground up could be a burdensome task for most enterprises. In addition, the technical know-how is scarce, and the time and effort required are barely affordable for most enterprises. Given the knowledge gained from an academic research background and industry experience, Mitigant's founders are well-positioned to commoditize Security Chaos Engineering.

We want a future where every enterprise can leverage Security Chaos Engineering to become resilient to cloud attacks. Hence, we have built a SaaS offering that allows easy adoption, drastically lowering the steep knowledge and skill requirements that would otherwise apply. For more practical use cases, read Part I and Part II of our blog posts on defeating ransomware with Security Chaos Engineering. The second part of this blog article will provide more exciting insights into Security Chaos Engineering, so subscribe to our blog to be notified when it is published.
