Security Chaos Engineering 101: Fundamentals
This article positions Security Chaos Engineering (SCE) as an aspect of security engineering. The primary motivation is to help security engineers and other cyber-security professionals view SCE as part of security engineering rather than as an esoteric craft. Adopting this mindset is vital for demystifying SCE, which is a prerequisite for gaining its benefits. The article also addresses two common misconceptions about SCE to provide objective information and clarity.
Note: This is the first part of a series, and some of the content was covered in our recent webinar, Getting Started With Security Chaos Engineering (Link to the recording).
See Part II here and Part III here.
Security Chaos Engineering as an Aspect of Security Engineering
Security Chaos Engineering (SCE) is a novel approach to cyber security. Its fundamentals are based on the principles of chaos engineering, but its objective is to enable cyber resiliency. Chaos engineering allows enterprises to survive outages caused by availability and performance-related faults; similarly, by adopting SCE, enterprises can become resilient to cyber attacks, e.g. ransomware attacks.
Security Engineering
Security engineering focuses on designing and building systems and applications with security in mind. It involves applying security principles and best practices to the entire development lifecycle, from requirements analysis and design to implementation and testing. Security engineering aims to create systems that are secure by design and can resist various types of attacks and threats.
In his book “Security Engineering”, renowned cyber security expert and Professor of Security Engineering Ross Anderson asserts that “Security engineering is about building systems to remain dependable in the face of malice, error or mischance”. Anderson further provides insights on the importance of conducting security tests as part of security engineering efforts. Ideally, security tests aim to ensure that systems are implemented as planned and, in later stages, to identify vulnerabilities in the system. One way of looking at SCE is therefore from this viewpoint, especially when applying SCE to application security.
SCE can be applied to the entire development lifecycle, especially across the 4Cs of cloud-native security. This article focuses on its application to the cloud infrastructure layer. Given that this dimension of security engineering falls under cloud infrastructure, it makes sense to refer to it as cloud security engineering. Cloud security engineers are responsible for several cyber security tasks that keep cloud infrastructure secure, and these tasks can be categorized under the umbrella term cloud security engineering. Typically, cloud security engineers are expected to perform SCE experiments; nevertheless, SCE can be leveraged by different cyber security professionals. There are two applications of SCE in cloud security engineering: cloud security verification and cloud cyber resiliency verification.
Cloud Security Verification
SCE can be leveraged to verify cloud security controls. Because of the complexity involved, verifying cybersecurity controls is not a commonly discussed topic. In traditional, on-premises systems, the output from security controls is largely taken as truth. However, this stance is quite risky in cloud infrastructure: cloud security controls are deployed together with the infrastructure (e.g. via Infrastructure-as-Code), which increases the risk of misconfiguration-based vulnerabilities. Several other challenges make cloud security difficult: potential compromise of cloud security controls, rapidly expanding cloud attack surfaces and increasingly complex infrastructure. Hence, continuously verifying cloud security controls is imperative for a healthy cloud security posture. Cloud security teams can leverage SCE to verify that security controls are performing as expected, i.e. preventing violations of security attributes (confidentiality, integrity and availability). For example, the image below illustrates a cloud security verification experiment that verifies whether AWS GuardDuty sends a notification to a Slack channel whenever an S3 bucket becomes public.
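The shape of such a verification experiment can be sketched in a few lines of Python: inject a misconfiguration, check whether the control reacted, and always roll back. This is a minimal sketch under stated assumptions, not a definitive implementation. All names (`run_verification_experiment`, `FakeBucket`, `FakeDetector`) are hypothetical, and the in-memory stand-ins replace real AWS calls so the example runs without a cloud account; in a real experiment the three callables would wrap the S3, GuardDuty and Slack interactions.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    hypothesis: str
    control_fired: bool


def run_verification_experiment(
    hypothesis: str,
    inject: Callable[[], None],
    control_fired: Callable[[], bool],
    rollback: Callable[[], None],
) -> ExperimentResult:
    """Inject a fault, observe the security control, and always roll back."""
    try:
        inject()
        fired = control_fired()
    finally:
        # Roll back even if injection or observation raised an error.
        rollback()
    return ExperimentResult(hypothesis, fired)


class FakeBucket:
    """Hypothetical in-memory stand-in for an S3 bucket."""

    def __init__(self) -> None:
        self.public = False

    def make_public(self) -> None:
        self.public = True

    def make_private(self) -> None:
        self.public = False


class FakeDetector:
    """Hypothetical stand-in for a detection control watching the bucket."""

    def __init__(self, bucket: FakeBucket) -> None:
        self.bucket = bucket

    def has_finding(self) -> bool:
        return self.bucket.public


bucket = FakeBucket()
detector = FakeDetector(bucket)
result = run_verification_experiment(
    hypothesis="A public S3 bucket triggers a notification",
    inject=bucket.make_public,
    control_fired=detector.has_finding,
    rollback=bucket.make_private,
)
print(result.control_fired)  # True: the stand-in control reacted
print(bucket.public)         # False: the change was rolled back
```

Passing the injection, observation and rollback steps as callables keeps the harness independent of any particular cloud SDK, so the same structure can be exercised locally before it is pointed at real infrastructure.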
Cloud Cyber Resiliency Verification
The most crucial and disruptive aim of SCE is to enable cyber resiliency. This aspect is the least understood because the definition and practice of cyber resiliency are often misunderstood, abused and quite abstract. The software and platform engineering disciplines have already formed a mature understanding of resilience engineering, including frameworks that fast-track practical implementation. For example, most cloud service providers already have a resiliency pillar as part of their Well-Architected Framework. In most cases, some aspects of cyber resiliency are subsumed under resiliency/reliability. This does not really help cyber-security professionals, as the definitions and requirements are often fuzzy. These differences can be clarified by establishing tools and approaches for implementing and verifying cyber resiliency; this is where SCE comes in.
It is essential to continuously verify the cyber resiliency posture of cloud infrastructures, and this can be achieved via a variety of approaches. One approach is to inject actual attacks (most people prefer the term attack simulation) and observe how the infrastructure reacts. These attacks test the response of the cyber resiliency mechanisms. For example, by injecting a ransomware attack, it becomes evident whether the ransomware countermeasures are effective. The output is not binary; there are many sides to it. It is useful to measure metrics such as the Mean Time to Respond (MTTR), Recovery Time Objective (RTO) and Recovery Point Objective (RPO). The last two metrics are essential aspects of the Reliability Pillar of the AWS Well-Architected Framework.
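To illustrate how these metrics could be derived from experiment timestamps, here is a minimal Python sketch. The `AttackRun` record, its field names and the sample values are assumptions made for illustration; they are not taken from any specific tool, and the timestamps are made up.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import List


@dataclass
class AttackRun:
    """Hypothetical record of one injected attack and the observed reaction."""
    injected_at: datetime
    responded_at: datetime
    recovered_at: datetime
    last_good_backup_at: datetime


def mean_time_to_respond(runs: List[AttackRun]) -> timedelta:
    """Average delay between attack injection and the first response."""
    seconds = [(r.responded_at - r.injected_at).total_seconds() for r in runs]
    return timedelta(seconds=mean(seconds))


def meets_rto(run: AttackRun, rto: timedelta) -> bool:
    """True if recovery completed within the Recovery Time Objective."""
    return run.recovered_at - run.injected_at <= rto


def meets_rpo(run: AttackRun, rpo: timedelta) -> bool:
    """True if the data-loss window stays within the Recovery Point Objective."""
    return run.injected_at - run.last_good_backup_at <= rpo


t0 = datetime(2024, 1, 1, 12, 0)  # illustrative injection time
runs = [
    AttackRun(t0, t0 + timedelta(minutes=4), t0 + timedelta(minutes=50),
              t0 - timedelta(hours=2)),
    AttackRun(t0, t0 + timedelta(minutes=6), t0 + timedelta(minutes=70),
              t0 - timedelta(minutes=30)),
]

print(mean_time_to_respond(runs))                 # 0:05:00
print(meets_rto(runs[0], timedelta(minutes=60)))  # True: recovered in 50 minutes
print(meets_rpo(runs[0], timedelta(hours=1)))     # False: last backup was 2 hours old
```

Recording each run this way shows why the outcome is not binary: the first run meets its recovery time target yet still exceeds the acceptable data-loss window.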
Misconceptions About Security Chaos Engineering
The two biggest misconceptions about SCE are that it is about large-scale, complex attacks and that it starts in production environments. Let us discuss these misconceptions.
Misconception 01: SCE is About Large-Scale, Complex Attacks
There is a widespread but wrong impression that SCE is only achieved by launching large-scale, complex attacks. I suspect this way of thinking is highly influenced by the perception created by Netflix’s Simian Army, e.g. the Chaos Gorilla that took entire AWS availability zones offline. It is instructive to know that Chaos Monkey was the first and original member of the Simian Army, and it did one thing: take one or more AWS instances offline. The remaining Simian Army members, e.g. Chaos Kong and Chaos Gorilla, were introduced to build on the initial success of Chaos Monkey. This step-by-step adoption of chaos engineering illustrates the need for organizational maturity.
It is, therefore, essential to start SCE with basic tests, e.g. making AWS S3 buckets public to observe whether security monitoring mechanisms trigger alerts and, even better, how fast the alerts arrive. This basic example (which we will dive into in the next blog post) already provides crucial learning opportunities for security and cyber-resiliency efforts and should be taken seriously.
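Measuring how fast an alert arrives can be sketched as a polling loop with a timeout. The following is a minimal sketch, not a definitive implementation: `measure_alert_latency` and `FakeClock` are hypothetical names, and the `alert_seen` callable is an assumption standing in for whatever check your monitoring stack offers. The fake clock makes the demo deterministic and instant; in a real experiment the default `time.monotonic` and `time.sleep` would be used.

```python
import time
from typing import Callable, Optional


def measure_alert_latency(
    alert_seen: Callable[[], bool],
    timeout_s: float,
    poll_interval_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Return seconds until alert_seen() is True, or None on timeout."""
    start = clock()
    while True:
        if alert_seen():
            return clock() - start
        if clock() - start >= timeout_s:
            return None  # The control never fired: a finding in itself.
        sleep(poll_interval_s)


class FakeClock:
    """Hypothetical deterministic clock so the demo needs no real waiting."""

    def __init__(self) -> None:
        self.now = 0.0

    def time(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        self.now += seconds


# Simulated control that raises its alert three seconds after injection.
fc = FakeClock()
latency = measure_alert_latency(
    alert_seen=lambda: fc.now >= 3,
    timeout_s=10,
    clock=fc.time,
    sleep=fc.sleep,
)
print(latency)  # 3.0

# Simulated control that never alerts within the timeout.
fc2 = FakeClock()
missed = measure_alert_latency(
    alert_seen=lambda: False,
    timeout_s=2,
    clock=fc2.time,
    sleep=fc2.sleep,
)
print(missed)  # None
```

A `None` result is as informative as a latency value: it indicates that the monitoring mechanism did not react within the window the team considers acceptable.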
Misconception 02: SCE Starts in Production Environments
As with the previous misconception, most folks who come across chaos engineering refer to Netflix and the definition of chaos engineering on the Principles of Chaos Engineering website. However, it is essential to understand that the engineers who popularized chaos engineering were skilled and quite experienced. Also, they worked in an organization facing huge availability challenges (risk profile) that also had the resources to tackle these challenges with a chaos engineering approach (budget). Accordingly, they could afford to move chaos engineering experiments to production quickly.
Most engineering practices start from a developer's local environment and gradually move to a development environment before finally arriving at production. This approach allows for a gradual and intentional build-up of contextual and practical knowledge. Adopting SCE should follow a similar process: start writing your scripts in the environment you feel most comfortable in, improve them, and aim to eventually arrive at production, because that is the environment you want to protect. The cadence of moving from one environment to another is determined by many factors: skill sets, company threat profile, business requirements etc.
As a yardstick, therefore, the following factors should be considered when contemplating moving security chaos tests to production: skills of the security engineering team, enterprise risk profile, business requirements and security budget. The bottom line is these factors are not blockers to adopting SCE; you can always start from your local or development environment.
Leveraging Mitigant SCE Platform to Adopt Security Chaos Engineering Seamlessly
The cost of building a security chaos engineering strategy is daunting for most enterprises. Moreover, the technical know-how is relatively unavailable. Mitigant solves these challenges by providing a SaaS platform.
The Mitigant SCE platform consists of several cloud attacks which can be leveraged as building blocks for constructing complex attack scenarios against AWS infrastructure. The platform enables safe and controlled SCE experiments: attacks can be started and stopped with button clicks, and all changes made to the cloud infrastructure are rolled back and restored seamlessly. Additionally, all attacks are mapped to MITRE ATT&CK, so experiments can reflect real-world attacks seen in the wild.
Mitigant SCE platform aims to facilitate cyber-resiliency as a first-class citizen in cloud-native infrastructure. It is suitable for companies of all sizes and allows quick and safe adoption of SCE without going through the cost and resource overhead already highlighted above. Please do not hesitate to contact us if you want to adopt Security Chaos Engineering.