Epic Cloud Security Automation: Fixing the Broken RCP
Due to limits in AWS condition keys, the Resource Control Policy we created won't work. We'll start implementing advanced automation to achieve the intended outcome.
Prerequisites
Completed our RCP lab
Ideally you have the AWS Organization we used for these labs. Worst case you can use another org, but you will want to adjust this lab sequence for your environment.
The Lesson
This lab is the start of an extended sequence I’m calling Advanced Cloud Security Problem Solving. The objective is to show you specific skills and help you understand how to think your way through a problem. These labs are meant to be done in order and do not stand alone. How many labs are in this sequence? I have 8 mapped out, but that might vary as I get into actually building them all out.
Well, I made another oopsie in a lab. Two, actually, in the same lab. I’ve known about these for a while but they didn’t break anything currently running, and the real fix to meet the same objective required a solid block of lab time to get into some deep cloud mystery.
So what happened? Well, the core problem is that I was rushing the Resource Control Policy lab since it was a busy week, and I applied and tested it incorrectly. So I didn’t see the errors until friend-of-the-SLAW Chris Farris went to test it before our AWS re:Invent session and realized it broke. There were two issues in the JSON of that RCP:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyAccessToSensitiveData",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:*",
      "Resource": "*",
      "Condition": {
        "StringEqualsIfExists": {
          "aws:ResourceTag/classification": "sensitive"
        },
        "StringNotEqualsIfExists": {
          "aws:SourceOrgID": "my-org-id"
        },
        "BoolIfExists": {
          "aws:PrincipalIsAWSService": "false"
        }
      }
    }
  ]
}
You cannot use bucket tags in a condition. I knew this, I forgot this, and I messed up. The aws:ResourceTag statement is ignored and that condition is never applied. In S3, tags only work in IAM conditions at the object level, which… is not ideal.
aws:SourceOrgID is wrong; I should have used aws:PrincipalOrgID. SourceOrgID is for when two different services are talking to each other across accounts in your org. We wanted to evaluate the calling identity, which is the principal. This is something I knew; it was just a quick copy/paste failure.
The end result: as we set this up, all access to the buckets breaks, and even if we fix the SourceOrgID problem, the tag condition is never evaluated, so access stays open.
This is a limit of RCPs: you can’t manage access based on bucket tags!!!
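Since bucket tags are invisible to policy conditions, anything that wants to act on them has to fetch them from the S3 API. Here’s a minimal boto3 sketch of that lookup (the bucket name is a placeholder):
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

def is_sensitive(bucket_name: str) -> bool:
    """Return True if the bucket carries classification=sensitive."""
    try:
        tags = s3.get_bucket_tagging(Bucket=bucket_name)["TagSet"]
    except ClientError as e:
        # Buckets with no tags at all raise NoSuchTagSet
        if e.response["Error"]["Code"] == "NoSuchTagSet":
            return False
        raise
    return any(
        t["Key"] == "classification" and t["Value"] == "sensitive"
        for t in tags
    )

print(is_sensitive("my-example-bucket"))  # placeholder bucket name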
Look, as long as I’ve been doing this I still learn every day, and forget just as much every night when I lay my weary head to rest. But this is a great opportunity to go deep into security automation and guardrails.
Blah! We’ll just do it the hard way!
Just because I can’t write an RCP to achieve our desired outcome doesn’t mean I cannot achieve it. I’ve been building security automations since before AWS Organizations even existed, never mind SCPs and all the other cool tools we have now. Heck, I even founded a security automation startup that was acquired by FireMon.
As a reminder, security invariants are rules which prevent security issues from happening. Unlike other defenses, an invariant is always true. In other words, there aren’t exceptions or ways around it.
When we look at security invariants there are multiple implementation pathways which are typically usable in every provider:
Infrastructure as Code scanning
Organization-based policies, like SCPs, RCPs, and declarative policies
Configuration settings (what we manage with declarative policies)
Identity-based policies
Automation guardrails (either bring your own or using something built into the provider)
To review our intended invariant, our desired outcome is that if I tag a bucket with a classification of sensitive, it cannot be accessed from outside my organization.
Okay, in our situation we learned that organization-based policies can’t help us, since we can’t set a condition on a bucket tag. We aren’t locking things down to use IaC only for deployments, and there aren’t configuration settings (like BPA) or identity-based policies that can do what we want. That leaves…
Automation Guardrails
An automation guardrail looks for a condition and then either loops someone in to fix it, or performs an autoremediation. We want to build an invariant, so we will go with autoremediation — that way anytime a bucket is tagged as sensitive, we’ll automatically change its configuration to lock it down to only being accessible from our Organization.
To pull this off we need to:
Identify all buckets tagged sensitive
Identify when a bucket is newly tagged sensitive, or the tag is removed
Determine whether the bucket is accessible from outside the organization
Update the security controls to restrict access to our organization (see the sketch after this list)
Track any configuration changes (e.g., bucket policy updates) to ensure access isn’t opened to outside our organization.
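To make the fourth requirement concrete, here’s a rough sketch of the remediation action itself: applying a per-bucket policy that denies access from outside the organization via aws:PrincipalOrgID. The bucket name and org ID are placeholders, and a production version would merge with any existing bucket policy rather than overwrite it:
import json
import boto3

s3 = boto3.client("s3")

def lock_bucket_to_org(bucket_name: str, org_id: str) -> None:
    """Deny access from outside the org. WARNING: put_bucket_policy
    replaces any existing policy; real code should merge statements."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "DenyOutsideOrg",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [
                f"arn:aws:s3:::{bucket_name}",
                f"arn:aws:s3:::{bucket_name}/*",
            ],
            "Condition": {
                "StringNotEqualsIfExists": {"aws:PrincipalOrgID": org_id},
                "BoolIfExists": {"aws:PrincipalIsAWSService": "false"},
            },
        }],
    }
    s3.put_bucket_policy(Bucket=bucket_name, Policy=json.dumps(policy))

lock_bucket_to_org("my-example-bucket", "o-xxxxxxxx")  # placeholders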
There is a lot of nuance in those requirements, and this isn’t a simple one-and-done automation. For example, do we want to allow people to take the sensitive tag off a bucket? If we apply this to an existing account, do we want to slap on our new controls, or check with the owners first? Who is allowed to change the tags?
Also, there are multiple ways to solve automation problems. In some cases we might be able to use AWS Config, or we might want to use a third party tool, or we could want to do it ourselves; but where do we want to run the automation, and who do we want to maintain it?
Over the next batch of labs I’ll walk you through how I think about the problem, and we will implement a common automation pattern I like. This particular desired outcome is a great one because it has just enough built-in complexity to really explore the foundation of automation.
Real Time or Scan Based? YES!
One of the first decisions we need to make is whether our automation will operate in real time as changes are made, or scan for misconfigurations already in place. Personally, I try to implement my guardrails as close to real time as possible. I really don’t think you can rely on an hourly or daily scan (which is what most tools do) for a true invariant.
In our case this means watching for API calls to know when a bucket is tagged or un-tagged, or its access changes. That said, we also need time-based and triggerable scans, because what if an account is moved into the OU where we set our invariant? Or maybe we missed an API call because even Amazon’s own services can’t guarantee to catch them all?
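For the scan side, one common pattern (a sketch only; the rule and Lambda names are made up) is a scheduled EventBridge rule that periodically triggers a sweep, catching anything the real-time path missed:
import boto3

events = boto3.client("events", region_name="us-east-1")

# Scheduled rules only work on the default event bus; this one fires hourly
events.put_rule(
    Name="SensitiveBucketSweep",  # hypothetical rule name
    ScheduleExpression="rate(1 hour)",
    State="ENABLED",
)

# Point the schedule at a (hypothetical) sweep Lambda
events.put_targets(
    Rule="SensitiveBucketSweep",
    Targets=[{
        "Id": "sweep-lambda",
        "Arn": "arn:aws:lambda:us-east-1:111111111111:function:SweepSensitiveBuckets",
    }],
)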
So for the first block of labs we’ll start with EventBridge. If you recall from our Security Hub lab, EventBridge is the AWS native service for handling events. In our case we consolidated all Security Hub events (including GuardDuty and AccessAnalyzer) into our SecurityAudit account, and we wrote rules to send email in case bad things happen.
But EventBridge can do more. We will use three key features to build the core of our real-time automation:
We can create custom event buses which receive only certain events.
With CloudTrail enabled, we can collect API call events in real time (sketched after this list).
We can forward events from one event bus to another, even across accounts.
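As a preview of where this is heading, here’s a rough sketch of the kind of rule we’ll eventually build (the rule name is hypothetical, and exactly where the rule lives is a decision for a later lab). It matches the CloudTrail-recorded S3 tagging calls we care about:
import json
import boto3

events = boto3.client("events", region_name="us-east-1")

# Match S3 bucket tagging changes as recorded by CloudTrail
pattern = {
    "source": ["aws.s3"],
    "detail-type": ["AWS API Call via CloudTrail"],
    "detail": {
        "eventSource": ["s3.amazonaws.com"],
        "eventName": ["PutBucketTagging", "DeleteBucketTagging"],
    },
}

events.put_rule(
    Name="SensitiveBucketTagChanges",  # hypothetical rule name
    EventPattern=json.dumps(pattern),
    State="ENABLED",
)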
This is where we’ll start. This week we will fix our broken RCP so it doesn’t get in the way, then set up two custom event buses (one for each region we use). In future labs we’ll learn how to write our rules, push everything out using centralized Infrastructure as Code (StackSets), build Lambda functions, and more!
Remember: there are many ways to reach our desired outcome. I’m taking you down a path I like, but this isn’t the only path I use, and I’ll do my best to tell you why I make each decision as we go.
Key Lesson Points
You cannot use bucket tags as a condition in IAM, SCPs, or RCPs (object tags do work).
When you can’t use an SCP/RCP/Declarative Policy, you can often use an autoremediation guardrail to achieve the desired outcome.
The Lab
First we will remove our RCP so it doesn’t break anything later, then we will set up EventBridge in our SecurityOperations account to receive events from any other account in our Organization. We will use this as the central hub to drive the autoremediation we will build in this series of labs.
One cool thing is that this architecture is both flexible and updatable. We will use it in future labs for different kinds of guardrails!
Video Walkthrough
Step-by-Step
Go to your Sign-in portal > CloudSLAW > AdministratorAccess > Organizations > Policies > Resource control policies (yes, that’s like 6 clicks, but you’ve done this a ton already):
Then click S3 Classified Sensitive:
If you wanted to fix this, the JSON below would restrict all access from outside your organization — but on all buckets in the OU/account where the RCP is applied! This is a big hammer and I don’t recommend it, but this “fixes” the RCP by removing the unenforceable condition and using the correct aws:PrincipalOrgID:
You do not need to do this, and even if you do you still need to detach the RCP. This is for reference purposes only!
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyAccessToSensitiveData",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:*",
      "Resource": "*",
      "Condition": {
        "StringNotEqualsIfExists": {
          "aws:PrincipalOrgID": "my-org-id"
        },
        "BoolIfExists": {
          "aws:PrincipalIsAWSService": "false"
        }
      }
    }
  ]
}
Instead we will detach and then delete the policy (if you want to play with it feel free to do something different, but you’re on your own).
Click the radio button and then Detach for each OU:
Then repeat for Workloads
When Targets is empty, click Delete and confirm.
For this next bit we need to collect a little information and fill in a block of JSON as we go. This is the resource policy we will apply to the new event bus we are creating; it says anything in the same org can publish to that event bus (once we create it). I’ll explain as we go, but since we’re already in Organizations, let’s grab the Organization ID now. Paste this JSON into a text editor, then fill in your Organization ID:
The JSON to paste and fill in:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowAllAccountsInOrganizationToPutEvents",
      "Effect": "Allow",
      "Principal": "*",
      "Action": "events:PutEvents",
      "Resource": "arn:aws:events:us-east-1:xxxxxxxxxxxx:event-bus/SecurityAutomation",
      "Condition": {
        "StringEquals": {
          "aws:PrincipalOrgID": "o-xxxxxxxx"
        }
      }
    }
  ]
}
Like this:
Now close the CloudSLAW browser tab and log in to SecurityOperations > AdministratorAccess:
Okay, here’s how and why I decided to use the SecurityOperations account. The pattern I tend to follow is that my monitoring-only tools (and analysis tools) go into SecurityAudit. This includes Security Hub, log analysis tools, etc. SecurityOperations is for tools and people who make changes. In my case, this is often also the account where I run incident response.
There are alternate approaches. For example, SecurityOperations could be for general security tools such as antimalware, firewall management, etc. Then you might have a dedicated automation account and a different incident response account.
All these options are valid. The decision usually comes down to whatever was done before you got there, and/or your team structure. For our purposes SecurityOperations is a good home for any “makes changes” stuff.
We will create something called a custom event bus. Before now we’ve just used the default event bus, which collects all the events in a given region and account. We don’t want to flood the default with events from other accounts, and as you’ll see we will get to choose the events we want to send to this new one we are creating. Like many things, event buses are region-specific, and you can’t send events across regions (easily). If you recall we locked out all our regions except us-east-1 and us-west-2 (N. Virginia and Oregon). So we need to build all our autoremediation in both regions. Mostly we will use CloudFormation to do this, but today we will build them manually since it only takes a few seconds.
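If you’d ever rather script it, this is roughly the boto3 equivalent of what we’re about to do by hand in both regions (account and org IDs are placeholders, and the policy is the same one we just filled in):
import json
import boto3

ACCOUNT_ID = "xxxxxxxxxxxx"  # SecurityOperations account ID (placeholder)
ORG_ID = "o-xxxxxxxx"        # your Organization ID (placeholder)

for region in ["us-east-1", "us-west-2"]:
    events = boto3.client("events", region_name=region)
    events.create_event_bus(Name="SecurityAutomation")
    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "AllowAllAccountsInOrganizationToPutEvents",
            "Effect": "Allow",
            "Principal": "*",
            "Action": "events:PutEvents",
            "Resource": f"arn:aws:events:{region}:{ACCOUNT_ID}:event-bus/SecurityAutomation",
            "Condition": {"StringEquals": {"aws:PrincipalOrgID": ORG_ID}},
        }],
    }
    events.put_permission(EventBusName="SecurityAutomation", Policy=json.dumps(policy))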
Sooo…. Change to N. Virginia > copy your account ID and paste into the JSON:
You are in N. Virginia now, right? You sure? Okay, let’s keep going…
EventBridge > Event bus > Create event bus:
Name it SecurityAutomation and provide a description of Event bus for security operations automations. Then click Load template:
Now paste in your template (over the one that’s there). Remember to make sure:
You are in N. Virginia
You pasted in the SecurityOperations account ID (current account)
You are naming your event bus SecurityAutomation
You pasted in your Organization ID
If you think back to when we first discussed bucket policies, I mentioned that resources which can be accessed from outside your account nearly always support a resource policy. We use this one to ensure only accounts within our organization can send events to this event bus. You could open it up to basically anyone, and that isn’t necessarily a security concern. Over at FireMon Cloud Defense we have a (mostly) open event bus we use to collect customer events. Yes, we use the same technique at commercial scale.
Scroll to the bottom and click Create.
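Once the bus exists, any principal in the org can publish to it directly by referencing its full ARN. Here’s a quick illustrative sketch from a member account (the source name and detail payload are made up):
import json
import boto3

events = boto3.client("events", region_name="us-east-1")

# Publish a test event straight to the central SecurityAutomation bus
events.put_events(Entries=[{
    "Source": "cloudslaw.demo",  # hypothetical source name
    "DetailType": "Test Event",
    "Detail": json.dumps({"hello": "from a member account"}),
    "EventBusName": "arn:aws:events:us-east-1:xxxxxxxxxxxx:event-bus/SecurityAutomation",
}])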
Okay, now we need to follow the exact same process for Oregon. First change your region to Oregon:
Before you paste in the template, you need to change the ARN for the event bus to reflect the different region. Change us-east-1 to us-west-2:
Now repeat the steps to create an event bus in Oregon.
Sorry, but you don’t get new screenshots. Use the same name and description, and the new template, then Create.
And that’s it! We are all set to receive events, so you can probably guess what some of our next labs will be.
Lab Key Points
EventBridge comes with a default event bus for every region of every account, but we can also create dedicated custom event buses.
Event buses support resource policies, which control who can send events to them; ours allows every account in our Organization.
Event buses are region-specific. To collect events in different regions you need to create a custom event bus in each region (and later we’ll also duplicate EventBridge rules, Lambda functions, etc.).
-Rich