Schedule Security Scanning with a Serverless Fanout Pattern
Running scheduled scans at scale can be challenging, but we'll knock it out in mere seconds with this cool serverless design pattern.
CloudSLAW is, and always will be, free. But to help cover costs and keep the content up to date we have an optional Patreon. For $10 per month you get access to our support Discord, office hours, exclusive subscriber content (not labs — those are free — just extras) and more. Check it out!
Prerequisites
This is part 9 (the final lab!) of the Advanced Cloud Security Problem Solving series, which I’ve been calling Epic Automation. If you haven’t completed parts 1-8 yet, jump back to the first post in the series and complete all prior posts before trying this one.
The Lesson
This is, I swear, the absolute last lab in our Advanced Cloud Security Problem Solving series! (Epic Automation fit the title card better).
By now you’ve probably realized that I, uh, maybe pulled a fast one on you. It seems we’ve spent these 9 labs learning as much about development and cloud architecture patterns as about cloud security. The thing is, these two skills are impossible to separate if you want to work as a cloud security professional. Just as a firewall engineer needs to understand network topologies, technologies, and how to interpret packet captures… on the cloud side we need to understand our own fair share of non-security fundamentals.
Today’s lab is a very cool one, which I was excited to get up and running. If you recall, way back in the first lab of this series, we defined our objective: “if a bucket is tagged sensitive, only allow access from within my organization.” To do this we spent 8 labs building out an event-driven architecture that would assess any bucket as its configuration changed, and then enforce our desired settings. The diagram looked like this:

So far we’ve built out basically everything except that teeny-tiny little blue box on the left side that says “Time: Every Hour”. And, uh, that little blue box contains a big ol’ can of worms.
Scaling with a Lambda Fanout Pattern
In our little environment we only have one S3 bucket to care about, so running a time-based scan isn’t a big deal. But imagine an organization with many thousands of buckets spread across hundreds of accounts (which I personally have, and a big enterprise can be a factor of 10 larger than that). Running a scan at that scale, across multiple accounts, would be both massively time consuming and incredibly inefficient.
We don’t call AWS a “hyperscale cloud provider” for nothing!
What we have already built is an event-driven architecture pattern that only runs as things change. No changes? Nothing runs. Modify 10 buckets at once? We invoke 10 lambda functions in parallel. This architecture is incredibly scalable and efficient, and nicely shows off the power of on-demand cloud computing. What makes it so efficient is that we know exactly which bucket changed, and our autoremediation lambda only has to look at that one bucket, and only when it changes.
However, this doesn’t work for time-based scans: with no event to tell us which bucket changed, a scan has to look at every single bucket.
The good news is we have other patterns we can use, and the one we will cover today is a lambda fanout. We will use a series of 3 different lambda functions to identify and then scan every bucket, basically in parallel. Each lambda can trigger other lambdas in a fan pattern to run as much in parallel as possible. Here’s how it works:
We run an EventBridge Rule which triggers our first lambda every 12 hours (I picked this instead of every hour to save costs).
That lambda, called enumerate-accounts, takes an environment variable specifying which OU to start in. This is just our Production OU because we defined that as the only place where we want to enforce isolation. enumerate-accounts does two things:
It identifies any child OUs. For each child OU it invokes another version of itself. This continues to fan out until there are no more child OUs. Each of these is a separate invocation running a new copy of the lambda, so it’s incredibly fast.
It identifies every account in the current OU. For each account, it invokes the enumerate-tagged-buckets lambda function. So if it finds 37 accounts, enumerate-tagged-buckets runs 37 copies in parallel.
Our second lambda, enumerate-tagged-buckets, uses a very cool little feature: the Resource Groups Tagging API. Instead of having to scan every bucket to find the tagged ones, we can find all tagged buckets with a single API call! This dramatically improves efficiency and, I hate to admit, I had never used it before this lab.
This creates a list of all buckets tagged with a key of “classification” and value of “sensitive”.
We iterate over that list, and invoke our already-created security-auto-s3 lambda function once for each bucket. This is the third phase of our fanout and all of these run in parallel.
We directly invoke the function; we don’t need to create a new EventBridge event to trigger it. However, we create a fake event with the account ID and bucket name so we don’t need to update our existing code!
security-auto-s3 works exactly the same as if we triggered it from EventBridge, thanks to passing in the synthetic event with the fields it needs to find the bucket.
Even in a large environment this should run in under 30 seconds.
Cool, eh? This architecture allows us to run a time-based scan with a bunch of parallel resources which don’t even run until the clock triggers them. Then they fan out and invoke the next layer of the stack, as efficiently as possible. Here’s what our 3-tier fanout architecture looks like:

But does it really scale? What else do I need to know?
While this pattern scales to even massive environments, this is about the simplest possible implementation, and I would modify it if I were operating it in an enterprise:
There is no error handling outside some basics in the functions. And you won’t see those if you aren’t checking the logs.
At a large enough scale you might hit service limits (AWS limits the number of simultaneous API calls on a service), and we don’t handle these errors or attempt retries (well, until 12 hours later).
There is no state management, which is one of the main ways we manage errors in event-driven architectures.
We don’t track or log changes, outside of what you find in CloudTrail logs.
All those are manageable problems, using tools we mix and match like EventBridge, DynamoDB (or ElasticSearch), Simple Queue Service, and Simple Notification Service; and we may get to those someday.
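Error handling is the first thing I’d bolt on. As a taste of what that looks like, here’s a minimal retry-with-backoff helper (my own sketch, not part of the lab) you could wrap around the lambda_client.invoke calls. In real code you’d catch botocore’s ClientError and inspect the error code for throttling; here the exception type is injectable so the example stays self-contained:

```python
import random
import time

def with_backoff(call, retryable=(Exception,), attempts=4, base_delay=0.5):
    """Retry `call` on retryable errors, doubling the delay each time."""
    for attempt in range(attempts):
        try:
            return call()
        except retryable:
            if attempt == attempts - 1:
                raise  # out of retries: surface the error to the caller
            # 0.5s, 1s, 2s... plus jitter so parallel lambdas don't retry in lockstep
            time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))
```

Usage would look something like `with_backoff(lambda: lambda_client.invoke(...), retryable=(ClientError,))`.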
Or just go buy a commercial tool. This is very similar to how our FireMon Cloud Defense platform operates, and most of you in multicloud enterprise environments should strongly consider buying vs. building (and you probably already have something).
Key Lesson Points
Scheduled scans are more difficult than event-driven triggers since you need to manage scale and respect service limits.
A lambda “fanout” pattern scales nearly instantly by running multiple parallel invocations that map directly to the resources you have.
Fanouts work well with different tiers to match scaling requirements.
But don’t forget you might need error handling and state management when using this in an important environment.
The Lab
This would be a lot to build out completely by hand, so I packaged it all up into a CloudFormation template. There are still a few steps to get it up and running, and I’m including all the code here in the blog post so you can review each piece. We have a few steps to make this work:
We need to delegate administration for AWS Organizations to our SecurityOperations account, so it can read the accounts and OUs. As you will see, we use a restrictive resource policy for only the minimal API calls required.
We must deploy the main template, which loads the 2 new lambda functions from my public S3 bucket; we also create 2 new roles and the EventBridge Rule which runs every 12 hours.
We finish by updating our SecurityAutoremediation role using an updated template in CloudFormation StackSets. This step must occur last, since it allows our new enumerate-tagged-buckets lambda to use the existing role. If you recall, we locked this down in a previous lab so only allowed lambda functions can use the role.
We also add permission to use the tag:GetResources API to find tagged buckets.
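The updated role template isn’t reproduced in full here, but the new permission boils down to a single extra IAM statement along these lines (a sketch of the idea, not the template’s exact contents; tag:GetResources doesn’t support resource-level restrictions, so the wildcard is expected):

```json
{
    "Effect": "Allow",
    "Action": ["tag:GetResources"],
    "Resource": "*"
}
```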
Since all this deploys at once, here is the code for the enumerate-accounts lambda function, which recursively walks the OU tree (starting at our Production OU) to find every account, then invokes enumerate-tagged-buckets for each one. This is the first step of our fanout:
import os
import json

import boto3

org_client = boto3.client('organizations')
lambda_client = boto3.client('lambda')

def lambda_handler(event, context):
    production_ou = os.environ['production_ou']
    account_ids = []

    # Recursively collect all account IDs under the given OU
    def collect_accounts(parent_ou):
        # List accounts directly under this OU
        paginator = org_client.get_paginator('list_accounts_for_parent')
        for page in paginator.paginate(ParentId=parent_ou):
            for acct in page['Accounts']:
                account_ids.append(acct['Id'])
        # List child OUs and recurse
        ou_paginator = org_client.get_paginator('list_organizational_units_for_parent')
        for ou_page in ou_paginator.paginate(ParentId=parent_ou):
            for ou in ou_page['OrganizationalUnits']:
                collect_accounts(ou['Id'])

    collect_accounts(production_ou)

    # Fan-out: invoke the 'enumerate-tagged-buckets' lambda for each account
    for account_id in account_ids:
        lambda_client.invoke(
            FunctionName='enumerate-tagged-buckets',
            InvocationType='Event',  # async
            Payload=json.dumps({'account_id': account_id})
        )

    return {
        'statusCode': 200,
        'body': f'Invoked enumerate-tagged-buckets for {len(account_ids)} accounts.'
    }
And here’s the code for enumerate-tagged-buckets, which finds all tagged buckets and then runs security-auto-s3 on each of them. This is the second stage of the fanout, and security-auto-s3 is the third. Look for the tag:GetResources call — that’s the magic which saves us from having to list all buckets and then look for all the tagged ones:
import json

import boto3

REGIONS = ["us-east-1", "us-west-2"]

def assume_role(account_id, region):
    sts = boto3.client('sts')
    role_arn = f"arn:aws:iam::{account_id}:role/SecurityOperations/SecurityAutoremediation"
    response = sts.assume_role(
        RoleArn=role_arn,
        RoleSessionName="FindSensitiveBucketsSession"
    )
    creds = response['Credentials']
    return boto3.client(
        'resourcegroupstaggingapi',
        region_name=region,
        aws_access_key_id=creds['AccessKeyId'],
        aws_secret_access_key=creds['SecretAccessKey'],
        aws_session_token=creds['SessionToken']
    )

def lambda_handler(event, context):
    account_id = event['account_id']
    lambda_client = boto3.client('lambda')
    found_buckets = set()

    for region in REGIONS:
        tagging_client = assume_role(account_id, region)
        paginator = tagging_client.get_paginator('get_resources')
        for page in paginator.paginate(
            ResourceTypeFilters=['s3'],
            TagFilters=[{'Key': 'classification', 'Values': ['sensitive']}]
        ):
            for resource in page['ResourceTagMappingList']:
                arn = resource['ResourceARN']
                # S3 bucket ARN format: arn:aws:s3:::bucket-name
                bucket_name = arn.split(":::")[-1]
                if bucket_name not in found_buckets:
                    found_buckets.add(bucket_name)
                    # Mimic a CloudTrail CreateBucket event (minimal fields)
                    payload = {
                        "account": account_id,
                        "detail": {
                            "resources": [
                                {
                                    "type": "AWS::S3::Bucket",
                                    "ARN": arn
                                }
                            ],
                            "requestParameters": {
                                "bucketName": bucket_name
                            },
                            "eventSource": "s3.amazonaws.com",
                            "eventName": "CreateBucket",
                            "recipientAccountId": account_id,
                            "accountId": account_id
                        }
                    }
                    lambda_client.invoke(
                        FunctionName="security-auto-S3",
                        InvocationType="Event",  # async
                        Payload=json.dumps(payload)
                    )

    return {
        "statusCode": 200,
        "body": f"Invoked security-auto-S3 for {len(found_buckets)} sensitive buckets in account {account_id}."
    }
Okay, let’s jump into the lab:
Video Walkthrough
Step-by-Step
Start in your Sign-in portal > CloudSLAW > AdministratorAccess > Organizations. Copy/Paste the Workloads > Prod OU ID and the SecurityOperations Account ID. You will need both.


Now we need to delegate administration for Organizations to our SecurityOperations account. Still with Organizations > Settings > Delegate administrator for AWS services > Delegate:

Then copy this JSON and paste into the window > Replace ACCOUNTID with your SecurityOperations account ID:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Statement",
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::ACCOUNTID:root"},
            "Action": [
                "organizations:ListAccountsForParent",
                "organizations:ListOrganizationalUnitsForParent"
            ],
            "Resource": [
                "*"
            ]
        }
    ]
}

Notice that we only include 2 permissions. This is all our code needs (for now) to find child accounts and OUs when given a parent. We can walk down the tree, but we can’t climb up. Maybe we’ll change that later, maybe not. I’m kind of whimsical that way.
Close the tab > Sign-in Portal > SecurityOperations > AdministratorAccess > CloudFormation > Create stack:

Specify template https://cloudslaw.s3-us-west-2.amazonaws.com/lab57.template and name it S3Scanner > paste your Prod OU ID > click through the rest and Create stack:


Let the stack deploy completely!!!!
This creates:
A role for each lambda function
The enumerate-accounts lambda function
With the Prod OU ID as an environment variable. This defines the “top” of the tree to scan.
The enumerate-tagged-buckets lambda function
The EventBridge rule to run every 12 hours
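The template wires the schedule up for you; for reference, here’s a rough boto3 sketch of what that rule amounts to (the names here are my own placeholders, not necessarily what the template uses):

```python
RULE_NAME = "scan-every-12-hours"  # placeholder name

def build_rule_args():
    """Arguments for events.put_rule: a fixed-rate schedule on the default bus."""
    return {
        "Name": RULE_NAME,
        "ScheduleExpression": "rate(12 hours)",
        "State": "ENABLED",
    }

def wire_up_schedule(function_arn):
    import boto3  # local import: only needed when actually talking to AWS
    events = boto3.client("events")
    events.put_rule(**build_rule_args())
    # Point the rule at the first-stage lambda; the function also needs a
    # resource policy (lambda add_permission) allowing events.amazonaws.com
    events.put_targets(
        Rule=RULE_NAME,
        Targets=[{"Id": "fanout-entry", "Arn": function_arn}],
    )
```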
Now we must wait because our next step will add permission for the enumerate-tagged-buckets lambda function to use our role in the target accounts we created for security-auto-s3. Remember we locked that down so only allowed lambda functions could use it? Yep, time to … uh … allow it.
Did I waste enough time? Good, now we need to go into CloudFormation > StackSets > AutoremediationRole > copy the OU ID:

Then go to Actions > Edit StackSet details:

Replace current template > paste this URL https://cloudslaw.s3-us-west-2.amazonaws.com/lab57-role.template > Next

This update adds permission for the tag:GetResources action, and allows our new lambda to use the role.
Click Next on the page with the name, then Paste in the OU ID you just copied > Add all regions > Next > Submit (which is on the next page):

That should roll out quickly; then it’s time to test!
To test we will run a manual test and review the logs. Go to Lambda > Functions > enumerate-accounts:

Then Test > Test (we don’t need a test event because this is triggered by the clock and doesn’t need any input):

This enumerates all the OUs and accounts, then triggers the enumerate-tagged-buckets function once for every account, then that finds all buckets tagged “sensitive” and runs security-auto-s3. So to see if it works we can skip to the end and check the logs.
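If you prefer the SDK over the console’s Test button, the same manual kick is a single synchronous invoke. A sketch (run with credentials for the SecurityOperations account; the helper just shapes the arguments so the AWS call stays one line):

```python
import json

def build_invoke_args(function_name="enumerate-accounts"):
    """Arguments for lambda.invoke: synchronous call with an empty event."""
    return {
        "FunctionName": function_name,
        "InvocationType": "RequestResponse",  # wait for the result
        "Payload": json.dumps({}),  # the function ignores its input
    }

def run_manual_test():
    import boto3  # local import: only needed when actually calling AWS
    client = boto3.client("lambda")
    resp = client.invoke(**build_invoke_args())
    return json.loads(resp["Payload"].read())
```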
Go to CloudWatch > Log Groups > /aws/lambda/security-auto-s3. Click the most recent log stream (it should match the current day) and you should see it ran successfully. This thing is so lightning fast that if you don’t see logs by the time you read all this and click around, odds are something went wrong, so, uh, try again? (Or ask me for help):


Lab Key Points
You can find all tagged resources without enumerating them individually, using tag:GetResources.
Before updating permissions you need to create the entity that will get the permissions; otherwise AWS will error out.
We can manually synthesize an event with the same structure as a CloudTrail event to trigger our existing function without modification.
-Rich