Jul 11, 2024
Querying AWS CloudTrail Logs with Athena in AWS Organizations: Setup, Use and Challenges
Background
Over the years, we have helped organisations make sense of their CloudTrail logs when the need arises to query for specific events or to aggregate and report on the data. Our tool of choice is generally Amazon Athena to accomplish this.
It is quick to set-up and will have you running queries in minutes.
Amazon Athena provides a serverless, interactive query service that enables users to analyse large amounts of data stored in Amazon Simple Storage Service (S3) buckets using Structured Query Language (SQL).
One of Athena’s most potent features is partition projection, which can be particularly useful when working with AWS CloudTrail logs across an AWS Organization’s trail.
In this blog post we describe how to set-up Athena and query the logs while also pointing out some caveats when using Athena.
Understanding Athena Partition Projection
Partition projection in Athena is a method to speed up query processing and automate partition management for highly partitioned tables. Instead of performing time-consuming and resource intensive metadata scanning - via AWS Glue - and lookups in the AWS Glue Data Catalog, Athena calculates partition values and locations using the table properties that you configure and the predicates (where clause conditions) that you provide in the queries. This approach can significantly reduce overheads and query runtime.
How to Use Partition Projection with CloudTrail Logs
To leverage Athena partition projection for CloudTrail logs, you will need to:
- Configure Athena Prerequisites: Set-up a query result location in an Amazon S3 bucket
- Create an External Table: Define your CloudTrail table in Athena, specifying the schema that matches your CloudTrail logs and the S3 bucket location where they are stored.
- Configure Partition Projection: Set up partition projection properties for your table in AWS Glue, defining the ranges of partition values and projection types for each partition column.
- Query Your Logs: Run SQL queries against your CloudTrail logs using Athena, benefiting from the reduced query runtime and automated partition management.
This post will show you how for an AWS Organization trail.
Finding the CloudTrail
In an AWS Organization the organisational CloudTrail is defined in the AWS management account (the payer account). When using Control Tower, the CloudTrail is defined out-of-the box to save events in JSON format to an S3 bucket called aws-controltower-logs-<Log Archive account id>-<Control Tower home region>
in the Log Archive account.
This account is typically also where we define the Athena table to query the data.
Configuring Athena Prerequisites
Athena requires an S3 bucket in order to run queries. For instance, by creating an S3 bucket with a name such as athena-query-results-<account id>-<region>
and configure Athena to use this bucket from the Athena Console prior to attempting to run the first query.
Defining the Table
The AWS-provided Athena documentation already describes parts of the puzzle to define the table on the page Querying AWS CloudTrail logs.
However, the documentation does not show how to set-up partition projection for an AWS Organizations CloudTrail.
The following Data Definition Language (DDL) snippet provides a working table definition with the caveat that the following values must be replaced with real AWS Organization values:
123456789012
: The account id where the CloudTrail S3 bucket existseu-west-1
: The region where the CloudTrail S3 bucket existso-123456789a
: The Organizations id of the AWS Organization
CREATE EXTERNAL TABLE cloudtrail_logs(
eventVersion STRING,
userIdentity STRUCT<
type: STRING,
principalId: STRING,
arn: STRING,
accountId: STRING,
invokedBy: STRING,
accessKeyId: STRING,
userName: STRING,
sessionContext: STRUCT<
attributes: STRUCT<
mfaAuthenticated: STRING,
creationDate: STRING>,
sessionIssuer: STRUCT<
type: STRING,
principalId: STRING,
arn: STRING,
accountId: STRING,
userName: STRING>,
ec2RoleDelivery:string,
webIdFederationData: STRUCT<
federatedProvider: STRING,
attributes: map<string,string>
>
>
>,
eventTime STRING,
eventSource STRING,
eventName STRING,
awsRegion STRING,
sourceIpAddress STRING,
userAgent STRING,
errorCode STRING,
errorMessage STRING,
requestparameters STRING,
responseelements STRING,
additionaleventdata STRING,
requestId STRING,
eventId STRING,
readOnly STRING,
resources ARRAY<STRUCT<
arn: STRING,
accountId: STRING,
type: STRING>>,
eventType STRING,
apiVersion STRING,
recipientAccountId STRING,
serviceEventDetails STRING,
sharedEventID STRING,
vpcendpointid STRING,
eventCategory STRING,
tlsDetails struct<
tlsVersion:string,
cipherSuite:string,
clientProvidedHostHeader:string>
)
PARTITIONED BY (
`origin_region` string,
`origin_account` string,
`delivered_timestamp` string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS INPUTFORMAT 'com.amazon.emr.cloudtrail.CloudTrailInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://aws-controltower-logs-123456789012-eu-west-1/o-123456789a/AWSLogs/o-123456789a/'
TBLPROPERTIES (
'projection.enabled'='true',
'projection.origin_account.type'='injected',
'projection.origin_region.type'='injected',
'projection.delivered_timestamp.type'='date',
'projection.delivered_timestamp.format'='yyyy/MM/dd',
'projection.delivered_timestamp.interval'='1',
'projection.delivered_timestamp.interval.unit'='DAYS',
'projection.delivered_timestamp.range'='2023/01/01,NOW',
'storage.location.template'='s3://aws-controltower-logs-123456789012-eu-west-1/o-123456789a/AWSLogs/o-123456789a/${origin_account}/CloudTrail/${origin_region}/${delivered_timestamp}')
The table is partitioned by 3 columns:
origin_region
: The origin region of the eventsorigin_account
: The origin account id of the eventsdelivered_timestamp
: The timestamp of the CloudTrail S3 object
Testing the Table
The following query tests that the table does indeed work and returns one row. Note that we chose a partition type of injected
for the origin_region
and origin_account
columns and as a result we must include both columns in the query (or we receive a CONSTRAINT_VIOLATION
error).
select *
from cloudtrail_logs
where origin_account in ('123456789012')
and origin_region in ('eu-west-1')
and delivered_timestamp between '2024/07/10' and '2024/07/10'
limit 1
As will be seen later, it’s possible to make different partition projection choices. Each choice has its own pros and cons. The choice of injected
was made for simplicity, i.e. no need to provide a list of possible values in the table definition.
Constraints and Issues
While partition projection offers many benefits, there are some constraints and issues to be aware of. These are discussed in the following sections.
Pick the JSON SerDe with care
Athena can use several serializer/deserializer’s to process JSON data. There are in fact 3 possible SerDe’s that can be used for CloudTrail logs.
The observant reader may have noticed that the SerDe used in the above table definition example (org.openx.data.jsonserde.JsonSerDe
) is not the SerDe suggested in the Athena CloudTrail documentation.
Over the years, Athena JSON SerDe’s and CloudTrail Logs have suffered from various issues that have caused errors during queries or incomplete results when querying the logs.
Identifying that there is an issue in the first place can be challenging when results are returned. Do the rows or counts match actual events - how will you know?
One of the latest problems we encountered was with finding EC2 RunInstances
events over a period. It turned out that only 1/5th of the events were returned. The query failed quietly. We found whole CloudTrail events files with RunInstances
calls that were not included in the result set.
Therefore, trust but verify the results. If you are not sure that the results are correct and you are reasonably confident that the query and table definition is sound, you might try another SerDe to see if this makes a difference. In case of a difference we recommend contacting AWS Support in the first instance.
The three CloudTrail JSON SerDe’s are:
org.apache.hive.hcatalog.data.JsonSerDe
: The AWS recommended one. This was the SerDe quietly failing for us when attempting to queryRunInstances
events, which returned incomplete results (as it turns out, due to the added column definition forwebIdFederationData
- verified with AWS Support)com.amazon.emr.hive.serde.CloudTrailSerde
: A legacy SerDe that according to AWS does not curently support newer CloudTrail columnsorg.openx.data.jsonserde.JsonSerDe
: We generally use this. We are yet to encounter a major issue when querying CloudTrail logs
There are different ways to partition columns
As alluded to previously, there are different ways to project partitions.
In the above example we chose injected
for the origin_account
and origin_region
columns while the delivered_timestamp
column was projected as date
. We did this for DDL definition simplicity. However, the downside to this approach is that both origin_
columns must be included in the predicates in the where clause with either =
or IN
conditions and actual values.
Other partition projection options include:
- Partition the
origin_account
andorigin_region
columns asenum
with'projection.origin_account.values'='123456789012,234567890123,...', 'projection.origin_region.type'='enum', 'projection.origin_region.values'='eu-west-1,eu-west-2,...',
(...
for brevity, see below for how to generate a full list). The downside to this approach is that all column values that you wish to query on must be present in each list. - Partition the
origin_account
column as aninteger
with'projection.origin_account.type'='integer', 'projection.origin_account,range'='000000000000,999999999999', 'projection.origin_account.digits'='12'
. However, the downside to this approach is that the query optimiser will draw incorrect assumptions about continuity of account id’s where there is none and this will affect query performance - and therefore cost - considerably. It is not recommended to do this.
We generally tend to favour the injected
partition projection when we know what AWS accounts and regions to query within the AWS Organization.
When there is a need to query across all accounts or regions, we may, from time to time, use enum
projection.
And sometimes we create tables for both options.
Note You can use the following AWS CLI and jq
commands to generate the AWS accounts and regions lists required for enum partitioning:
# Generate list of active regions
aws account list-regions | jq -rj '.Regions[] | .RegionName + ","'
# Generate list of AWS accounts
aws organizations list-accounts | jq -rj ' .Accounts[] | .Id + ","'
Consider delivery delays when querying by event time.
The delivered_timestamp
is a timestamp annotated to the log file at around the time it’s written to S3. It spans typically minutes of log data. Therefore, if looking for events for a particular time period, it is necessary to expand the delivered_timestamp
to a wider range than the eventTime
, which is the time of the event itself, in order to ensure that all relevant results are returned.
Consider the following query, which uses delivered_timestamp
across 2 consecutive days for partition projection and then queries for a full day of data by eventtime
:
select min(eventtime), max(eventtime)
from cloudtrail_logs
where origin_account in ('123456789012')
and origin_region in ('eu-west-1')
and delivered_timestamp between '2024/07/10' and '2024/07/11'
and eventtime between '2024-07-10T00:00:00Z' and '2024-07-11T00:00:00Z'
Had we instead specified delivered_timestamp='2024/07/10'
we would have likely lost the last 5+ minutes worth of data for the day.
Cost and Time Considerations
Amazon Athena charges based on the amount of data scanned by queries - and so does S3. In addition, the amount of data scanned directly impacts the duration of a query. Therefore, whenever possible include all of the partition projection columns in the query as predicates. If there is a need to query across multiple regions, accounts or days, use the IN
and BETWEEN
conditions accordingly.
Extract event columns from JSON
The format underpinning CloudTrail logs is JSON. When querying the table, the most interesting data - after accounting for event source (AWS service) and event name (API call) - is typically that embedded within the requestparameters
and responseelements
columns, both of which are formatted in JSON.
The two columns contain an extract of the input parameters to an AWS API call as well as an extract of its response. Note that some information - such as secret values - may be absent.
To extract scalar values by JSON path, use the built-in function json_extract_scalar
.
For instance, the following query identifies all non-AWS service AssumeRole calls (including with SAML and Web Identity) that occurred over the past month in 2 regions:
select json_extract_scalar(requestparameters,'$.roleArn') roleArn,
json_extract_scalar(requestparameters,'$.roleSessionName') roleSessionName,
json_extract_scalar(requestparameters,'$.durationSeconds') durationSeconds,
json_extract_scalar(responseelements,'$.assumedRoleUser.arn') assumedRoleUserArn,
json_extract_scalar(responseelements,'$.subject') subject, *
from cloudtrail_logs
where origin_region in ('us-east-1','eu-west-1')
and delivered_timestamp between '2024/07/01' and '2024/08/01'
and eventtime between '2024-07-01T00:00:00Z' and '2024-08-01T00:00:00Z'
and eventsource='sts.amazonaws.com'
and eventname like 'AssumeRole%'
and useridentity.type!='AWSService'
Consider creating views for common queries
Views are great for hiding away complexity and to make queries and joins easier to write (the other option is using the WITH
clause).
Here is an example based on the previous AssumeRole query:
create or replace view user_assume_role as
select json_extract_scalar(requestparameters,'$.roleArn') roleArn,
json_extract_scalar(requestparameters,'$.roleSessionName') roleSessionName,
json_extract_scalar(requestparameters,'$.durationSeconds') durationSeconds,
json_extract_scalar(responseelements,'$.assumedRoleUser.arn') assumedRoleUserArn,
json_extract_scalar(responseelements,'$.subject') subject, *
from cloudtrail_logs
where eventsource='sts.amazonaws.com'
and eventname like 'AssumeRole%'
and useridentity.type!='AWSService'
You may notice that the partition columns are not included. They will be added at query time for maximum flexibility:
select * from user_assume_role
where origin_region in ('us-east-1','eu-west-1')
and delivered_timestamp between '2024/07/01' and '2024/08/01'
and eventtime between '2024-07-01T00:00:00Z' and '2024-08-01T00:00:00Z'
The results of the query using the view will be the same as the query without the view.
The view can now be reused for any purpose.
One additional benefit is that the added view columns (those derived from the JSON columns) can now also be used for conditions such as:
select * from user_assume_role
where origin_region in ('us-east-1','eu-west-1')
and delivered_timestamp between '2024/07/01' and '2024/08/01'
and eventtime between '2024-07-01T00:00:00Z' and '2024-08-01T00:00:00Z'
and subject='someuser@corp.com'
Conclusion
Using Amazon Athena with partition projection to query CloudTrail logs offers an quick way to analyse operational activity across an AWS Organization trail. Getting started requires very little setup and configuration.
By understanding the constraints and optimising for time and cost you can harness the full potential of Athena for your log analysis needs.
Note that because Amazon Athena is a managed, serverless service it is hard to debug problems and will from time to time require AWS Support’s help to troubleshoot a problem. Also, please verify the returned results - various bugs have previously returned empty or incomplete results without warning.