Nov 01, 2025
Processing over 50 Million Documents with AWS Serverless: An Intelligent Document Processing Success Story

Background
Recently we worked with a UK financial services customer to build an Intelligent Document Processing (IDP) solution that successfully processed more than 50 million documents in just a few weeks. The solution was built entirely on AWS serverless services and went from initial architecture to production-ready system in under 4 months.
The customer needed to extract text, classify documents and pull out key information from a massive backlog of historical documents whilst continuing to process new documents arriving daily from third-party systems.
Speed was critical. The customer needed results quickly and we needed to iterate fast to add incremental value at each stage.
This blog post describes the architecture we built, the design decisions we made and why AWS serverless services were the perfect fit for this use case.
The Challenge
The customer had accumulated tens of millions of documents over several years. Each document needed:
- Text extracted and stored
- Classification by document type using machine learning
- Content extracted intelligently from the document
The documents arrived from multiple source systems via a transfer service that uploaded files to an S3 bucket. From there, we needed to build a robust, scalable workflow that could handle the initial backlog whilst processing new documents as they arrived.
Why Serverless?
We chose AWS serverless services for several reasons:
- Pay-per-use: No need to provision or pay for idle infrastructure
- Auto-scaling: Automatic scaling to handle variable workloads from zero to thousands of documents per minute
- Reduced operational overhead: No servers to patch or maintain
- Speed to market: Focus on business logic rather than infrastructure management
- Cost-effective at scale: Particularly when using services like SQS and Lambda with pay-per-request pricing
The Architecture
We designed the solution in three stages, with each stage building on the previous one. This allowed us to deliver value incrementally whilst de-risking the overall implementation.

Stage 1: Text Extraction and Metadata Storage
Timeline: 2 months from architecture to production
The first stage established the foundation:
- Transfer service collects documents from multiple source systems and uploads to an S3 bucket
- S3 Event Notification triggers and sends message to an SQS queue
- Lambda function processes messages from the queue
- Lambda extracts document text and metadata
- Document contents and metadata stored in DynamoDB
- Failed messages route to a central Dead Letter Queue (DLQ)
- Separate DLQ processor Lambda adds error messages and status to DynamoDB
This stage was critical to get right. We needed to ensure that every document was tracked, that failures were captured and that we had full visibility into the processing workflow.
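To make the Stage 1 flow concrete, here is a minimal sketch of an SQS-triggered Lambda handler: it unwraps the S3 event notification from each queue message, extracts text and writes the result to DynamoDB. The `extract_text` helper, table schema and attribute names are illustrative placeholders, not the customer's actual implementation. It uses SQS partial batch responses so only failed messages are retried and eventually dead-lettered by the queue's redrive policy.

```python
import json
import os
from datetime import datetime, timezone


def build_item(bucket: str, key: str, text: str) -> dict:
    """Build the DynamoDB item for a newly processed document.

    Attribute names here are illustrative, not the customer's schema.
    """
    return {
        "document_id": f"{bucket}/{key}",
        "source_bucket": bucket,
        "source_key": key,
        "extracted_text": text,
        "stage1_completed_at": datetime.now(timezone.utc).isoformat(),
        "status": "STAGE1_COMPLETE",
    }


def handler(event, context):
    """SQS-triggered Lambda: each record wraps an S3 event notification.

    Returns partial batch failures (requires ReportBatchItemFailures on the
    event source mapping) so only failed messages are retried.
    """
    import boto3  # imported lazily so the pure helpers are testable offline

    table = boto3.resource("dynamodb").Table(os.environ["TABLE_NAME"])
    failures = []
    for record in event["Records"]:
        try:
            s3_event = json.loads(record["body"])
            for s3_record in s3_event["Records"]:
                bucket = s3_record["s3"]["bucket"]["name"]
                key = s3_record["s3"]["object"]["key"]
                text = extract_text(bucket, key)  # hypothetical extraction helper
                table.put_item(Item=build_item(bucket, key, text))
        except Exception:
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```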
Stage 2: Document Classification
Timeline: 1 additional month
The second stage added intelligent classification:
- Stage 1 Lambda emits message to Stage 2 SQS queue upon successful processing
- Classification Lambda processes messages from the queue
- Lambda invokes a SageMaker endpoint hosting a fine-tuned BERT model
- SageMaker endpoint configured with auto-scaling to handle variable load
- Classification results written back to DynamoDB as new attributes
- Failed messages route to the same central DLQ
For historical documents already processed by Stage 1, we built a Glue job that hydrated the Stage 2 queue with messages for documents that needed classification.
The use of SageMaker allowed us to deploy a custom model that had been fine-tuned on the customer’s specific document types. The auto-scaling ensured we could handle peak loads whilst keeping costs down during quieter periods.
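As a sketch of the classification step, the snippet below invokes a SageMaker endpoint via the runtime API and picks the top-scoring label. The `{"labels": [...], "scores": [...]}` response shape is an assumption for illustration; the real inference contract depends on how the BERT model is served.

```python
import json


def pick_label(prediction: dict) -> tuple:
    """Pick the highest-scoring label from the endpoint response.

    Assumes a {"labels": [...], "scores": [...]} response shape; the real
    contract depends on the inference container.
    """
    best = max(range(len(prediction["scores"])), key=lambda i: prediction["scores"][i])
    return prediction["labels"][best], prediction["scores"][best]


def classify(text: str, endpoint_name: str) -> tuple:
    """Invoke the fine-tuned BERT endpoint via the SageMaker runtime."""
    import boto3  # lazy import keeps pick_label testable offline

    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps({"inputs": text}),
    )
    return pick_label(json.loads(response["Body"].read()))
```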
Stage 3: Content Extraction
Timeline: 1 additional month
The third stage added semantic understanding:
- Stage 2 Lambda emits message to Stage 3 SQS queue upon successful classification of specific document types
- Content extraction Lambda processes messages from the queue
- Lambda invokes Amazon Bedrock with the Claude Haiku model
- Claude extracts relevant content from documents
- Extracted content written back to DynamoDB as new attributes
- Failed messages route to the same central DLQ
Critically, only documents classified as specific document types in Stage 2 progressed to Stage 3. This filtering ensured that content extraction was only performed on documents where it would add value. The Glue hydration job similarly queued only correctly classified documents to the Stage 3 queue.
Using Bedrock with Claude Haiku provided excellent results for content extraction. The model’s ability to understand context and extract structured information from unstructured text was exactly what we needed.
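A minimal sketch of the Bedrock invocation follows. It assumes the Anthropic messages-API request shape and an illustrative Haiku model ID; the prompt, field names and JSON-only reply convention are ours for illustration, not the production prompt.

```python
import json

# Illustrative prompt template, not the production prompt.
EXTRACTION_PROMPT = (
    "Extract the following fields from the document below and reply with "
    "JSON only: {fields}.\n\nDocument:\n{text}"
)


def build_request(text: str, fields: list, max_tokens: int = 1024) -> dict:
    """Build an Anthropic messages-API request body for Bedrock."""
    return {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": max_tokens,
        "messages": [
            {
                "role": "user",
                "content": EXTRACTION_PROMPT.format(
                    fields=", ".join(fields), text=text
                ),
            }
        ],
    }


def extract_content(text: str, fields: list) -> dict:
    """Invoke Claude Haiku on Bedrock and parse its JSON reply."""
    import boto3  # lazy import keeps build_request testable offline

    bedrock = boto3.client("bedrock-runtime")
    response = bedrock.invoke_model(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",  # illustrative model ID
        body=json.dumps(build_request(text, fields)),
    )
    reply = json.loads(response["body"].read())
    return json.loads(reply["content"][0]["text"])
```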
Deployment Pipeline Architecture
Beyond the document processing workflow itself, we implemented a robust CI/CD pipeline using AWS CDK and CodePipeline to manage deployments across multiple environments and AWS accounts.

Multi-Account Strategy
We adopted a multi-account AWS strategy following best practices:
- Shared Services Account: Hosted the CodeCommit repository, CDK pipeline (CodePipeline) and cross-account deployment orchestration
- NonProd Account: Contained both Dev and UAT environments for development and user acceptance testing
- Prod Account: Hosted the production environment with appropriate guardrails and separation
Pipeline Flow
The CDK pipeline automatically triggered on commits to the CodeCommit repository and followed this deployment sequence:
Dev Environment Deployment (NonProd Account)
- KMS stack: Encryption keys for the environment
- Storage stack: S3 buckets and DynamoDB table with GSIs
- Processing stack: Lambda functions, SQS queues and associated IAM roles
- Dashboard stack: CloudWatch dashboards for monitoring
Integration Testing
- Separate CloudFormation stack deployed integration tests in the NonProd account
- Tests validated the complete workflow end-to-end with sample documents
- Lambda function in the Shared Services account performed cross-account status checks
- Pipeline only proceeded if all tests passed; otherwise the deployment failed
UAT Environment Deployment (NonProd Account)
- Same stack structure as Dev (KMS, Storage, Processing, Dashboard)
- Deployed only after successful Dev integration tests
- Used for user acceptance testing before production release
Prod Environment Deployment (Prod Account)
- Final deployment to production following successful UAT validation
- Same stack structure with production-grade configuration
- Required manual approval gate in the pipeline
CloudFormation Stack Organisation
Each environment consisted of four CloudFormation stacks deployed in sequence:
- KMS Stack: Customer-managed KMS keys for encrypting data at rest
- Storage Stack: S3 buckets (with encryption) and DynamoDB table with three GSIs
- Processing Stack: All Lambda functions, SQS queues, DLQ, Glue jobs and IAM roles/policies
- Dashboard Stack: CloudWatch dashboards providing visibility into processing metrics and system health
This modular stack design allowed us to update individual components independently whilst maintaining clear dependencies between infrastructure layers.
Cross-Account Integration Testing
The integration testing mechanism was particularly important for ensuring quality:
- Integration test stack created test S3 objects, triggered the workflow and validated results
- Test Lambda in Shared Services account assumed a cross-account IAM role in NonProd
- Cross-account checks queried DynamoDB to verify test completion
- Pipeline advancement required the test to report success status
This approach gave us confidence that each deployment was fully functional before promoting to the next environment.
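The cross-account check can be sketched as below: the test Lambda in Shared Services assumes a role in NonProd, then queries DynamoDB with the temporary credentials. Role, table and attribute names are placeholders.

```python
def cross_account_role_arn(account_id: str, role_name: str) -> str:
    """ARN of the role the test Lambda assumes in the NonProd account."""
    return f"arn:aws:iam::{account_id}:role/{role_name}"


def check_test_document(
    account_id: str, role_name: str, table_name: str, doc_id: str
) -> bool:
    """Assume a cross-account role and verify the test document completed.

    Role, table and attribute names here are illustrative placeholders.
    """
    import boto3  # lazy import keeps cross_account_role_arn testable offline

    creds = boto3.client("sts").assume_role(
        RoleArn=cross_account_role_arn(account_id, role_name),
        RoleSessionName="integration-test-check",
    )["Credentials"]
    dynamodb = boto3.resource(
        "dynamodb",
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )
    item = dynamodb.Table(table_name).get_item(Key={"document_id": doc_id}).get("Item")
    return bool(item) and item.get("status") == "STAGE1_COMPLETE"
```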
Key Design Decisions
SQS vs Step Functions
One question that often arises when building serverless workflows is whether to use SQS or Step Functions for orchestration.
We chose SQS for several reasons:
Concurrency control: SQS allows fine-grained control over Lambda concurrency between stages. This is critical when downstream services (like SageMaker endpoints or Bedrock) have rate limits or when you want to control costs by limiting concurrent executions.
Cost at scale: For high-volume workloads, SQS is significantly cheaper than Step Functions. With 50 million documents each passing through three stages, that is on the order of 150 million state transitions, and Step Functions charges per transition.
Simplicity: Each stage is independent and can be developed, tested and deployed separately. There’s no need for complex state machine definitions.
Natural backpressure: SQS provides natural backpressure when downstream systems are under load. Messages remain in the queue until they can be processed.
Replay capability: The Glue hydration jobs gave us the ability to replay processing for any stage, which proved invaluable during development and testing.
The main disadvantage of SQS compared to Step Functions is the lack of built-in visibility into the overall workflow state of a given document. However, we addressed this by storing comprehensive status information in DynamoDB and building CloudWatch dashboards to monitor progress.
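One way to implement the concurrency control described above, assuming each stage's queue drives its Lambda through an event source mapping, is SQS maximum concurrency, which caps how many messages one mapping processes at a time without starving other functions the way reserved concurrency can. A hedged sketch:

```python
def scaling_params(mapping_uuid: str, max_concurrency: int) -> dict:
    """Parameters for capping concurrent Lambda invocations from an SQS source.

    SQS maximum concurrency limits how many messages a single event source
    mapping processes at once (the minimum allowed value is 2).
    """
    if max_concurrency < 2:
        raise ValueError("SQS maximum concurrency must be at least 2")
    return {
        "UUID": mapping_uuid,
        "ScalingConfig": {"MaximumConcurrency": max_concurrency},
    }


def cap_stage_concurrency(mapping_uuid: str, max_concurrency: int) -> None:
    """Apply the cap, e.g. to keep Stage 2 within SageMaker endpoint capacity."""
    import boto3  # lazy import keeps scaling_params testable offline

    boto3.client("lambda").update_event_source_mapping(
        **scaling_params(mapping_uuid, max_concurrency)
    )
```

In practice we would set this per stage, with the tightest cap on whichever stage fronts the most rate-limited downstream service.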
Central Dead Letter Queue
Rather than having separate DLQs for each stage, we implemented a single central DLQ with a single Lambda function to handle all failures.
This approach had several benefits:
- Simplified operations: One place to monitor for failures across all stages
- Consistent error handling: The same error handling logic applied regardless of which stage failed
- Better visibility: CloudWatch metrics and alarms could focus on a single DLQ depth metric
The DLQ processor Lambda was responsible for parsing the failed message, determining which stage it came from and updating the DynamoDB record with appropriate error information and status.
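A sketch of that processor is below. It assumes, for illustration, that each stage's producer tagged its messages with `stage` and `document_id` message attributes; the production processor determined the stage by parsing the failed message itself, and the attribute and status names are placeholders.

```python
from datetime import datetime, timezone


def failure_update(record: dict) -> dict:
    """Derive the DynamoDB update for one dead-lettered SQS record.

    Assumes illustrative 'stage' and 'document_id' message attributes set by
    the producing stage; real message parsing was more involved.
    """
    attrs = record.get("messageAttributes", {})
    stage = attrs.get("stage", {}).get("stringValue", "UNKNOWN")
    doc_id = attrs.get("document_id", {}).get("stringValue", "UNKNOWN")
    return {
        "document_id": doc_id,
        "status": f"{stage}_FAILED",
        "error_message": record.get("body", "")[:1024],  # truncate large payloads
        "failed_at": datetime.now(timezone.utc).isoformat(),
    }


def handler(event, context):
    """Central DLQ processor: one function handles failures from every stage."""
    import os
    import boto3  # lazy import keeps failure_update testable offline

    table = boto3.resource("dynamodb").Table(os.environ["TABLE_NAME"])
    for record in event["Records"]:
        # update_item would preserve existing attributes; put_item shown for brevity
        table.put_item(Item=failure_update(record))
```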
DynamoDB Table Design and GSI Strategy
We used a single DynamoDB table to store all document metadata and processing results. Each stage added its own distinct attributes to the item, with no overlap beyond the initial Stage 1 metadata. This design allowed us to track the complete processing lifecycle of each document in one place.
To address read-on-partial-write scenarios and enable efficient querying of completed documents at each stage, we implemented three Global Secondary Indexes (GSIs):
- Stage 1 GSI: Indexed on Stage 1 completion date
- Stage 2 GSI: Indexed on Stage 2 completion date
- Stage 3 GSI: Indexed on Stage 3 completion date
Each completion date attribute was only written when a document successfully completed that stage. This meant we could query for all documents completed at a given stage without returning partially processed documents.
Critically, to mitigate the risk of hot DynamoDB partitions and throttling, we partitioned each GSI using a random partition ID between 0 and 29. This spread the write load across 30 partitions per date, preventing any single partition from becoming a bottleneck during high-volume processing. Queries would fan out across all 30 partitions and merge the results.
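The write-sharding scheme can be sketched as follows; the `shard#date` key format is illustrative, and any scheme that prefixes the completion date with a random shard works the same way.

```python
import random

NUM_PARTITIONS = 30  # random partition IDs 0-29, as described above


def gsi_partition_key(completion_date: str) -> str:
    """Build a write-sharded GSI partition key, e.g. '17#2025-11-01'.

    Writers pick a random shard so the per-date write load spreads evenly
    across 30 partitions instead of hammering one.
    """
    return f"{random.randrange(NUM_PARTITIONS)}#{completion_date}"


def fan_out_keys(completion_date: str) -> list:
    """All 30 partition keys to query (and merge) for one completion date."""
    return [f"{shard}#{completion_date}" for shard in range(NUM_PARTITIONS)]
```

A reader querying "everything completed on 2025-11-01" issues 30 GSI queries, one per key from `fan_out_keys`, and merges the results.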
Observability with PowerTools and X-Ray
We used AWS Lambda PowerTools throughout the solution to provide structured logging, tracing and metrics.
PowerTools gave us:
- Structured logging: JSON-formatted logs that could be easily queried in CloudWatch Logs Insights
- Distributed tracing: Automatic X-Ray tracing, with sampling, to help pinpoint bottlenecks
- Custom metrics: Application-level metrics pushed to CloudWatch without impacting performance
X-Ray sampling was configured to capture a representative sample of traces without incurring excessive costs. During the initial rollout we sampled at a higher rate to catch any issues early, then reduced the sampling rate once we were confident the system was stable.
This level of observability proved invaluable for troubleshooting issues and understanding system behaviour under load.
The Results
The solution successfully processed more than 50 million documents in just a few weeks once all three stages were deployed.
The customer was able to:
- Access extracted text and metadata for all documents
- Query documents by a variety of predicates
- Integrate the processed data into downstream business systems
- Continue processing new documents as they arrived with minimal latency
The serverless architecture meant that costs remained predictable and scaled linearly with usage. During the initial backlog processing, the system automatically scaled up to handle thousands of documents per minute. Once the backlog was cleared, it scaled back down to handle the steady-state load of new documents.
Lessons Learnt
Building an IDP solution at this scale taught us several valuable lessons:
Start simple, iterate quickly: By breaking the solution into three stages, we were able to deliver value early and iterate based on real-world usage patterns.
Serverless is perfect for batch workloads: The ability to scale from zero to hundreds of concurrent executions and back to zero made serverless ideal for processing the historical backlog.
Observability is not optional: PowerTools and X-Ray were essential for understanding system behaviour and troubleshooting issues at scale.
SQS provides excellent value: For high-volume workloads, SQS offers the best combination of cost, performance and operational simplicity.
DynamoDB single-table design works well: Storing all document metadata, processing status and results in a single DynamoDB table simplified the architecture and reduced costs compared to using multiple tables or RDS.
Plan for failure: The central DLQ and comprehensive error handling meant that no documents were lost, even when individual processing steps failed.
Conclusion
Building an Intelligent Document Processing solution doesn’t require complex infrastructure or months of development time. By leveraging AWS serverless services, we were able to go from initial architecture to a production system processing millions of documents in under 4 months.
The key was to start simple, deliver value incrementally in collaboration with the customer and use managed services wherever possible. The combination of S3, SQS, Lambda, DynamoDB, SageMaker and Bedrock provided a robust, scalable and cost-effective foundation for processing more than 50 million documents.
If you’re facing a similar challenge and need help building serverless solutions on AWS, Virtuability can help. We specialise in AWS serverless architectures, AI/ML integration and delivering business value quickly through iterative development.
Visit us at https://www.virtuability.com to learn more about how we help businesses leverage AWS to meet their objectives.