Nov 01, 2025
Processing over 50 Million Documents with AWS Serverless: An Intelligent Document Processing Success Story

Background
Recently we worked with a UK financial services customer to build an Intelligent Document Processing (IDP) solution that successfully processed more than 50 million documents in just a few weeks. The solution was built entirely on AWS serverless services and went from initial architecture to production-ready system in under 4 months.
The customer needed to extract text, classify documents and pull out key information from a massive backlog of historical documents whilst continuing to process new documents arriving daily from third-party systems.
Speed was critical. The customer needed results quickly and we needed to iterate fast to add incremental value at each stage.
This blog post describes the architecture we built, the design decisions we made and why AWS serverless services were the perfect fit for this use case.
The Challenge
The customer had accumulated tens of millions of documents over several years. Each document needed:
- Text extracted and stored
- Classification by document type using machine learning
- Content extracted intelligently from the document
The documents arrived from multiple source systems via a transfer service that uploaded files to an S3 bucket. From there, we needed to build a robust, scalable workflow that could handle the initial backlog whilst processing new documents as they arrived.
Why Serverless?
We chose AWS serverless services for several reasons:
- Pay-per-use: No need to provision or pay for idle infrastructure
- Auto-scaling: Automatic scaling to handle variable workloads from zero to thousands of documents per minute
- Reduced operational overhead: No servers to patch or maintain
- Speed to market: Focus on business logic rather than infrastructure management
- Cost-effective at scale: Particularly when using services like SQS and Lambda with pay-per-request pricing
The Architecture
We designed the solution in three stages, with each stage building on the previous one. This allowed us to deliver value incrementally whilst de-risking the overall implementation.

Stage 1: Text Extraction and Metadata Storage
Timeline: 2 months from architecture to production
The first stage established the foundation:
- Transfer service collects documents from multiple source systems and uploads to an S3 bucket
- S3 Event Notification triggers and sends message to an SQS queue
- Lambda function processes messages from the queue
- Lambda extracts document text and metadata
- Document contents and metadata stored in DynamoDB
- Failed messages route to a central Dead Letter Queue (DLQ)
- Separate DLQ processor Lambda adds error messages and status to DynamoDB
This stage was critical to get right. We needed to ensure that every document was tracked, that failures were captured and that we had full visibility into the processing workflow.
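To make the Stage 1 flow concrete, here is a minimal sketch of an SQS-triggered Lambda handler: it unwraps the S3 event notification from each queue message, extracts text and writes the result to DynamoDB. The `extract_text` helper, table schema and attribute names are illustrative placeholders, not the customer's actual implementation. It uses SQS partial batch responses so only failed messages are retried and eventually dead-lettered by the queue's redrive policy.

```python
import json
import os
from datetime import datetime, timezone


def build_item(bucket: str, key: str, text: str) -> dict:
    """Build the DynamoDB item for a newly processed document.

    Attribute names here are illustrative, not the customer's schema.
    """
    return {
        "document_id": f"{bucket}/{key}",
        "source_bucket": bucket,
        "source_key": key,
        "extracted_text": text,
        "stage1_completed_at": datetime.now(timezone.utc).isoformat(),
        "status": "STAGE1_COMPLETE",
    }


def handler(event, context):
    """SQS-triggered Lambda: each record wraps an S3 event notification.

    Returns partial batch failures (requires ReportBatchItemFailures on the
    event source mapping) so only failed messages are retried.
    """
    import boto3  # imported lazily so the pure helpers are testable offline

    table = boto3.resource("dynamodb").Table(os.environ["TABLE_NAME"])
    failures = []
    for record in event["Records"]:
        try:
            s3_event = json.loads(record["body"])
            for s3_record in s3_event["Records"]:
                bucket = s3_record["s3"]["bucket"]["name"]
                key = s3_record["s3"]["object"]["key"]
                text = extract_text(bucket, key)  # hypothetical extraction helper
                table.put_item(Item=build_item(bucket, key, text))
        except Exception:
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```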
Stage 2: Document Classification
Timeline: 1 additional month
The second stage added intelligent classification:
- Stage 1 Lambda emits message to Stage 2 SQS queue upon successful processing
- Classification Lambda processes messages from the queue
- Lambda invokes a SageMaker endpoint hosting a fine-tuned BERT model
- SageMaker endpoint configured with auto-scaling to handle variable load
- Classification results written back to DynamoDB as new attributes
- Failed messages route to the same central DLQ
For historical documents already processed by Stage 1, we built a Glue job that hydrated the Stage 2 queue with messages for documents that needed classification.
The use of SageMaker allowed us to deploy a custom model that had been fine-tuned on the customer’s specific document types. The auto-scaling ensured we could handle peak loads whilst keeping costs down during quieter periods.
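As a sketch of the classification step, the snippet below invokes a SageMaker endpoint via the runtime API and picks the top-scoring label. The `{"labels": [...], "scores": [...]}` response shape is an assumption for illustration; the real inference contract depends on how the BERT model is served.

```python
import json


def pick_label(prediction: dict) -> tuple:
    """Pick the highest-scoring label from the endpoint response.

    Assumes a {"labels": [...], "scores": [...]} response shape; the real
    contract depends on the inference container.
    """
    best = max(range(len(prediction["scores"])), key=lambda i: prediction["scores"][i])
    return prediction["labels"][best], prediction["scores"][best]


def classify(text: str, endpoint_name: str) -> tuple:
    """Invoke the fine-tuned BERT endpoint via the SageMaker runtime."""
    import boto3  # lazy import keeps pick_label testable offline

    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps({"inputs": text}),
    )
    return pick_label(json.loads(response["Body"].read()))
```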
Stage 3: Content Extraction
Timeline: 1 additional month
The third stage added semantic understanding:
- Stage 2 Lambda emits message to Stage 3 SQS queue upon successful classification of specific document types
- Content extraction Lambda processes messages from the queue
- Lambda invokes Amazon Bedrock with the Claude Haiku model
- Claude extracts relevant content from documents
- Extracted content written back to DynamoDB as new attributes
- Failed messages route to the same central DLQ
Critically, only documents classified as specific document types in Stage 2 progressed to Stage 3. This filtering ensured that content extraction was only performed on documents where it would add value. The Glue hydration job similarly queued only correctly classified documents to the Stage 3 queue.
Using Bedrock with Claude Haiku provided excellent results for content extraction. The model’s ability to understand context and extract structured information from unstructured text was exactly what we needed.
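A minimal sketch of the Bedrock invocation follows. It assumes the Anthropic messages-API request shape and an illustrative Haiku model ID; the prompt, field names and JSON-only reply convention are ours for illustration, not the production prompt.

```python
import json

# Illustrative prompt template, not the production prompt.
EXTRACTION_PROMPT = (
    "Extract the following fields from the document below and reply with "
    "JSON only: {fields}.\n\nDocument:\n{text}"
)


def build_request(text: str, fields: list, max_tokens: int = 1024) -> dict:
    """Build an Anthropic messages-API request body for Bedrock."""
    return {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": max_tokens,
        "messages": [
            {
                "role": "user",
                "content": EXTRACTION_PROMPT.format(
                    fields=", ".join(fields), text=text
                ),
            }
        ],
    }


def extract_content(text: str, fields: list) -> dict:
    """Invoke Claude Haiku on Bedrock and parse its JSON reply."""
    import boto3  # lazy import keeps build_request testable offline

    bedrock = boto3.client("bedrock-runtime")
    response = bedrock.invoke_model(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",  # illustrative model ID
        body=json.dumps(build_request(text, fields)),
    )
    reply = json.loads(response["body"].read())
    return json.loads(reply["content"][0]["text"])
```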
Deployment Pipeline Architecture
Beyond the document processing workflow itself, we implemented a robust CI/CD pipeline using AWS CDK and CodePipeline to manage deployments across multiple environments and AWS accounts.

Multi-Account Strategy
We adopted a multi-account AWS strategy following best practices:
- Shared Services Account: Hosted the CodeCommit repository, CDK pipeline (CodePipeline) and cross-account deployment orchestration
- NonProd Account: Contained both Dev and UAT environments for development and user acceptance testing
- Prod Account: Hosted the production environment with appropriate guardrails and separation
Pipeline Flow
The CDK pipeline automatically triggered on commits to the CodeCommit repository and followed this deployment sequence:
Dev Environment Deployment (NonProd Account)
- KMS stack: Encryption keys for the environment
- Storage stack: S3 buckets and DynamoDB table with GSIs
- Processing stack: Lambda functions, SQS queues and associated IAM roles
- Dashboard stack: CloudWatch dashboards for monitoring
Integration Testing
- Separate CloudFormation stack deployed integration tests in the NonProd account
- Tests validated the complete workflow end-to-end with sample documents
- Lambda function in the Shared Services account performed cross-account status checks
- Pipeline only proceeded if all tests passed; otherwise the deployment failed
UAT Environment Deployment (NonProd Account)
- Same stack structure as Dev (KMS, Storage, Processing, Dashboard)
- Deployed only after successful Dev integration tests
- Used for user acceptance testing before production release
Prod Environment Deployment (Prod Account)
- Final deployment to production following successful UAT validation
- Same stack structure with production-grade configuration
- Required manual approval gate in the pipeline
CloudFormation Stack Organisation
Each environment consisted of four CloudFormation stacks deployed in sequence:
- KMS Stack: Customer-managed KMS keys for encrypting data at rest
- Storage Stack: S3 buckets (with encryption) and DynamoDB table with three GSIs
- Processing Stack: All Lambda functions, SQS queues, DLQ, Glue jobs and IAM roles/policies
- Dashboard Stack: CloudWatch dashboards providing visibility into processing metrics and system health
This modular stack design allowed us to update individual components independently whilst maintaining clear dependencies between infrastructure layers.
Cross-Account Integration Testing
The integration testing mechanism was particularly important for ensuring quality:
- Integration test stack created test S3 objects, triggered the workflow and validated results
- Test Lambda in Shared Services account assumed a cross-account IAM role in NonProd
- Cross-account checks queried DynamoDB to verify test completion
- Pipeline advancement required the test to report success status
This approach gave us confidence that each deployment was fully functional before promoting to the next environment.
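The cross-account check can be sketched as below: the test Lambda in Shared Services assumes a role in NonProd, then queries DynamoDB with the temporary credentials. Role, table and attribute names are placeholders.

```python
def cross_account_role_arn(account_id: str, role_name: str) -> str:
    """ARN of the role the test Lambda assumes in the NonProd account."""
    return f"arn:aws:iam::{account_id}:role/{role_name}"


def check_test_document(
    account_id: str, role_name: str, table_name: str, doc_id: str
) -> bool:
    """Assume a cross-account role and verify the test document completed.

    Role, table and attribute names here are illustrative placeholders.
    """
    import boto3  # lazy import keeps cross_account_role_arn testable offline

    creds = boto3.client("sts").assume_role(
        RoleArn=cross_account_role_arn(account_id, role_name),
        RoleSessionName="integration-test-check",
    )["Credentials"]
    dynamodb = boto3.resource(
        "dynamodb",
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )
    item = dynamodb.Table(table_name).get_item(Key={"document_id": doc_id}).get("Item")
    return bool(item) and item.get("status") == "STAGE1_COMPLETE"
```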
Key Design Decisions
SQS vs Step Functions
One question that often arises when building serverless workflows is whether to use SQS or Step Functions for orchestration.
We chose SQS for several reasons:
Concurrency control: SQS allows fine-grained control over Lambda concurrency between stages. This is critical when downstream services (like SageMaker endpoints or Bedrock) have rate limits or when you want to control costs by limiting concurrent executions.
Cost at scale: For high-volume workloads, SQS is significantly cheaper than Step Functions. With 50 million documents each passing through three stages, that is on the order of 150 million state transitions, and Step Functions charges per transition.
Simplicity: Each stage is independent and can be developed, tested and deployed separately. There’s no need for complex state machine definitions.
Natural backpressure: SQS provides natural backpressure when downstream systems are under load. Messages remain in the queue until they can be processed.
Replay capability: The Glue hydration jobs gave us the ability to replay processing for any stage, which proved invaluable during development and testing.
The main disadvantage of SQS compared to Step Functions is the lack of built-in visibility into the overall workflow state of a given document. However, we addressed this by storing comprehensive status information in DynamoDB and building CloudWatch dashboards to monitor progress.
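One way to implement the concurrency control described above, assuming each stage's queue drives its Lambda through an event source mapping, is SQS maximum concurrency, which caps how many messages one mapping processes at a time without starving other functions the way reserved concurrency can. A hedged sketch:

```python
def scaling_params(mapping_uuid: str, max_concurrency: int) -> dict:
    """Parameters for capping concurrent Lambda invocations from an SQS source.

    SQS maximum concurrency limits how many messages a single event source
    mapping processes at once (the minimum allowed value is 2).
    """
    if max_concurrency < 2:
        raise ValueError("SQS maximum concurrency must be at least 2")
    return {
        "UUID": mapping_uuid,
        "ScalingConfig": {"MaximumConcurrency": max_concurrency},
    }


def cap_stage_concurrency(mapping_uuid: str, max_concurrency: int) -> None:
    """Apply the cap, e.g. to keep Stage 2 within SageMaker endpoint capacity."""
    import boto3  # lazy import keeps scaling_params testable offline

    boto3.client("lambda").update_event_source_mapping(
        **scaling_params(mapping_uuid, max_concurrency)
    )
```

In practice we would set this per stage, with the tightest cap on whichever stage fronts the most rate-limited downstream service.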
Central Dead Letter Queue
Rather than having separate DLQs for each stage, we implemented a single central DLQ with a single Lambda function to handle all failures.
This approach had several benefits:
- Simplified operations: One place to monitor for failures across all stages
- Consistent error handling: The same error handling logic applied regardless of which stage failed
- Better visibility: CloudWatch metrics and alarms could focus on a single DLQ depth metric
The DLQ processor Lambda was responsible for parsing the failed message, determining which stage it came from and updating the DynamoDB record with appropriate error information and status.
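A sketch of that processor is below. It assumes, for illustration, that each stage's producer tagged its messages with `stage` and `document_id` message attributes; the production processor determined the stage by parsing the failed message itself, and the attribute and status names are placeholders.

```python
from datetime import datetime, timezone


def failure_update(record: dict) -> dict:
    """Derive the DynamoDB update for one dead-lettered SQS record.

    Assumes illustrative 'stage' and 'document_id' message attributes set by
    the producing stage; real message parsing was more involved.
    """
    attrs = record.get("messageAttributes", {})
    stage = attrs.get("stage", {}).get("stringValue", "UNKNOWN")
    doc_id = attrs.get("document_id", {}).get("stringValue", "UNKNOWN")
    return {
        "document_id": doc_id,
        "status": f"{stage}_FAILED",
        "error_message": record.get("body", "")[:1024],  # truncate large payloads
        "failed_at": datetime.now(timezone.utc).isoformat(),
    }


def handler(event, context):
    """Central DLQ processor: one function handles failures from every stage."""
    import os
    import boto3  # lazy import keeps failure_update testable offline

    table = boto3.resource("dynamodb").Table(os.environ["TABLE_NAME"])
    for record in event["Records"]:
        # update_item would preserve existing attributes; put_item shown for brevity
        table.put_item(Item=failure_update(record))
```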
DynamoDB Table Design and GSI Strategy
We used a single DynamoDB table to store all document metadata and processing results. Each stage added its own distinct attributes to the item, with no overlap beyond the initial Stage 1 metadata. This design allowed us to track the complete processing lifecycle of each document in one place.
To address read-on-partial-write scenarios and enable efficient querying of completed documents at each stage, we implemented three Global Secondary Indexes (GSIs):
- Stage 1 GSI: Indexed on Stage 1 completion date
- Stage 2 GSI: Indexed on Stage 2 completion date
- Stage 3 GSI: Indexed on Stage 3 completion date
Each completion date attribute was only written when a document successfully completed that stage. This meant we could query for all documents completed at a given stage without returning partially processed documents.
Critically, to mitigate the risk of hot DynamoDB partitions and throttling, we partitioned each GSI using a random partition ID between 0 and 29. This spread the write load across 30 partitions per date, preventing any single partition from becoming a bottleneck during high-volume processing. Queries would fan out across all 30 partitions and merge the results.
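The write-sharding scheme can be sketched as follows; the `shard#date` key format is illustrative, and any scheme that prefixes the completion date with a random shard works the same way.

```python
import random

NUM_PARTITIONS = 30  # random partition IDs 0-29, as described above


def gsi_partition_key(completion_date: str) -> str:
    """Build a write-sharded GSI partition key, e.g. '17#2025-11-01'.

    Writers pick a random shard so the per-date write load spreads evenly
    across 30 partitions instead of hammering one.
    """
    return f"{random.randrange(NUM_PARTITIONS)}#{completion_date}"


def fan_out_keys(completion_date: str) -> list:
    """All 30 partition keys to query (and merge) for one completion date."""
    return [f"{shard}#{completion_date}" for shard in range(NUM_PARTITIONS)]
```

A reader querying "everything completed on 2025-11-01" issues 30 GSI queries, one per key from `fan_out_keys`, and merges the results.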
Observability with PowerTools and X-Ray
We used AWS Lambda PowerTools throughout the solution to provide structured logging, tracing and metrics.
PowerTools gave us:
- Structured logging: JSON-formatted logs that could be easily queried in CloudWatch Logs Insights
- Distributed tracing: Automatic X-Ray tracing, with sampling, to help pinpoint bottlenecks
- Custom metrics: Application-level metrics pushed to CloudWatch without impacting performance
X-Ray sampling was configured to capture a representative sample of traces without incurring excessive costs. During the initial rollout we sampled at a higher rate to catch any issues early, then reduced the sampling rate once we were confident the system was stable.
This level of observability proved invaluable for troubleshooting issues and understanding system behaviour under load.
The Results
The solution successfully processed more than 50 million documents in just a few weeks once all three stages were deployed.
The customer was able to:
- Access extracted text and metadata for all documents
- Query documents by a variety of predicates
- Integrate the processed data into downstream business systems
- Continue processing new documents as they arrived with minimal latency
The serverless architecture meant that costs remained predictable and scaled linearly with usage. During the initial backlog processing, the system automatically scaled up to handle thousands of documents per minute. Once the backlog was cleared, it scaled back down to handle the steady-state load of new documents.
Lessons Learnt
Building an IDP solution at this scale taught us several valuable lessons:
Start simple, iterate quickly: By breaking the solution into three stages, we were able to deliver value early and iterate based on real-world usage patterns.
Serverless is perfect for batch workloads: The ability to scale from zero to hundreds of concurrent executions and back to zero made serverless ideal for processing the historical backlog.
Observability is not optional: PowerTools and X-Ray were essential for understanding system behaviour and troubleshooting issues at scale.
SQS provides excellent value: For high-volume workloads, SQS offers the best combination of cost, performance and operational simplicity.
DynamoDB single-table design works well: Storing all document metadata, processing status and results in a single DynamoDB table simplified the architecture and reduced costs compared to using multiple tables or RDS.
Plan for failure: The central DLQ and comprehensive error handling meant that no documents were lost, even when individual processing steps failed.
Conclusion
Building an Intelligent Document Processing solution doesn’t require complex infrastructure or months of development time. By leveraging AWS serverless services, we were able to go from initial architecture to a production system processing millions of documents in under 4 months.
The key was to start simple, deliver value incrementally in collaboration with the customer and use managed services wherever possible. The combination of S3, SQS, Lambda, DynamoDB, SageMaker and Bedrock provided a robust, scalable and cost-effective foundation for processing more than 50 million documents.
If you’re facing a similar challenge and need help building serverless solutions on AWS, Virtuability can help. We specialise in AWS serverless architectures, AI/ML integration and delivering business value quickly through iterative development.
Visit us at https://www.virtuability.com to learn more about how we help businesses leverage AWS to meet their objectives.