Nov 01, 2025

Virtuability Works With UK Financial Services Firm To Deliver An Intelligent Document Processing Solution

8 minute read

Introduction

Virtuability, a Professional Services consulting company and AWS Select Services Partner, partnered with a UK financial services firm to deliver an Intelligent Document Processing (IDP) solution that successfully processed more than 50 million documents in just a few weeks.

The entire system went from initial concept to production in under 4 months using AWS serverless services, demonstrating how modern AWS-native architectures can unlock business value at scale.

Challenges

The customer had accumulated tens of millions of documents over several years that contained critical business data. After initial analysis, Virtuability identified the key challenges the customer was facing:

  • Data Accessibility: Critical business information was locked within millions of unstructured documents and inaccessible to downstream business systems

  • Document Classification: Each document needed to be classified by type using machine learning to enable appropriate downstream processing

  • Content Extraction: Relevant information needed to be intelligently extracted from documents to support business decision-making

  • Processing Scale: The solution needed to handle both a massive historical backlog and ongoing daily document arrivals from multiple source systems

  • Time-to-Value: Speed was critical - the customer needed results quickly, with incremental value delivered iteratively at each stage

Why Virtuability?

Virtuability has a strong history of collaboration with customers in the SaaS and financial services sectors. We are specialised AWS Cloud experts with a team of consultants working across several technology domains.

Over the years, our AWS Select Services Partner status has validated our expertise and our ongoing commitment to the AWS Cloud.

Solution

Virtuability designed and implemented a comprehensive IDP solution leveraging AWS serverless services and best practices.

Architecture

IDP Architecture Diagram

The IDP Architecture above illustrates the three-stage processing workflow and key components of the solution:

  • Stage-based Processing: Clear separation between text extraction, ML classification and AI-powered content extraction
  • Serverless Foundation: Lambda, SQS and DynamoDB providing automatic scaling and pay-per-use economics
  • ML Integration: SageMaker endpoints for custom document classification models
  • AI-powered Extraction: Amazon Bedrock with Claude for intelligent content extraction
  • Centralised Error Handling: Single Dead Letter Queue with unified error processing
  • Comprehensive Observability: CloudWatch dashboards and PowerTools for monitoring and troubleshooting

Why Serverless?

Virtuability chose AWS serverless services for several reasons:

  • Pay-per-use: No need to provision or pay for long-running, idle infrastructure
  • Auto-scaling: Automatic scaling to handle variable workloads from zero to thousands of documents per minute
  • Reduced operational overhead: No servers to patch or maintain
  • Speed to market: Focus on business logic rather than infrastructure management
  • Cost-effective at scale: Particularly when using services like SQS and Lambda with pay-per-request pricing

Stage 1: Text Extraction and Metadata Storage

Timeline: 2 months from concept to production

The first stage established the foundation:

  1. Transfer service collects documents from multiple source systems and uploads to an S3 bucket
  2. S3 Event Notification triggers and sends message to an SQS queue
  3. Lambda function processes messages from the queue, extracts document text and metadata
  4. Contents and metadata stored in DynamoDB
  5. Failed messages route to a central Dead Letter Queue (DLQ)
  6. Separate DLQ processor Lambda adds error message and status to DynamoDB

This stage was critical to get right. Virtuability ensured that every document was tracked, that failures were captured and that there was full visibility into the processing workflow.
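
As a rough illustration of this stage, the sketch below shows how such an SQS-triggered Lambda might look in Python. The table name, environment variable and extract_text helper are hypothetical placeholders rather than the customer's actual implementation.

```python
# A minimal sketch of the Stage 1 handler, assuming an SQS event source fed by
# S3 Event Notifications. DOCUMENT_TABLE and extract_text are hypothetical
# placeholders, not the customer's actual implementation.
import json
import os

import boto3

s3 = boto3.client("s3")
table = boto3.resource("dynamodb").Table(os.environ["DOCUMENT_TABLE"])


def extract_text(raw_bytes: bytes) -> str:
    """Placeholder for the document text-extraction step."""
    return raw_bytes.decode("utf-8", errors="ignore")


def handler(event, context):
    for record in event["Records"]:                 # one SQS message per record
        s3_event = json.loads(record["body"])       # S3 Event Notification payload
        for s3_record in s3_event.get("Records", []):
            bucket = s3_record["s3"]["bucket"]["name"]
            key = s3_record["s3"]["object"]["key"]

            obj = s3.get_object(Bucket=bucket, Key=key)
            text = extract_text(obj["Body"].read())

            # Store contents and metadata; an unhandled exception returns the
            # message to the queue and, eventually, to the central DLQ.
            table.put_item(Item={
                "documentId": key,
                "bucket": bucket,
                "contents": text,
                "stage1Status": "COMPLETED",
            })
```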

Stage 2: Document Classification

Timeline: 1 additional month

The second stage added intelligent classification:

  1. Stage 1 Lambda emits message to Stage 2 SQS queue upon successful processing
  2. Classification Lambda processes messages from the queue and invokes a SageMaker endpoint hosting a fine-tuned BERT model
  3. SageMaker endpoint configured with auto-scaling to handle variable load
  4. Classification results written back to DynamoDB as new attributes
  5. Failed messages route to the same central DLQ

For historical documents already processed by Stage 1, Virtuability built a Glue job that hydrated the Stage 2 queue with messages for documents that needed classification.
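
The hydration job itself can be quite simple. Below is a hedged sketch, with hypothetical table, queue and attribute names, of a Glue Python job that finds documents which completed Stage 1 but carry no classification yet and enqueues a Stage 2 message for each; the production job would query a GSI rather than scan the whole table.

```python
# A hedged sketch of the hydration idea: find documents with no classification
# yet and enqueue a Stage 2 message for each. Table name, queue URL and
# attribute names are hypothetical placeholders.
import json

import boto3

dynamodb = boto3.client("dynamodb")
sqs = boto3.client("sqs")

TABLE_NAME = "idp-documents"                                                  # hypothetical
QUEUE_URL = "https://sqs.eu-west-2.amazonaws.com/111111111111/stage2-queue"   # hypothetical

paginator = dynamodb.get_paginator("scan")
for page in paginator.paginate(
    TableName=TABLE_NAME,
    FilterExpression="attribute_not_exists(documentType)",
):
    for item in page["Items"]:
        sqs.send_message(
            QueueUrl=QUEUE_URL,
            MessageBody=json.dumps({"documentId": item["documentId"]["S"]}),
        )
```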

The use of SageMaker allowed deployment of a custom model that had been fine-tuned on the customer’s specific document types. Auto-scaling ensured peak loads could be handled whilst keeping costs down during quieter periods.
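
The classification Lambda itself reduces to an invoke_endpoint call followed by a DynamoDB update. The sketch below assumes a JSON request/response contract and hypothetical endpoint, table and attribute names.

```python
# A hedged sketch of the Stage 2 classification Lambda. The endpoint name,
# request/response contract and attribute names are assumptions; a fine-tuned
# BERT endpoint typically accepts JSON and returns a label plus a score.
import json
import os
from decimal import Decimal

import boto3

runtime = boto3.client("sagemaker-runtime")
table = boto3.resource("dynamodb").Table(os.environ["DOCUMENT_TABLE"])


def handler(event, context):
    for record in event["Records"]:
        message = json.loads(record["body"])        # emitted by the Stage 1 Lambda
        document_id = message["documentId"]

        response = runtime.invoke_endpoint(
            EndpointName=os.environ["CLASSIFIER_ENDPOINT"],   # hypothetical endpoint
            ContentType="application/json",
            Body=json.dumps({"inputs": message["contents"]}),
        )
        prediction = json.loads(response["Body"].read())

        # Write the classification result back as new attributes on the item.
        table.update_item(
            Key={"documentId": document_id},
            UpdateExpression="SET documentType = :t, classificationScore = :s",
            ExpressionAttributeValues={
                ":t": prediction["label"],
                ":s": Decimal(str(prediction["score"])),
            },
        )
```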

Stage 3: Content Extraction

Timeline: 1 additional month

The third stage added semantic understanding:

  1. Stage 2 Lambda emits message to Stage 3 SQS queue upon successful classification of specific document types
  2. Content extraction Lambda processes messages from the queue and invokes Bedrock with Claude Haiku model
  3. Claude extracts relevant content from documents
  4. Extracted content written back to DynamoDB as new attributes
  5. Failed messages route to the same central DLQ

Critically, only documents classified as specific document types in Stage 2 progressed to Stage 3. This filtering ensured that content extraction was only performed on documents where it would add value.

Using Bedrock with Claude Haiku provided excellent results for content extraction. The model’s ability to understand context and extract structured information from unstructured text was exactly what was needed.
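
A minimal sketch of that Bedrock call is shown below, using the Converse API. The prompt, model ID and response handling are illustrative assumptions rather than the exact production configuration.

```python
# A minimal sketch of the Stage 3 extraction call to Amazon Bedrock via the
# Converse API. The prompt, model ID and response handling are illustrative
# assumptions rather than the exact production configuration.
import json

import boto3

bedrock = boto3.client("bedrock-runtime")

PROMPT = (
    "Extract the key fields from the following document and return them "
    "as a JSON object:\n\n{document_text}"
)


def extract_content(document_text: str) -> dict:
    response = bedrock.converse(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",   # assumed Haiku model ID
        messages=[{
            "role": "user",
            "content": [{"text": PROMPT.format(document_text=document_text)}],
        }],
        inferenceConfig={"maxTokens": 1024, "temperature": 0.0},
    )
    # The model is asked for JSON, so parse it before writing back to DynamoDB.
    return json.loads(response["output"]["message"]["content"][0]["text"])
```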

Deployment Pipeline

Beyond the document processing workflow itself, Virtuability implemented a CI/CD pipeline using AWS CDK and CodePipeline to manage deployments across multiple environments and AWS accounts.

Deployment Pipeline Diagram

Multi-Account Strategy

A multi-account AWS strategy was adopted following best practices:

  • Shared Services Account: Hosted the CodeCommit repository, CDK pipeline (CodePipeline) and cross-account deployment orchestration
  • NonProd Account: Contained both Dev and UAT environments for development and user acceptance testing
  • Prod Account: Hosted the production environment with appropriate guardrails and separation

Pipeline Flow

The CDK pipeline automatically triggered on commits to the CodeCommit repository and followed this stack deployment sequence per environment:

  1. KMS Stack: Customer-managed KMS keys for encrypting data at rest
  2. Storage Stack: S3 buckets (with encryption) and DynamoDB table with three GSIs
  3. Processing Stack: All Lambda functions, SQS queues, DLQ, Glue jobs and IAM roles/policies
  4. Dashboard Stack: CloudWatch dashboards providing visibility into processing metrics and system health

Integration testing was particularly important for ensuring quality. A separate CloudFormation stack deployed and ran integration tests in the Dev environment to validate the complete workflow end-to-end with sample documents.
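
The sketch below shows, under assumed stack names, how that per-environment stack ordering could be expressed with the CDK in Python. The stack classes are empty placeholders standing in for the real KMS, storage, processing and dashboard stacks; the CodePipeline wiring and cross-account configuration are omitted.

```python
# A sketch of the per-environment stack ordering in CDK (Python). The stack
# classes are placeholders; CodePipeline wiring and cross-account settings
# are omitted for brevity.
from aws_cdk import App, Stack, Stage
from constructs import Construct


class KmsStack(Stack): pass         # customer-managed KMS keys
class StorageStack(Stack): pass     # encrypted S3 buckets + DynamoDB table with 3 GSIs
class ProcessingStack(Stack): pass  # Lambdas, SQS queues, DLQ, Glue jobs, IAM roles
class DashboardStack(Stack): pass   # CloudWatch dashboards


class IdpStage(Stage):
    """One environment (Dev, UAT or Prod) deployed in dependency order."""

    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        kms = KmsStack(self, "Kms")
        storage = StorageStack(self, "Storage")
        processing = ProcessingStack(self, "Processing")
        dashboard = DashboardStack(self, "Dashboard")
        # Enforce the KMS -> Storage -> Processing -> Dashboard sequence.
        storage.add_dependency(kms)
        processing.add_dependency(storage)
        dashboard.add_dependency(processing)


app = App()
IdpStage(app, "Dev")
app.synth()
```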

Key Design Decisions

SQS vs Step Functions

Virtuability chose SQS over Step Functions for several reasons:

  • Concurrency control: SQS allows fine-grained control over Lambda concurrency between stages, which is critical when downstream services such as SageMaker endpoints have concurrent execution constraints (see the sketch after this list)
  • Cost at scale: at a volume of 50 million documents, SQS is significantly cheaper than Step Functions
  • Simplicity: Each stage is independent and can be developed, tested and deployed separately
  • Natural backpressure: SQS provides natural backpressure when downstream systems are under load
  • Replay capability: Glue hydration jobs provided the ability to replay processing for any stage
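
The concurrency point in particular can be expressed directly on the SQS event source mapping. The following boto3 sketch, with hypothetical ARNs and function names, caps a stage at a fixed number of concurrent Lambda invocations so a downstream SageMaker endpoint is never overwhelmed; in the solution itself this would be declared through CDK rather than via API calls.

```python
# An illustrative sketch, with hypothetical ARNs and names, of capping a
# stage's Lambda concurrency on the SQS event source mapping.
import boto3

lambda_client = boto3.client("lambda")

lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:sqs:eu-west-2:111111111111:stage2-classification-queue",
    FunctionName="stage2-classification",
    BatchSize=1,
    ScalingConfig={"MaximumConcurrency": 10},  # at most 10 concurrent invocations
)
```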

Central Dead Letter Queue

Rather than having separate DLQs for each stage, Virtuability implemented a single central DLQ with one Lambda function handling all failures. This approach simplified operations, provided consistent error handling and gave better visibility through CloudWatch metrics.
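
A sketch of such a DLQ processor is shown below; the message shape and attribute names are assumptions for illustration.

```python
# A hedged sketch of the central DLQ processor: every failed message, whichever
# stage it came from, is recorded against the document item in DynamoDB.
import json
import os

import boto3

table = boto3.resource("dynamodb").Table(os.environ["DOCUMENT_TABLE"])


def handler(event, context):
    for record in event["Records"]:
        try:
            body = json.loads(record["body"])
            document_id = body.get("documentId", record["messageId"])
        except json.JSONDecodeError:
            document_id = record["messageId"]

        table.update_item(
            Key={"documentId": document_id},
            UpdateExpression="SET processingStatus = :s, errorMessage = :e",
            ExpressionAttributeValues={
                ":s": "FAILED",
                ":e": record["body"][:4000],   # keep a truncated copy of the payload
            },
        )
```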

DynamoDB Table Design

A single DynamoDB table stored all document metadata and processing results. To address read-on-partial-write scenarios and enable efficient querying of completed documents at each stage, three Global Secondary Indexes (GSIs) were implemented.

Critically, to mitigate the risk of hot DynamoDB partitions and throttling, each GSI was partitioned using a random partition ID between 0 and 29. This spread the write load across 30 partitions per date, preventing any single partition from becoming a bottleneck during high-volume processing.
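
The sharding scheme itself is only a few lines of code. The sketch below, using a hypothetical key format, shows writers picking a random shard and readers fanning out across all 30 shards to retrieve a full day's documents.

```python
# An illustrative sketch of the write-sharding scheme: each item's GSI
# partition key combines the date with a random shard between 0 and 29, so a
# single day's writes are spread across 30 partitions.
import random
from datetime import date

NUM_SHARDS = 30


def gsi_partition_key(processing_date: date) -> str:
    """Partition key used when writing an item, e.g. '2025-01-15#17'."""
    shard = random.randint(0, NUM_SHARDS - 1)
    return f"{processing_date.isoformat()}#{shard}"


def gsi_partition_keys_for_query(processing_date: date) -> list[str]:
    """Readers fan out across all 30 shards to fetch a full day's documents."""
    return [f"{processing_date.isoformat()}#{shard}" for shard in range(NUM_SHARDS)]
```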

Observability

AWS Lambda PowerTools was used throughout the solution to provide structured logging, tracing and metrics. X-Ray sampling was configured to capture a representative sample of traces without incurring excessive costs. This level of observability proved invaluable for troubleshooting issues and understanding system behaviour under load.
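
A typical way to wire this up with Powertools for AWS Lambda (Python) is sketched below; the service and namespace names are placeholders.

```python
# A minimal sketch of structured logging, tracing and custom metrics with
# Powertools for AWS Lambda (Python); service and namespace names are placeholders.
from aws_lambda_powertools import Logger, Metrics, Tracer
from aws_lambda_powertools.metrics import MetricUnit

logger = Logger(service="idp-stage1")
tracer = Tracer(service="idp-stage1")
metrics = Metrics(namespace="IDP", service="idp-stage1")


@logger.inject_lambda_context
@tracer.capture_lambda_handler
@metrics.log_metrics
def handler(event, context):
    records = event.get("Records", [])
    logger.info("Processing batch", extra={"record_count": len(records)})
    metrics.add_metric(name="DocumentsProcessed", unit=MetricUnit.Count, value=len(records))
    return {"status": "ok"}
```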

AWS Enablers

AWS offers a suite of powerful tools and services that enabled the IDP solution.

Amazon S3

Amazon S3 provides highly durable object storage. S3 served as the landing zone for incoming documents and provided event notifications to trigger the processing workflow.

Amazon SQS

Amazon SQS is a fully managed message queuing service. SQS decoupled the processing stages, provided natural backpressure and enabled fine-grained concurrency control.

AWS Lambda

AWS Lambda is a serverless compute service that allows code to run without provisioning or managing servers. Lambda functions processed documents at each stage, scaling automatically with demand.

Amazon DynamoDB

Amazon DynamoDB is a fully managed NoSQL database service. DynamoDB stored all document metadata and processing results with consistent performance at any scale.

Amazon SageMaker

Amazon SageMaker is a fully managed machine learning service. SageMaker hosted the fine-tuned BERT model for document classification with auto-scaling endpoints.

Amazon Bedrock

Amazon Bedrock is a fully managed service for foundation models. Bedrock with Claude Haiku provided intelligent content extraction from classified documents.

AWS CDK and CodePipeline

The AWS CDK and CodePipeline enabled infrastructure as code and continuous delivery across multiple AWS accounts.

Business Outcomes

  • Unlocked critical business data in weeks that was previously inaccessible in millions of documents
  • Enabled rapid decision-making by making historical document insights immediately available to downstream business systems
  • Reduced time-to-value through iterative delivery - the customer started seeing results after just 2 months with Stage 1, rather than waiting for a complete solution
  • Achieved predictable, linear costs that scaled with usage, avoiding large upfront infrastructure investments
  • 50+ million documents processed end-to-end with text extraction, ML classification and AI-powered content extraction
  • 4 months from concept to production across three iterative stages
  • Virtually no operational overhead for infrastructure management, allowing the team to focus on business logic

Conclusion

The collaboration between Virtuability and the UK financial services firm demonstrates how AWS serverless services can unlock business value at massive scale. By leveraging S3, SQS, Lambda, DynamoDB, SageMaker and Bedrock, Virtuability delivered a robust, scalable and cost-effective solution for processing more than 50 million documents in under 4 months.

The key was to start simple, deliver value incrementally and use managed services wherever possible. The serverless architecture meant that costs remained predictable and scaled linearly with usage - during the initial backlog processing, the system automatically scaled up to handle thousands of documents per minute, then scaled back down to handle the steady-state load of new documents.

This engagement illustrates how modern AWS-native architectures don’t require complex infrastructure or months of development time to deliver transformational business outcomes.

We have the tools to understand your cloud and the guidance to make the most of it.

GET IN TOUCH

Schedule a call with us and find out what Virtuability can do for you.

GET STARTED