Aug 19, 2019
Why Serverless & DevOps make a (big) difference
Background
We recently completed a Serverless & DevOps transformation project with one of our clients, CitizenMe. CitizenMe currently has more than 200,000 end-users worldwide and has processed millions of transactions since inception.
CitizenMe uses AWS and mobile apps almost exclusively for its product and service offerings. Most of the components are built and maintained in-house by a small team of developers and testers.
Development of the offerings started in 2013 on AWS, and the platform architecture was based on EC2 and Elastic Beanstalk. Over the following five years, infrastructure and services were built and configured by hand, and the development team deployed software manually in infrequent, release-based batches.
CitizenMe’s challenges included considerable problems with quality, availability, scale and consistency. Additionally, few actionable metrics existed and logging didn’t provide the necessary insights to derive more actionable metrics or to troubleshoot issues quickly.
Scope
The transformation project initially focused on the platform API. However, it quickly expanded in scope to encompass virtually all other components, including web services, web apps, reporting services and the website. The transformation led to the adoption of the following AWS Serverless components:
- API Gateway
- CloudFront
- Lambda (Java, Node.js, Python)
- ECS Fargate/Docker (Node.js)
- Application Load Balancer
- Aurora RDS for MySQL
- DynamoDB
- QuickSight Business Intelligence & Reporting
- Cognito authentication
- AWS Auto Scaling
- CloudWatch Logs, Metrics, Alarms, Dashboards & Insights
- GuardDuty
- Trusted Advisor
- Web Application Firewall (WAF)
Over time, developers moved from large, infrequent releases to a feature-based release cycle to improve testing, quality and velocity. AWS CodePipeline was adopted for each component to automate the development workflow from GitHub commit, through build and test, to rollout to production using Continuous Delivery. Each pipeline has development, test and production stages.
Peer reviews were introduced to improve code quality and documentation. Checks for known security vulnerabilities in code dependencies were added using npm audit and Maven Dependency-Check.
In addition the following non-functional areas were addressed:
- Contextual logging was added to all Java, Node.js and Python apps and services (using, for example, the Apache Log4j 2 Thread Context in Java). Logged information included the AWS request id, start/end time and duration, memory consumption at start and end, and API call identifiers. This helped us identify and address numerous bugs, performance issues and growth trends
- Auto Scaling of ECS Fargate services based on CPU metrics
- CloudWatch custom metrics for specific counts that helped us understand how the components were used, from both an operational and a business perspective (API calls, payments, push notifications and so on, together with their failures)
- CloudWatch Alarms on Lambda failures & timeouts, API Gateway request failures, custom metrics alarms for business critical operations etc.
- CloudWatch Dashboards to aggregate business and operationally critical information per application or service, which could be put on a public screen
- GuardDuty to monitor and alert on suspicious network and AWS API activities
- Trusted Advisor reports to assist with identification of cost, security, performance, fault tolerance and service limits issues
- All alerting was sent to operational, security and pipeline Slack channels to alert developers of issues within seconds to minutes of occurrence
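As an illustration of the contextual logging and custom count metrics described above, here is a minimal Python sketch. It is not CitizenMe's actual code: the names `build_logger`, `with_context` and `put_count`, the log format and the example namespace are all hypothetical. It assumes a Lambda-style handler whose context object exposes `aws_request_id`, and a CloudWatch client that behaves like `boto3.client("cloudwatch")`. The memory reading uses the Unix-only `resource` module and reports peak resident set size, which is only an approximation of "memory consumption at start and end".

```python
import contextvars
import logging
import resource  # Unix-only; ru_maxrss is KB on Linux, bytes on macOS
import time
from functools import wraps

# Per-invocation context, playing a role similar to a Log4j 2 Thread Context.
_log_context = contextvars.ContextVar("log_context", default={})


class ContextFilter(logging.Filter):
    """Copy the current invocation context onto every log record."""

    def filter(self, record):
        ctx = _log_context.get()
        record.request_id = ctx.get("request_id", "-")
        record.api_call = ctx.get("api_call", "-")
        return True


def build_logger(name="app"):
    """Return a logger whose lines carry request_id and api_call."""
    logger = logging.getLogger(name)
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter(
            "%(asctime)s %(levelname)s request_id=%(request_id)s "
            "api_call=%(api_call)s %(message)s"))
        logger.addHandler(handler)
        logger.addFilter(ContextFilter())
    logger.setLevel(logging.INFO)
    return logger


def _peak_rss():
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


def with_context(api_call, logger):
    """Decorator: set logging context and log start, end and duration."""
    def decorator(handler):
        @wraps(handler)
        def wrapper(event, context):
            request_id = getattr(context, "aws_request_id", "unknown")
            token = _log_context.set(
                {"request_id": request_id, "api_call": api_call})
            start = time.perf_counter()
            logger.info("start peak_rss=%d", _peak_rss())
            try:
                return handler(event, context)
            except Exception:
                logger.exception("failed")
                raise
            finally:
                duration_ms = (time.perf_counter() - start) * 1000
                logger.info("end duration_ms=%.1f peak_rss=%d",
                            duration_ms, _peak_rss())
                _log_context.reset(token)
        return wrapper
    return decorator


def put_count(cloudwatch, namespace, metric_name, value=1, dimensions=None):
    """Publish a count as a CloudWatch custom metric via put_metric_data."""
    cloudwatch.put_metric_data(
        Namespace=namespace,
        MetricData=[{
            "MetricName": metric_name,
            "Dimensions": [{"Name": k, "Value": v}
                           for k, v in (dimensions or {}).items()],
            "Value": value,
            "Unit": "Count",
        }],
    )
```

A handler would be wrapped as `@with_context("get_profile", logger)`, and a business event counted with something like `put_count(client, "Example/Api", "Payments")`. Passing the client in as a parameter keeps the sketch testable without AWS credentials; alarms can then be defined on the resulting metric.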
Impact
The impact of these changes on CitizenMe has been profound. The key results are:
- Adopting AWS serverless has reduced the operational burden on the whole development team by more than 75%
- With automation, logging, metrics and alerts, quality of service has improved considerably across all products, with major events reduced by more than 90%
- Pipelines and Serverless architectures have removed virtually all deployment outages
- DevOps has increased velocity of individual changes by at least 5x
- Issue reaction time has been reduced from days or weeks to minutes or at most hours depending on urgency. It is not unusual for issues to be detected and rectified in production within an hour from occurrence
- A considerable reduction in new bugs reaching production because each new feature and change is tested in isolation. As a result the test function has also become much more transparent across the organisation
- A considerable reduction of existing bugs due to active logging, monitoring and alerting
- The cost of operating in AWS with more than 200,000 users is almost exactly the same as it was on the old architecture with fewer than 100,000 users 12 months ago
As a result, cultural changes have also been observed across the organisation.
The technical team now alerts the business to issues before end-users contact CitizenMe. This has improved the business proposition considerably and has meant much happier customers.
Furthermore, the business now has a much better understanding of the backlog and outstanding issues in the pipeline. Agile practices and the JIRA board have also been much simplified.