How we saved more than 42% on AWS infrastructure costs.

Context.

The client is a no-code platform for building mobile apps for Shopify stores. The platform is hosted on AWS and serves customers across the globe. The company has been operating for 7+ years and has a dev team of 30+ members with little DevOps expertise.

Problem Statement.

Reduce the existing cloud infrastructure cost of $25k/month without affecting business metrics or uptime.

Implement cloud governance and infrastructure access policies for security.

Set up continuous billing monitoring, alerts, and deeper insight into how costs are allocated across functions within the company.

42% cost reduction for AWS Compute service

56% cost reduction for AWS RDS

90% cost reduction for AWS Data transfer

Outcome/Impact.

A 42% cost reduction, from $25k/month to $14k/month, achieved with just one DevOps engineer who was also supporting the 30+ member dev team with daily operations and incidents. The time spent on the cost reduction effort was about two months.

No impact on uptime, product launch velocity, and other business metrics during the cost reduction period.

Over 35% cost reduction in production and 50% cost savings in non-production environments.

Set up continuous cost monitoring and anomaly detection with granular visibility of cost allocation.

Solution.

Figure: Cost reduction over time

This was the cost reduction over a 6-month period. It happened without affecting product iteration speed, and the dev team kept shipping features in parallel.

We started the cost reduction exercise by analyzing the AWS billing export for the last few months, breaking down current spend by region and by AWS service.
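As a minimal sketch of this kind of breakdown, the snippet below totals cost per service and region from a CSV export and ranks the results. The column names (`service`, `region`, `cost`) and the sample rows are hypothetical; a real AWS billing export uses different, longer column names, so the column arguments would need to be adapted.

```python
import csv
import io
from collections import defaultdict


def cost_by_service_and_region(rows, service_col="service",
                               region_col="region", cost_col="cost"):
    """Sum cost per (service, region) and return pairs sorted by cost, highest first.

    The default column names are placeholders, not real billing export columns.
    """
    totals = defaultdict(float)
    for row in rows:
        totals[(row[service_col], row[region_col])] += float(row[cost_col])
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


def load_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


# Made-up sample data standing in for a billing export.
SAMPLE_EXPORT = """service,region,cost
EC2,us-east-1,120.50
RDS,us-east-1,80.00
EC2,eu-west-1,30.25
EC2,us-east-1,19.50
"""

ranked = cost_by_service_and_region(load_rows(SAMPLE_EXPORT))
for (service, region), total in ranked:
    print(f"{service:<6} {region:<10} {total:>8.2f}")
```

Sorting by total cost puts the largest categories at the top, which is where a deeper review is likely to pay off first.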

We found that the Compute and RDS costs were abnormally high. We worked closely with the dev team to dive deeper into each major cost category.

We optimized Compute costs in the following ways.

    1. Removed numerous unattached EBS volumes left behind by terminated instances, saving $2k/month. We also identified the root cause and refined the process for creating new instances.
    2. Optimized other unused Compute resources. For EC2, we scrutinized CPU usage and optimized instances with under 50% CPU utilization.
    3. Further examined memory usage for the above EC2 instances, optimizing those with less than 50% memory utilization.
    4. Identified overprovisioned VMs and implemented a zero-downtime approach for downsizing, ensuring no impact on user traffic.

Figure: Compute cost reduction

Overall cost reduction for Compute resources was 42%.
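The first three Compute steps can be sketched roughly as follows. The pure functions work on plain dictionaries so they run without AWS access. `fetch_available_volumes` shows one way to pull the data with boto3 and is not called here. The instance IDs, utilization numbers, and the choice of a single utilization figure per instance are assumptions for illustration; the 50% threshold comes from the steps above. Note that EC2 does not publish memory metrics to CloudWatch by default, so memory figures would have to come from an agent or another monitoring tool.

```python
def unattached_volumes(volumes):
    """Return volumes that are not attached to any instance.

    `volumes` is a list of dicts shaped like the "Volumes" entries of the
    EC2 DescribeVolumes response ("VolumeId", "Size", "State", "Attachments").
    """
    return [
        v for v in volumes
        if v.get("State") == "available" and not v.get("Attachments")
    ]


def rightsizing_candidates(utilization, threshold=50.0):
    """Return instance IDs whose CPU and memory utilization are both below threshold.

    `utilization` maps instance ID -> {"cpu": percent, "memory": percent}.
    How the percentages are measured (peak, average, window) is left open here.
    """
    return sorted(
        instance_id
        for instance_id, usage in utilization.items()
        if usage["cpu"] < threshold and usage["memory"] < threshold
    )


def fetch_available_volumes(region_name):
    """Optional helper: list unattached volumes in one region using boto3."""
    import boto3  # third-party; only needed when querying a real account

    ec2 = boto3.client("ec2", region_name=region_name)
    paginator = ec2.get_paginator("describe_volumes")
    volumes = []
    for page in paginator.paginate(
        Filters=[{"Name": "status", "Values": ["available"]}]
    ):
        volumes.extend(page["Volumes"])
    return volumes


# Made-up sample data for illustration.
SAMPLE_VOLUMES = [
    {"VolumeId": "vol-0aaa", "Size": 100, "State": "available", "Attachments": []},
    {"VolumeId": "vol-0bbb", "Size": 50, "State": "in-use",
     "Attachments": [{"InstanceId": "i-0123"}]},
]
SAMPLE_UTILIZATION = {
    "i-web-1": {"cpu": 35.0, "memory": 40.0},
    "i-web-2": {"cpu": 30.0, "memory": 75.0},
    "i-worker-1": {"cpu": 82.0, "memory": 45.0},
}

orphans = unattached_volumes(SAMPLE_VOLUMES)
candidates = rightsizing_candidates(SAMPLE_UTILIZATION)
print("Unattached volumes:", [v["VolumeId"] for v in orphans])
print("Rightsizing candidates:", candidates)
```

Requiring both CPU and memory to be low mirrors the order of the steps: CPU narrows the list first, and memory confirms that a smaller instance type would still fit the workload.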

Next, we looked at how to reduce RDS costs. For this, we did the following.

Figure: RDS cost reduction

  1. The existing RDS instance had occasional CPU spikes, so it couldn’t be downsized directly.
  2. We enabled and analysed slow query logs and worked with the dev team to implement application-side fixes. This made the load on the DB more predictable, which then let us optimize the database infrastructure.
  3. We downsized RDS instance types using a zero-downtime strategy to avoid any business impact on users.
  4. We also renegotiated Reserved Instance plans to match the new instance types.

Overall cost reduction for RDS was 56%.
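To illustrate the slow query analysis step, here is a small sketch that groups slow queries by shape and ranks them by total time. It assumes the log has already been parsed into `(sql, seconds)` pairs; the parsing itself depends on the database engine, which the write-up does not specify. The table names, queries, and timings are made up.

```python
import re
from collections import defaultdict


def fingerprint(sql):
    """Normalize a query so that calls differing only in literals group together."""
    sql = re.sub(r"'[^']*'", "?", sql)   # quoted string literals
    sql = re.sub(r"\b\d+\b", "?", sql)   # standalone numeric literals
    return re.sub(r"\s+", " ", sql).strip().lower()


def rank_slow_queries(entries):
    """Aggregate (sql, seconds) pairs by fingerprint, slowest total first."""
    stats = defaultdict(lambda: {"count": 0, "total_seconds": 0.0})
    for sql, seconds in entries:
        entry = stats[fingerprint(sql)]
        entry["count"] += 1
        entry["total_seconds"] += seconds
    return sorted(
        stats.items(), key=lambda kv: kv[1]["total_seconds"], reverse=True
    )


# Made-up entries standing in for parsed slow query log records.
SAMPLE_ENTRIES = [
    ("SELECT * FROM orders WHERE shop_id = 42", 2.5),
    ("SELECT * FROM orders WHERE shop_id = 7", 3.5),
    ("SELECT name FROM shops WHERE domain = 'a.example'", 1.0),
]

ranked_queries = rank_slow_queries(SAMPLE_ENTRIES)
for query, info in ranked_queries:
    print(f"{info['total_seconds']:6.1f}s  x{info['count']}  {query}")
```

Ranking by total time rather than by the single slowest call highlights queries that are moderately slow but run often, which tend to be good targets for application-side fixes.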

The next highest cost was AWS data transfer.

  1. We used VPC flow logs to find the resources exchanging the most traffic within the region and moved the chattiest ones into the same availability zones, reducing intra-region data transfer costs.
  2. We also found that background workers were processing some jobs more than once. We worked with the dev team to identify the duplicate jobs and stop them.
  3. We also implemented VPC endpoints for OpenSearch, the application log store, thereby saving data transfer costs.

Figure: Data transfer cost reduction

Overall data transfer cost reduction was 90%.
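The flow log analysis from the first step could look something like the sketch below. It parses records in the default VPC flow log format and sums bytes between address pairs that sit in different availability zones. The `ip_to_az` mapping is a hypothetical input that would have to be built separately (for example from the account's network interface inventory), and the sample records are made up.

```python
from collections import defaultdict

# Field order of the default (version 2) VPC flow log record format.
FIELDS = (
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
)


def cross_az_bytes(lines, ip_to_az):
    """Sum bytes per (source, destination) pair whose zones differ.

    `ip_to_az` maps a private IP address to its availability zone; addresses
    missing from the mapping (for example internet hosts) are ignored.
    """
    totals = defaultdict(int)
    for line in lines:
        parts = line.split()
        if len(parts) != len(FIELDS):
            continue  # not a default-format record
        record = dict(zip(FIELDS, parts))
        if record["log_status"] != "OK" or record["action"] != "ACCEPT":
            continue  # skip NODATA/SKIPDATA records and rejected traffic
        src_az = ip_to_az.get(record["srcaddr"])
        dst_az = ip_to_az.get(record["dstaddr"])
        if src_az is None or dst_az is None or src_az == dst_az:
            continue
        totals[(record["srcaddr"], record["dstaddr"])] += int(record["bytes"])
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Made-up records and zone mapping for illustration.
SAMPLE_LINES = [
    "2 123456789012 eni-aaa 10.0.1.10 10.0.2.20 443 49152 6 10 5000 1700000000 1700000060 ACCEPT OK",
    "2 123456789012 eni-aaa 10.0.1.10 10.0.2.20 443 49153 6 5 2500 1700000060 1700000120 ACCEPT OK",
    "2 123456789012 eni-bbb 10.0.1.10 10.0.1.11 5432 40000 6 8 9000 1700000000 1700000060 ACCEPT OK",
    "2 123456789012 eni-ccc - - - - - - - 1700000000 1700000060 - NODATA",
]
SAMPLE_IP_TO_AZ = {
    "10.0.1.10": "us-east-1a",
    "10.0.1.11": "us-east-1a",
    "10.0.2.20": "us-east-1b",
}

talkers = cross_az_bytes(SAMPLE_LINES, SAMPLE_IP_TO_AZ)
for (src, dst), total in talkers:
    print(f"{src} -> {dst}: {total} bytes across zones")
```

The same flow may be recorded on both the sending and the receiving interface, so the totals are best read as a relative ranking of the chattiest pairs rather than as exact billed volumes.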

For cloud governance:

  • We streamlined IAM access via groups and policies and ensured that no permissions are assigned directly to any AWS user.
  • We set up billing alerts and cost anomaly alerts for important AWS services.
  • We tagged all important resources. This enabled us to have continuous cost monitoring per environment.
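Two of these governance checks can be sketched as below: flagging users that still hold directly assigned policies, and rolling up cost per environment from resource tags. The pure functions operate on plain dictionaries. `fetch_user_policies` shows one possible way to collect the IAM data with boto3 and is not called here. The user names, resource IDs, costs, and the `environment` tag key are assumptions for illustration.

```python
def users_with_direct_permissions(users):
    """Return user names that have managed or inline policies attached directly.

    `users` maps user name -> {"attached": [...], "inline": [...]}.
    """
    return sorted(
        name for name, policies in users.items()
        if policies["attached"] or policies["inline"]
    )


def cost_per_environment(resources, tag_key="environment"):
    """Sum monthly cost per value of `tag_key`; untagged resources are grouped."""
    totals = {}
    for resource in resources:
        env = resource.get("tags", {}).get(tag_key, "untagged")
        totals[env] = totals.get(env, 0.0) + resource["monthly_cost"]
    return totals


def fetch_user_policies():
    """Optional helper: collect per-user policies from a real account via boto3.

    For brevity the per-user calls are not paginated.
    """
    import boto3  # third-party; only needed when querying a real account

    iam = boto3.client("iam")
    users = {}
    for page in iam.get_paginator("list_users").paginate():
        for user in page["Users"]:
            name = user["UserName"]
            attached = iam.list_attached_user_policies(UserName=name)
            inline = iam.list_user_policies(UserName=name)
            users[name] = {
                "attached": [p["PolicyName"] for p in attached["AttachedPolicies"]],
                "inline": inline["PolicyNames"],
            }
    return users


# Made-up sample data for illustration.
SAMPLE_USERS = {
    "alice": {"attached": [], "inline": []},
    "bob": {"attached": ["AdministratorAccess"], "inline": []},
    "ci-bot": {"attached": [], "inline": ["deploy-inline"]},
}
SAMPLE_RESOURCES = [
    {"id": "i-0aaa", "tags": {"environment": "production"}, "monthly_cost": 100.0},
    {"id": "i-0bbb", "tags": {"environment": "staging"}, "monthly_cost": 50.0},
    {"id": "db-0ccc", "tags": {"environment": "production"}, "monthly_cost": 25.5},
    {"id": "vol-0ddd", "tags": {}, "monthly_cost": 4.5},
]

offenders = users_with_direct_permissions(SAMPLE_USERS)
env_totals = cost_per_environment(SAMPLE_RESOURCES)
print("Users with direct permissions:", offenders)
print("Cost per environment:", env_totals)
```

Keeping an explicit "untagged" bucket makes gaps in tag coverage visible, since any cost that lands there cannot be attributed to an environment.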

Watch this talk for a detailed understanding of the work, presented at the Cloud Cost Conference in July 2023.