The Invoice That Started a Three-Month Cleanup Project
When one of our managed services clients forwarded their AWS invoice last quarter with the subject line “is this normal?”, I already had a guess at the answer before opening the PDF. Their monthly spend had drifted from roughly $4,200 to just under $11,000 over eighteen months, with no new product launches and no traffic spikes to justify it. What followed was a methodical engagement built almost entirely around AWS Cost Explorer, supported by a handful of Boto3 scripts and a much stricter tagging discipline. This walkthrough follows that same project, step by step, because the pattern repeats across nearly every environment we audit.
The interesting part is not the dollar figure; it is how cleanly the waste sorted itself once we stopped guessing and started looking at the right dimensions inside Cost Explorer.
Why Cost Explorer Earns Its Place Before Any Third-Party Tool
I will take a position here that some readers will disagree with: for the first ninety days of any cost cleanup, you do not need Cloudability, CloudHealth, or any external dashboard. AWS Cost Explorer is sufficient, it is free for the basic views, and it forces your team to learn the AWS billing data model directly rather than through a vendor’s abstraction. Once you understand the underlying structure, evaluating a paid platform becomes a much more honest conversation.
Cost Explorer lets you slice spending by service, linked account, region, instance type, usage type, and — critically — by cost allocation tag. Pair that with anomaly detection, which AWS will run on your account at no charge, and you have a reasonable monitoring foundation before you spend a dollar on third-party tooling.
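The same slicing is available programmatically through the Cost Explorer API. What follows is a minimal sketch, not the tooling from the engagement: the function names are hypothetical, and it assumes Boto3 credentials with `ce:GetCostAndUsage` permission. Unlike the console, the API is billed per request ($0.01 per paginated request at the time of writing), so cache results rather than polling.

```python
from collections import defaultdict


def fetch_cost_by_group(start, end, group_by):
    """Return every ResultsByTime entry for the period, following pagination.

    start/end are 'YYYY-MM-DD' strings; the end date is exclusive.
    """
    import boto3  # imported lazily so total_by_group() can be used offline

    ce = boto3.client("ce", region_name="us-east-1")
    kwargs = {
        "TimePeriod": {"Start": start, "End": end},
        "Granularity": "MONTHLY",
        "Metrics": ["UnblendedCost"],
        "GroupBy": group_by,
    }
    results = []
    while True:
        page = ce.get_cost_and_usage(**kwargs)
        results.extend(page["ResultsByTime"])
        token = page.get("NextPageToken")
        if not token:
            return results
        kwargs["NextPageToken"] = token


def total_by_group(results_by_time):
    """Sum UnblendedCost per group key across all returned periods."""
    totals = defaultdict(float)
    for period in results_by_time:
        for group in period.get("Groups", []):
            key = " / ".join(group["Keys"])
            totals[key] += float(group["Metrics"]["UnblendedCost"]["Amount"])
    return dict(totals)
```

Passing `[{"Type": "DIMENSION", "Key": "SERVICE"}]` as `group_by` groups by service; `[{"Type": "TAG", "Key": "Environment"}]` groups by an activated cost allocation tag. Tag groups come back with keys such as `Environment$dev`, and a bare `Environment$` bucket collects untagged spend.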
Enabling the Pieces That Are Off by Default
The first surprise on most engagements is that several useful features are not turned on out of the box. Before you can analyze anything historically, you need to enable Cost Explorer itself in the Billing and Cost Management console, which begins populating data going back up to thirteen months. Cost allocation tags must be activated individually under Billing → Cost Allocation Tags; tagging your resources does nothing for reporting until you mark each tag as active. Activation is not retroactive by default, so spend incurred before a tag was activated shows up as untagged in reports.
For our client, we activated four tags as the baseline: Environment, Owner, Application, and CostCenter. This took twenty minutes in the console and roughly two weeks of nagging engineers to backfill the tags on existing resources. The nagging was the hard part.
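Both halves of that work can be scripted. The sketch below is illustrative rather than what we ran for the client: the helper names are hypothetical, and it assumes each tag key has already appeared on at least one resource so that Billing knows about it. `update_cost_allocation_tags_status` is the Cost Explorer API call behind the console toggle.

```python
REQUIRED_TAGS = ["Environment", "Owner", "Application", "CostCenter"]


def build_activation_request(tag_keys):
    """Build the payload for ce.update_cost_allocation_tags_status."""
    return [{"TagKey": key, "Status": "Active"} for key in tag_keys]


def missing_tags(resource_tags, required=REQUIRED_TAGS):
    """Return the required keys that are absent or blank on one resource.

    resource_tags is the [{'Key': ..., 'Value': ...}] list returned by EC2
    describe calls and by the Resource Groups Tagging API.
    """
    present = {t["Key"] for t in resource_tags if t.get("Value", "").strip()}
    return [key for key in required if key not in present]


def activate_tags(tag_keys=REQUIRED_TAGS):
    """Mark user-defined tags as active cost allocation tags.

    Returns the per-tag error list from the API (empty on full success).
    """
    import boto3  # imported lazily so the helpers above work offline

    ce = boto3.client("ce", region_name="us-east-1")
    response = ce.update_cost_allocation_tags_status(
        CostAllocationTagsStatus=build_activation_request(tag_keys)
    )
    return response.get("Errors", [])
```

Running `missing_tags` over the output of `describe_instances` gives you a concrete backfill list to hand to each team, which makes the nagging more specific.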
Step 1: Establish the Baseline Before Changing Anything
Resist the urge to start terminating instances on day one. The first job is to understand what “normal” looks like, because you cannot prove savings later if you never measured the starting point. In Cost Explorer, I build three saved reports during the baseline phase:
- Monthly spend by service, twelve-month view — establishes which services dominate the bill. For most of our SMB clients, EC2, RDS, and data transfer account for 70 to 85 percent of the total.
- Daily EC2 spend by usage type, ninety-day view — exposes weekend and overnight spend, which is often the clearest waste signal.
- Monthly spend by linked account and tag, twelve-month view — surfaces which teams or applications are driving growth.
Save these reports rather than rebuilding them each visit; Cost Explorer’s saved report feature is genuinely useful and underused. Download the underlying CSVs and archive them in S3 as well, because by default Cost Explorer only shows about thirteen months of history, and you will want a longer trail eventually.
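The weekend signal from the daily EC2 report is easy to quantify once you have the data. This is a rough sketch under a few assumptions: the function names are hypothetical, the query filters on the Cost Explorer service name "Amazon Elastic Compute Cloud - Compute", and costs are unblended.

```python
from datetime import date
from statistics import mean

EC2_SERVICE = "Amazon Elastic Compute Cloud - Compute"


def weekday_weekend_averages(results_by_time):
    """Return (average weekday spend, average weekend spend) from DAILY results."""
    weekday, weekend = [], []
    for period in results_by_time:
        day = date.fromisoformat(period["TimePeriod"]["Start"])
        amount = float(period["Total"]["UnblendedCost"]["Amount"])
        # date.weekday(): Monday is 0, so 5 and 6 are Saturday and Sunday.
        (weekend if day.weekday() >= 5 else weekday).append(amount)
    return (mean(weekday) if weekday else 0.0,
            mean(weekend) if weekend else 0.0)


def fetch_daily_ec2_spend(start, end):
    """Fetch ungrouped daily EC2 spend; start/end are 'YYYY-MM-DD', end exclusive."""
    import boto3  # imported lazily so the helper above works offline

    ce = boto3.client("ce", region_name="us-east-1")
    kwargs = {
        "TimePeriod": {"Start": start, "End": end},
        "Granularity": "DAILY",
        "Metrics": ["UnblendedCost"],
        "Filter": {"Dimensions": {"Key": "SERVICE", "Values": [EC2_SERVICE]}},
    }
    results = []
    while True:
        page = ce.get_cost_and_usage(**kwargs)
        results.extend(page["ResultsByTime"])
        token = page.get("NextPageToken")
        if not token:
            return results
        kwargs["NextPageToken"] = token
```

If the weekend average for a non-production account is close to the weekday average, nothing is being switched off, and off-hours scheduling is probably worth pursuing.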
Step 2: Read the EC2 Right-Sizing Recommendations Honestly
Inside Cost Explorer, under Recommendations → Rightsizing recommendations, AWS analyzes the last fourteen days of CloudWatch metrics for each running instance and suggests modifications or terminations. The recommendations are useful, but they require interpretation; treating them as gospel is a mistake I have watched teams make repeatedly.
The engine looks at maximum CPU utilization, network I/O, and — if you have the CloudWatch agent installed — memory utilization. Without the agent, memory is invisible to the recommendation engine, which means it will happily suggest downsizing a database instance that is CPU-idle but memory-saturated. We always install the CloudWatch agent on candidate instances for at least two weeks before acting on a recommendation; otherwise you are right-sizing on half the data.
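One way to enforce that rule is to pull the recommendations through the API and set aside any that carry no memory figure. This is a sketch only: the function names are hypothetical, and it assumes that a missing or empty `MaxMemoryUtilizationPercentage` means the CloudWatch agent is not reporting for that instance.

```python
def split_by_memory_visibility(recommendations):
    """Split recommendations into (has memory data, needs the agent first)."""
    trusted, needs_agent = [], []
    for rec in recommendations:
        current = rec.get("CurrentInstance", {})
        util = (current.get("ResourceUtilization", {})
                       .get("EC2ResourceUtilization", {}))
        entry = {
            "instance_id": current.get("ResourceId"),
            "action": rec.get("RightsizingType"),  # 'MODIFY' or 'TERMINATE'
            "max_cpu": util.get("MaxCpuUtilizationPercentage"),
            "max_memory": util.get("MaxMemoryUtilizationPercentage"),
        }
        # Assumption: an absent or empty value means no memory metrics exist.
        (trusted if entry["max_memory"] else needs_agent).append(entry)
    return trusted, needs_agent


def fetch_rightsizing_recommendations():
    """Return all EC2 rightsizing recommendations, following pagination."""
    import boto3  # imported lazily so the helper above works offline

    ce = boto3.client("ce", region_name="us-east-1")
    kwargs = {"Service": "AmazonEC2"}
    recs = []
    while True:
        page = ce.get_rightsizing_recommendation(**kwargs)
        recs.extend(page.get("RightsizingRecommendations", []))
        token = page.get("NextPageToken")
        if not token:
            return recs
        kwargs["NextPageToken"] = token
```

Anything in the second list goes into the "install the agent and wait two weeks" queue rather than the change queue.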
A Caveat About the Fourteen-Day Window
The fourteen-day analysis window is a real limitation. Month-end batch jobs, quarterly reporting workloads, and seasonal traffic patterns will not show up if they fall outside that window. For workloads with known cyclical peaks, I cross-reference the recommendation against a custom CloudWatch dashboard covering at least sixty days before approving a change.
For the client engagement, the recommendations identified eleven instances as candidates for downsizing and four for termination. After overlaying the longer-window CloudWatch data, two of the downsizing candidates turned out to be month-end ETL workers that needed their current size; the rest were genuine waste.
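The longer-window cross-check does not strictly need a dashboard. As a hedged sketch (hypothetical function names, daily maximum CPU only, so it will miss memory-bound peaks), the following pulls sixty days of CloudWatch data and reports the highest peak that falls outside the fourteen-day recommendation window.

```python
from datetime import datetime, timedelta, timezone


def peak_outside_window(datapoints, now, window_days=14):
    """Return the highest 'Maximum' older than the recommendation window.

    Returns None when no datapoint is older than window_days.
    """
    cutoff = now - timedelta(days=window_days)
    older = [p["Maximum"] for p in datapoints if p["Timestamp"] < cutoff]
    return max(older) if older else None


def fetch_daily_cpu_max(instance_id, region, days=60):
    """Fetch one daily-maximum CPU datapoint per day for the last `days` days."""
    import boto3  # imported lazily so the helper above works offline

    cw = boto3.client("cloudwatch", region_name=region)
    end = datetime.now(timezone.utc)
    response = cw.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
        StartTime=end - timedelta(days=days),
        EndTime=end,
        Period=86400,  # one datapoint per day keeps the response small
        Statistics=["Maximum"],
    )
    return response["Datapoints"]
```

A downsizing candidate whose older peak sits far above anything in the last fourteen days is exactly the month-end ETL pattern described above, and is worth a conversation with the owning team before any change.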
Step 3: Automate the Boring Cleanup Tasks
Cost Explorer is excellent at telling you where the waste is; it does nothing about waste on its own. For repeatable cleanup, we lean on a small library of Python scripts using Boto3, which is the same approach the AWS administration literature has recommended for years. Think of Cost Explorer as the architecture diagram and the scripts as the contractors who actually do the work.
The two scripts that pay for themselves within the first month of any engagement are an off-hours instance stopper for non-production environments and an unattached EBS volume cleaner. Here is the structure of the off-hours stopper we deploy as a Lambda function on an EventBridge schedule:
# off_hours_stop.py
# Runs nightly via EventBridge at 19:00 in the client's primary region.
# Stops any EC2 instance tagged Environment=dev or Environment=staging
# that does not also carry the override tag AlwaysOn=true.
#
# We deliberately stop rather than terminate so that EBS state is preserved
# and a developer can resume work the next morning without redeployment.
import boto3

# Iterate every region the client uses; hardcoding us-east-1 has burned
# us before when a team spun up GPU instances in us-west-2 for a POC.
REGIONS = ['us-east-1', 'us-west-2', 'eu-west-1']


def stop_non_prod_instances():
    for region in REGIONS:
        ec2 = boto3.client('ec2', region_name=region)
        # Filter on tag values and running state server-side to avoid
        # pulling every instance in the account, and paginate so that
        # large fleets are not silently truncated.
        paginator = ec2.get_paginator('describe_instances')
        pages = paginator.paginate(
            Filters=[
                {'Name': 'tag:Environment', 'Values': ['dev', 'staging']},
                {'Name': 'instance-state-name', 'Values': ['running']}
            ]
        )
        targets = []
        for page in pages:
            for reservation in page['Reservations']:
                for instance in reservation['Instances']:
                    tags = {t['Key']: t['Value'] for t in instance.get('Tags', [])}
                    # Respect the override tag for instances that must stay up
                    # such as nightly build agents or shared dev databases.
                    if tags.get('AlwaysOn', '').lower() != 'true':
                        targets.append(instance['InstanceId'])
        if targets:
            ec2.stop_instances(InstanceIds=targets)
            print(f"{region}: stopped {len(targets)} instances")


def lambda_handler(event, context):
    # Entry point when deployed as a Lambda function; the handler name
    # must match whatever the function configuration specifies.
    stop_non_prod_instances()


if __name__ == '__main__':
    stop_non_prod_instances()
If you are new to invoking AWS from scripts, our walkthrough on PowerShell REST API calls with Invoke-RestMethod covers the same authentication patterns from the Windows side, which is sometimes the easier starting point for infrastructure teams already comfortable with PowerShell.
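The second script mentioned above, the unattached EBS volume cleaner, follows the same shape. This is a minimal sketch rather than our production version: the `Keep=true` override tag and the fourteen-day minimum age are assumptions chosen for illustration, and it defaults to a dry run.

```python
from datetime import datetime, timedelta, timezone


def select_stale_volumes(volumes, now, min_age_days=14):
    """Pick unattached volumes older than min_age_days that lack Keep=true."""
    cutoff = now - timedelta(days=min_age_days)
    stale = []
    for vol in volumes:
        tags = {t["Key"]: t["Value"] for t in vol.get("Tags", [])}
        if vol.get("Attachments"):
            continue  # still attached to something
        if tags.get("Keep", "").lower() == "true":
            continue  # hypothetical override tag
        # CreateTime is when the volume was created, not when it was
        # detached, so treat it as a conservative proxy for age.
        if vol["CreateTime"] <= cutoff:
            stale.append(vol["VolumeId"])
    return stale


def clean_unattached_volumes(region, dry_run=True):
    """List (and, if dry_run is False, delete) stale unattached volumes."""
    import boto3  # imported lazily so the selector above works offline

    ec2 = boto3.client("ec2", region_name=region)
    paginator = ec2.get_paginator("describe_volumes")
    volumes = []
    # status=available means the volume is not attached to any instance.
    for page in paginator.paginate(
        Filters=[{"Name": "status", "Values": ["available"]}]
    ):
        volumes.extend(page["Volumes"])
    targets = select_stale_volumes(volumes, datetime.now(timezone.utc))
    for volume_id in targets:
        if dry_run:
            print(f"{region}: would delete {volume_id}")
        else:
            ec2.delete_volume(VolumeId=volume_id)
    return targets
```

Deleting a volume is irreversible, so a sensible extension is to snapshot each target first and let a lifecycle rule expire the snapshots later.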
Step 4: Set Budgets and Anomaly Detection as Your Safety Net
Right-sizing and cleanup reduce the current bill, but they do nothing to catch the next surprise. AWS Budgets and Cost Anomaly Detection are the safety net, and both should be configured before you consider the engagement complete.
For Budgets, we typically configure three tiers per account: a soft alert at 50 percent of the expected monthly spend, a warning at 80 percent, and a hard alert at 100 percent. Be aware of an important limitation: a plain budget only notifies you and does not stop anything on its own. Budgets actions can apply an IAM policy or service control policy, or stop specific EC2 or RDS instances, when a threshold is crossed; for anything beyond that you have to wire Budgets to an SNS topic and a Lambda function, and even then you should think carefully about what “enforcement” means in production.
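The three tiers map directly onto the `NotificationsWithSubscribers` parameter of the Budgets API. A minimal sketch, assuming email subscribers and alerts on actual (not forecasted) spend; the function names and budget name are hypothetical.

```python
def build_notifications(email, thresholds=(50, 80, 100)):
    """Build one notification per percentage threshold of the budget limit."""
    return [
        {
            "Notification": {
                "NotificationType": "ACTUAL",
                "ComparisonOperator": "GREATER_THAN",
                "Threshold": float(pct),
                "ThresholdType": "PERCENTAGE",
            },
            "Subscribers": [{"SubscriptionType": "EMAIL", "Address": email}],
        }
        for pct in thresholds
    ]


def create_monthly_budget(account_id, name, limit_usd, email):
    """Create a monthly cost budget with the three alert tiers attached."""
    import boto3  # imported lazily so the builder above works offline

    budgets = boto3.client("budgets", region_name="us-east-1")
    budgets.create_budget(
        AccountId=account_id,
        Budget={
            "BudgetName": name,
            "BudgetLimit": {"Amount": str(limit_usd), "Unit": "USD"},
            "TimeUnit": "MONTHLY",
            "BudgetType": "COST",
        },
        NotificationsWithSubscribers=build_notifications(email),
    )
```

Swapping `"SubscriptionType": "EMAIL"` for `"SNS"` with a topic ARN as the address is the hook for the Lambda-based enforcement path, if you decide you want one.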
Cost Anomaly Detection is the more interesting feature for established environments. It builds a machine learning model of your normal spending pattern per service and alerts when a deviation crosses a threshold you define. We typically configure it at the service level for the top five services and at the linked account level for everything else.
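Anomaly monitors and subscriptions can also be created through the Cost Explorer API. Treat the following as a sketch under stated assumptions: the names are hypothetical, the $100 absolute-impact threshold is arbitrary, and the subscription uses a daily email digest (immediate alerts require an SNS subscriber instead). Note that a `SERVICE` dimensional monitor evaluates every service in the account rather than a hand-picked list, and an account can hold only one of them, so the create call fails if one already exists.

```python
def build_service_monitor(name="service-monitor"):
    """Payload for a monitor that tracks spend per AWS service."""
    return {
        "MonitorName": name,
        "MonitorType": "DIMENSIONAL",
        "MonitorDimension": "SERVICE",
    }


def build_subscription(monitor_arn, email, min_impact_usd=100):
    """Payload for a daily digest of anomalies above an absolute dollar impact."""
    return {
        "SubscriptionName": "daily-anomaly-digest",
        "MonitorArnList": [monitor_arn],
        "Subscribers": [{"Type": "EMAIL", "Address": email}],
        "Frequency": "DAILY",
        "ThresholdExpression": {
            "Dimensions": {
                "Key": "ANOMALY_TOTAL_IMPACT_ABSOLUTE",
                "MatchOptions": ["GREATER_THAN_OR_EQUAL"],
                "Values": [str(min_impact_usd)],
            }
        },
    }


def enable_anomaly_detection(email):
    """Create the monitor, then subscribe an email address to it."""
    import boto3  # imported lazily so the builders above work offline

    ce = boto3.client("ce", region_name="us-east-1")
    monitor_arn = ce.create_anomaly_monitor(
        AnomalyMonitor=build_service_monitor()
    )["MonitorArn"]
    return ce.create_anomaly_subscription(
        AnomalySubscription=build_subscription(monitor_arn, email)
    )["SubscriptionArn"]
```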
Step 5: Address Storage, Because EC2 Is Rarely the Whole Story
EC2 right-sizing gets the headlines, but storage waste often runs a close second. Unattached EBS volumes from terminated instances, oversized gp2 volumes that should have been migrated to gp3 years ago, and unused EBS snapshots from a long-forgotten backup script all add up.
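To put a number on the gp2-to-gp3 opportunity, a short script can list candidate volumes with an estimated monthly saving. This is a rough sketch: the per-GiB prices are assumed us-east-1 list prices and should be checked against current pricing, the function names are hypothetical, and the size cap is a deliberately conservative assumption, since larger or throughput-heavy gp2 volumes may need provisioned IOPS or throughput on gp3, which changes the math.

```python
# Assumed us-east-1 list prices in USD per GiB-month; verify before relying on them.
GP2_PRICE = 0.10
GP3_PRICE = 0.08


def gp3_candidates(volumes, max_size_gib=1000):
    """Return (volume_id, size_gib, est_monthly_saving) for eligible gp2 volumes.

    Volumes above max_size_gib are skipped because their gp2 baseline IOPS
    can exceed the gp3 baseline, so they need a manual look.
    """
    out = []
    for vol in volumes:
        if vol["VolumeType"] != "gp2" or vol["Size"] > max_size_gib:
            continue
        saving = round(vol["Size"] * (GP2_PRICE - GP3_PRICE), 2)
        out.append((vol["VolumeId"], vol["Size"], saving))
    return out


def list_gp2_volumes(region):
    """Return every gp2 volume in the region, following pagination."""
    import boto3  # imported lazily so the estimator above works offline

    ec2 = boto3.client("ec2", region_name=region)
    paginator = ec2.get_paginator("describe_volumes")
    volumes = []
    for page in paginator.paginate(
        Filters=[{"Name": "volume-type", "Values": ["gp2"]}]
    ):
        volumes.extend(page["Volumes"])
    return volumes


def migrate_to_gp3(region, volume_id):
    """Change the volume type in place; the volume stays attached and online."""
    import boto3

    ec2 = boto3.client("ec2", region_name=region)
    return ec2.modify_volume(VolumeId=volume_id, VolumeType="gp3")
```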
For object storage, region selection and lifecycle policies matter enormously; we covered the regional dimension in detail in our piece on S3 bucket region selection for Veeam cloud backups, and the same principles apply to general-purpose S3 buckets. Cost Explorer’s usage type breakdown will show you exactly which storage class is dominating your spend.
What the Numbers Looked Like After Ninety Days
For the client I mentioned at the start, the final breakdown was less dramatic than the cleanup work suggested. EC2 right-sizing recovered roughly $1,800 per month, off-hours scheduling on dev and staging recovered another $1,400, EBS cleanup saved around $600, and moving two instances that were candidates for reserved pricing from on-demand to one-year Savings Plans accounted for another $1,200. Total monthly reduction was approximately $5,000, which took the bill from just under $11,000 to roughly $6,000: much closer to the original $4,200 baseline, though not all the way back.
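The arithmetic is simple enough to check by hand, but keeping it in a script makes it easy to rerun each month. The figures below are the rounded numbers from this engagement, and the $11,000 starting bill is an approximation of "just under $11,000".

```python
# Rounded monthly savings in USD, as reported above.
SAVINGS = {
    "ec2_right_sizing": 1800,
    "off_hours_scheduling": 1400,
    "ebs_cleanup": 600,
    "savings_plans": 1200,
}
STARTING_BILL = 11000  # approximation; the actual figure was slightly lower


def summarize(savings, starting_bill):
    """Return (total monthly saving, resulting monthly bill)."""
    total = sum(savings.values())
    return total, starting_bill - total
```

With these inputs the total comes to $5,000 and the resulting bill to about $6,000 per month, and no single category contributes more than about a third of the savings.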
Notice that no single action saved the day; the result came from a methodical pass through several categories. This is the typical shape of a cost optimization engagement, and it is why I am skeptical of any vendor promising 40 percent savings from a single dashboard.
A Practical Takeaway You Can Apply This Week
If you do nothing else after reading this, do these three things in your AWS account this week: enable Cost Explorer if it is not already on, activate at least the Environment and Owner cost allocation tags, and configure Cost Anomaly Detection for your top three services. Total time investment is under an hour, and you will have visibility you almost certainly did not have yesterday.
For the larger right-sizing and automation work, we frequently take this on as a fixed-scope engagement for clients who would rather not build the Boto3 tooling in-house. If that sounds useful, get in touch through our contact form and we can scope an initial review against your current bill. As reference material on the broader governance side, the NIST Cybersecurity Framework and the CIS Benchmarks for AWS both contain useful guidance on the tagging and account structure that make cost work easier to sustain over time.