AWS RDS Running Slow — Here’s How to Actually Fix It

RDS performance has gotten complicated with all the conflicting advice flying around. Upgrade your instance. Tune your queries. Blame your ORM. Add a cache layer. Everyone has an opinion, and most of it costs you time and money before you even know what’s broken.

I’ve spent the last five years managing production RDS clusters, everything from scrappy SaaS startups to companies running real-time analytics at scale, and the same handful of problems shows up every time. Here’s how to diagnose them.

The first rule: run diagnostics before you change anything. Stop guessing.

Most slowness traces back to one of four things. You can find which one in under ten minutes. Here’s how.

Start Here — Check Your CloudWatch Metrics First

Open the AWS console. Go to RDS, find your instance, hit the Monitoring tab. CloudWatch metrics are sitting right there. Don’t scroll past them.

These five numbers tell you almost everything:

  • CPUUtilization — sustained above 80% is a red flag, full stop
  • ReadIOPS and WriteIOPS — compare what you’re seeing against your provisioned IOPS limit
  • DatabaseConnections — if this climbs while queries slow down, you’ve got a connection pooling problem
  • FreeableMemory — drop below 10% of total instance memory and you’re getting cache evictions and misses

If you’d rather pull raw data than stare at dashboards, use the CLI. This command gets CPUUtilization for the past hour:

# GNU date syntax shown; on macOS, swap in: date -u -v-1H +%Y-%m-%dT%H:%M:%S
aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS \
  --metric-name CPUUtilization \
  --dimensions Name=DBInstanceIdentifier,Value=your-db-name \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 300 \
  --statistics Average,Maximum

Sustained 85% CPU with normal query patterns means the instance genuinely can’t handle the load. A spike to 95% for 30 seconds during a batch job? That’s normal. Learning to tell the difference is how you stop chasing false leads.
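If you want CloudWatch to make that call for you, an alarm that requires several consecutive five-minute periods above the threshold catches sustained load and ignores spikes. A minimal sketch; the alarm name, instance name, and SNS topic ARN are placeholders you’d swap for your own:

# Fires only after six consecutive 5-minute periods (30 minutes) above 80%
aws cloudwatch put-metric-alarm \
  --alarm-name rds-cpu-sustained-high \
  --namespace AWS/RDS \
  --metric-name CPUUtilization \
  --dimensions Name=DBInstanceIdentifier,Value=your-db-name \
  --statistic Average \
  --period 300 \
  --evaluation-periods 6 \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:alerts-topic

Six evaluation periods means the alarm only fires after half an hour above 80%, so the batch-job spike never pages anyone.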

IOPS tells a different story. Hitting your provisioned limit consistently means storage is the bottleneck. CPU fine, IOPS fine, queries still crawling? Suspect connection exhaustion or a missing index. They’re not the same problem and they don’t share a fix.

Context matters here. A web app running 70% CPU at peak hours is fine. That same 70% sustained at 3 AM while nothing’s scheduled? Something leaked and is looping. Those are two completely different situations.

Instance Class Is the Most Common Culprit

I’ve watched startups pick a db.t3.medium “to save money” and suffer quietly for six months straight. The t3 family uses CPU credits — burstable performance that sounds great until you actually need it consistently.

Here’s how it works. Under your baseline load, you earn credits. Exceed baseline, you burn them. Burn them all, and your instance gets throttled hard. Your queries go from 40ms to 3 seconds. Everyone panics. Nobody knows why.

Check your burst balance in CloudWatch. In the RDS console, Monitoring tab, look for the CPUCreditBalance metric. If that number drops consistently during peak traffic, you’ve found your villain. The t3 instance isn’t sized for your actual workload — it’s sized for your best-case workload, which apparently doesn’t exist.
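If you’d rather pull it from the CLI, the same get-metric-statistics call works here too; swap your-db-name for your instance, and look at the hourly minimum over the last day to see whether the balance keeps bottoming out:

aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS \
  --metric-name CPUCreditBalance \
  --dimensions Name=DBInstanceIdentifier,Value=your-db-name \
  --start-time $(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 3600 \
  --statistics Minimum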

For what it’s worth, I’ve landed on the db.m6g family: an m6g.large keeps up once traffic gets real in a way t3 instances never did for me. Don’t make my mistake of waiting six months to figure that out.

Here’s how the options compare:

Instance Type    Baseline CPU            Memory   Best For
db.t3.medium     20% per vCPU            4 GB     Development, low-traffic staging
db.t3.large      30% per vCPU            8 GB     Lightweight SaaS, 5–50 concurrent users
db.m6g.large     100% (not burstable)    8 GB     Production web apps, 100+ concurrent users
db.m6g.xlarge    100% (not burstable)    16 GB    High-traffic SaaS, real-time analytics

The cost gap between a t3.medium and an m6g.large runs roughly $40 to $150 per month depending on region. Your engineers spending three days debugging mystery slowness costs more than that delta by Tuesday afternoon.

Upgrading without downtime: spin up a read replica, promote it to standalone, cut your application traffic over, retire the old instance. Most teams do the whole thing in 15 minutes. It’s not the scary operation people make it out to be.
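In CLI terms, the sequence looks roughly like this; the instance identifiers and class are placeholders, and you still handle the application cutover yourself:

# 1. Create a replica on the larger instance class
aws rds create-db-instance-read-replica \
  --db-instance-identifier mydb-upgraded \
  --source-db-instance-identifier mydb \
  --db-instance-class db.m6g.large

# 2. Once replica lag hits zero, promote it to a standalone instance
aws rds promote-read-replica \
  --db-instance-identifier mydb-upgraded

# 3. Point the application at the new endpoint, then delete the old instance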

IOPS Limits Are Silently Throttling Your Queries

An IOPS ceiling is a hard cap on how many read and write operations per second your storage can handle. It’s also invisible: it doesn’t throw obvious errors, it just makes your database look slow for reasons that have nothing to do with your queries or your code.

Most RDS instances ship with gp2 volumes — general purpose SSD. gp2 IOPS scale with volume size at 3 IOPS per GB. A 100GB volume gets 300 baseline IOPS. That sounds reasonable until you actually put traffic on it. Under any real workload, you’re over it before lunch.

Frustrated by gp2 limits on a 500-user workload, we migrated to gp3 using the RDS storage modification wizard, waited about 20 minutes, and watched latency drop by 60% that same afternoon. gp3 lets you provision IOPS independently from volume size — 100GB storage with 3,000 IOPS if you want it. Pricing is predictable and the math works out quickly.

Check CloudWatch for the DiskQueueDepth metric. A queue depth of 2 is normal. A depth of 20 during peak traffic means queries are waiting on disk — not CPU, not memory, not your application code. An average consistently above 10 means your storage genuinely can’t keep up.

Migrating from gp2 to gp3 used to require a full snapshot and restore. Now RDS has a built-in migration path. Start it, grab coffee, come back 20 minutes later. That’s the whole process.
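The same migration from the CLI, as a sketch; the identifier and sizes are placeholders, chosen to match the 500GB / 3,000 IOPS example below. One caveat to note: on RDS for MySQL and Postgres, gp3 volumes under 400GB are pinned to the 3,000 IOPS baseline, so provisioning beyond that requires a bigger volume.

aws rds modify-db-instance \
  --db-instance-identifier your-db-name \
  --storage-type gp3 \
  --allocated-storage 500 \
  --iops 3000 \
  --apply-immediately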

The numbers: gp3 starts at $0.10 per GB-month — same as gp2 — plus $0.015 per provisioned IOPS-month. Provisioning 3,000 IOPS on a 500GB volume adds roughly $45/month. Compare that to one hour of production downtime and the math stops being a conversation.

Too Many Connections Are Killing Performance

Now for the problem nobody talks about until it bites them: connection exhaustion. Your application creates a new database connection for every request. Your Lambda functions all try to connect at once. Suddenly your query latency spikes and your logs fill with “too many connections” errors. This isn’t gradual degradation; it’s a ceiling you hit hard and without warning.

RDS sets max_connections based on instance memory. A db.t3.medium with 4GB RAM gets around 248 connections. A db.m6g.large with 8GB gets around 1,000. Sounds generous. Then 50 Lambda functions spin up simultaneously during a traffic spike and fill the pool in about four seconds.

Check what’s actually happening. Query the database directly:

SELECT COUNT(*) FROM information_schema.processlist WHERE command != 'Sleep';
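That query is the MySQL flavor, matching the processlist table it reads from. If you’re on Postgres, the equivalent check against pg_stat_activity would look like this:

-- Postgres: count connections that are actively doing work
SELECT COUNT(*) FROM pg_stat_activity WHERE state <> 'idle';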

Or open Performance Insights in the RDS console — it gives you a cleaner view. If your application is throwing connection pool errors while CPUUtilization sits at 20%, you have too many connections chasing too few available slots. The instance isn’t overloaded. The connection layer is.

The fix depends on your setup. Traditional web servers: tune your application-level connection pool. Lambda functions: use RDS Proxy. That’s not a suggestion — it’s the right tool for the job, as Lambda requires persistent connection management that individual function invocations simply can’t provide cleanly. RDS Proxy sits between your application and your database, multiplexing thousands of incoming connections down to a manageable number of real database connections. Lambda warm starts reuse existing proxy connections. Latency drops. Errors stop. You stop hitting the ceiling.

Enabling it is straightforward: RDS → your instance → Create Proxy. Pricing is based on the size of the target instance, about $0.015 per vCPU-hour plus data transfer, which works out to roughly $11 per vCPU per month. That’s what makes RDS Proxy endearing to us serverless people who’ve been burned by connection storms.
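If you script your infrastructure, a rough CLI sketch of the same setup follows; the proxy name, secret ARN, role, and subnet IDs are all placeholders, and the Secrets Manager secret holding the database credentials has to exist first:

# Create the proxy (every identifier here is a placeholder)
aws rds create-db-proxy \
  --db-proxy-name my-proxy \
  --engine-family MYSQL \
  --auth 'AuthScheme=SECRETS,SecretArn=arn:aws:secretsmanager:us-east-1:123456789012:secret:mydb-creds' \
  --role-arn arn:aws:iam::123456789012:role/my-rds-proxy-role \
  --vpc-subnet-ids subnet-0abc1234 subnet-0def5678

# Register the database instance as the proxy's target
aws rds register-db-proxy-targets \
  --db-proxy-name my-proxy \
  --db-instance-identifiers your-db-name

Your application then connects to the proxy endpoint instead of the instance endpoint; nothing else in the code changes.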

Enable Slow Query Logs to Find the Real Problem

You’ve checked metrics. Instance class looks right. IOPS are provisioned. Connections are pooled. Queries still drag. Time to find the actual bad query — the one hiding in plain sight.

Enable slow query logs through your RDS parameter group. Edit the group and set these three parameters:

  • slow_query_log = 1
  • long_query_time = 0.5 (anything over 500ms gets logged)
  • log_output = FILE (writes logs to file; then enable the slow query log under “Publish logs to CloudWatch Logs” in the instance settings — AWSLOGS is not a valid value for this parameter)
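If you manage parameter groups from the CLI instead of the console, the same change is a single command. A sketch, with my-db-params as a placeholder group name; all three parameters are dynamic, so ApplyMethod=immediate takes effect without a reboot:

aws rds modify-db-parameter-group \
  --db-parameter-group-name my-db-params \
  --parameters \
    "ParameterName=slow_query_log,ParameterValue=1,ApplyMethod=immediate" \
    "ParameterName=long_query_time,ParameterValue=0.5,ApplyMethod=immediate" \
    "ParameterName=log_output,ParameterValue=FILE,ApplyMethod=immediate"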

Apply it immediately or schedule your maintenance window. Either way, within a few minutes, slow queries start showing up in CloudWatch Logs. First time I did this on a struggling production instance, the culprit showed up in the first 30 seconds of log data. That was embarrassing but also a huge relief.

Use CloudWatch Logs Insights to query them. This finds your ten slowest queries:

parse @message /Query_time: (?<queryTime>[\d.]+)/
| filter @message like /Query_time/
| stats count(*) as frequency, avg(queryTime) as avg_duration by @message
| sort avg_duration desc
| limit 10

Nine times out of ten, a single query owns the slowness list. Copy it, run EXPLAIN on it in a test environment, look for table scans. Almost always a missing index on a foreign key or filter column — something that should have been there from the beginning but wasn’t. Add the index. Watch latency collapse back to normal. Your on-call phone stops ringing at 2 AM.
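Here’s a hypothetical before-and-after, with the table and column names invented for illustration:

-- Hypothetical schema: orders.customer_id gets filtered on constantly
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;
-- "type: ALL" in the output means a full table scan

CREATE INDEX idx_orders_customer_id ON orders (customer_id);

-- Re-run the EXPLAIN: type becomes "ref" and the rows estimate collapses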

That’s the full diagnostic sequence. CloudWatch first. Instance class second. IOPS third. Connections fourth. Slow query logs fifth. Follow this order and you’ll find the problem inside ten minutes — in your actual workload, not some theoretical worst case scenario someone wrote a blog post about in 2019.
