AWS API Gateway Timeouts: How to Diagnose and Fix

Why API Gateway Timeouts Are Confusing to Debug

Debugging AWS API Gateway timeouts has gotten complicated with all the vague 504s and 503s flying around. I’ve burned more late nights than I’d like to admit staring at generic error responses — no stack trace, no layer identification, nothing useful. Just AWS shrugging at you like a broken vending machine.

But what is an API Gateway timeout, really? In essence, it’s a request that exceeded a time boundary somewhere in the stack. But it’s much more than that — because there isn’t one boundary. There are three, and your request can silently die at any of them.

  • API Gateway integration timeout — Hard 29-second limit. Non-negotiable.
  • Lambda duration or backend service latency — Your actual code running slow.
  • VPC and load balancer network latency — The invisible 100–500ms culprit nobody thinks about until they’re debugging at 11 PM.

Today, I will share it all with you. This is the exact diagnostic flow I’ve used on production systems to pinpoint which layer is failing. Under 15 minutes, if you know where to look. So, without further ado, let’s dive in.

Step 1 — Check CloudWatch Logs for the Exact Failure Point

Most engineers never enable execution logging on their API Gateway stages, which means they're flying completely blind when something breaks at 2 AM.

Navigate to the AWS Console: API Gateway → Your API → Stages → Select Your Stage → Logs/Tracing.

Enable CloudWatch Logs and set it to at least INFO level. You want execution logs flowing into CloudWatch so you can actually see what’s happening inside the gateway — not just guess.
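If you'd rather script this than click through the console, the same setting can be flipped with the CLI. This is a sketch; the API ID and stage name are placeholders for your own values:

```shell
# Turn on INFO-level execution logging for every method on the stage.
aws apigateway update-stage \
  --rest-api-id [api-id] \
  --stage-name [stage-name] \
  --patch-operations op=replace,path='/*/*/logging/loglevel',value=INFO
```

Note that API Gateway also needs a CloudWatch Logs role ARN configured at the account level, or this call will succeed but no logs will flow.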

Once logging is on, trigger a request that times out. Head to CloudWatch → Logs → Log Groups and find the group for your API. It’ll be named something like API-Gateway-Execution-Logs_[api-id]/[stage-name]. Ugly name. Useful data.

Look for these two metrics in the execution log:

  • IntegrationLatency — How long your backend took to respond.
  • Latency — Total time from API Gateway’s perspective.

The gap between those two numbers tells you a lot. IntegrationLatency at 28,500ms with a Latency of 29,100ms means your backend is chewing through almost the entire 29-second budget. That’s your first real clue — and honestly, it narrows things down fast.

To pull this data quickly, use CloudWatch Logs Insights against logs that record these values as structured fields. Run this query:

fields @timestamp, integrationLatency, latency, status
| filter status like /5\d\d/
| stats max(integrationLatency) as max_integration, max(latency) as max_total by bin(5m)

This surfaces maximum integration and total latency across 5-minute windows for failed requests. Integration latency consistently hovering near 29,000ms means you’re slamming into the hard limit. Simple as that.
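The console isn't always handy, so here's a sketch of running the same query from the CLI. The log-group name is a placeholder, and Insights queries are asynchronous: start-query returns a query ID that you then poll with get-query-results until the status is Complete:

```shell
# Kick off the Insights query over the last hour (times are epoch seconds).
QUERY_ID=$(aws logs start-query \
  --log-group-name "API-Gateway-Execution-Logs_[api-id]/[stage-name]" \
  --start-time $(($(date +%s) - 3600)) \
  --end-time $(date +%s) \
  --query-string 'fields @timestamp, integrationLatency, latency, status | filter status like /5\d\d/ | stats max(integrationLatency) as max_integration, max(latency) as max_total by bin(5m)' \
  --output text --query queryId)

# Poll for results; rerun until "status": "Complete" appears.
aws logs get-query-results --query-id "$QUERY_ID"
```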

Step 2 — Identify If Your Backend Is the Bottleneck

High integration latency tells you the backend is slow. Now you need to know why — and that part requires actually looking at your function metrics.

For Lambda, open the CloudWatch Metrics console, navigate to the Metrics tab for your function, and pull up the Duration metric over the last hour. Focus on the 99th percentile. Consistently above 20 seconds? Lambda is your problem.

You can also pull this from the CLI in about 10 seconds (note that date -d '1 hour ago' is GNU syntax; on macOS, use date -u -v-1H +%Y-%m-%dT%H:%M:%S instead):

aws cloudwatch get-metric-statistics \
  --namespace AWS/Lambda \
  --metric-name Duration \
  --dimensions Name=FunctionName,Value=your-function-name \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 300 \
  --extended-statistics p99

If you have X-Ray tracing enabled (and you should; pure log-diving has never given me the full picture), the service map shows exactly where the time goes: database queries, external API calls, whatever. Navigate to X-Ray → Service Map and click your Lambda node to see the subsegment breakdown. Don’t make my mistake of skipping X-Ray setup during the initial deployment.

Not running Lambda? If your backend sits behind an ALB or external service, check that service’s own response-time metrics directly. SSH into an EC2 instance behind the load balancer and run something like curl -s -o /dev/null -w '%{time_total}\n' https://your-backend to get real numbers fast.

One thing worth burning into memory: Lambda itself can run for up to 15 minutes, but API Gateway abandons the request at 29 seconds regardless of whether your function eventually finishes. The fix has to happen on the backend side. The gateway limit isn’t adjustable upward.

Step 3 — Fix the Timeout Based on What You Found

You now know whether the issue is genuine backend slowness or an architectural constraint hitting the 29-second wall. The fix depends entirely on which one you’re dealing with.

Scenario A — Your Backend Is Actually Slow

Optimize it. Profile the Lambda with CloudWatch Logs or X-Ray. Find the slow database query, the unoptimized loop, the missing index — it’s usually something mundane. This is the unglamorous work that solves most timeout problems.

If optimization isn’t realistic — some operations just genuinely take time — move the work async. Use SQS to queue the job and return immediately to the client:

import boto3
import json

sqs = boto3.client('sqs')

def lambda_handler(event, context):
    # Queue the slow work and return immediately; a worker picks it up later.
    response = sqs.send_message(
        QueueUrl='https://sqs.us-east-1.amazonaws.com/[account]/[queue-name]',
        MessageBody=json.dumps(event)
    )
    # 202 Accepted: the request was queued, not completed.
    return {
        'statusCode': 202,
        'body': json.dumps({'message': 'Processing', 'jobId': response['MessageId']})
    }

AWS Step Functions works well for complex workflows needing orchestration. Either pattern keeps the synchronous API response under a few seconds — which is all API Gateway actually cares about.
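For completeness, here's a minimal sketch of the worker on the other side of the queue, assuming the SQS queue is wired up as the Lambda's event source. The handler name and the shape of the job payload are illustrative:

```python
import json

def worker_handler(event, context):
    """Consume jobs that the API handler queued via SQS.

    This runs outside the request path, so the 29-second gateway
    limit no longer applies; only the Lambda timeout does.
    """
    processed = 0
    for record in event['Records']:
        # Each record body is the original API event, serialized by the producer.
        job = json.loads(record['body'])
        # ... do the genuinely slow work with `job` here ...
        processed += 1
    return {'processed': processed}
```

Failed records go back to the queue after the visibility timeout, so pair this with a dead-letter queue to avoid infinite retries.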

Scenario B — The 29-Second Limit Is the Architectural Constraint

Your backend legitimately needs 35 seconds, and no amount of optimization will change that. You’ve got three real paths forward:

  1. WebSocket API — Keep the connection open, kick off the work asynchronously, and push the result to the client through the socket when it completes, so the response no longer has to fit inside a single request/response window.
  2. Lambda async invocation with polling — Return a job ID immediately, let the client poll a status endpoint.
  3. Step Functions — Orchestrate multi-step workflows with built-in retry logic and timeout handling at each step.
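Path 2 is the one I reach for most often. The client-side half of it boils down to a small polling loop; here's a generic sketch, where fetch_status stands in for whatever call hits your hypothetical status endpoint:

```python
import time

def poll_until_done(fetch_status, interval=2.0, timeout=120.0):
    """Poll a job-status endpoint until it reports completion.

    fetch_status: a callable returning a dict such as {'status': 'PENDING'}
    or {'status': 'DONE', ...}. Raises TimeoutError if the job never
    finishes inside the polling window.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = fetch_status()
        if result.get('status') == 'DONE':
            return result
        time.sleep(interval)  # back off between polls
    raise TimeoutError('job did not finish within the polling window')
```

In production you'd typically add jitter or exponential backoff to the interval so a thundering herd of clients doesn't hammer the status endpoint.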

If someone on your team set a custom integration timeout lower than 29 seconds — which happens more than it should — you can correct it via CLI:

aws apigateway update-integration \
  --rest-api-id [api-id] \
  --resource-id [resource-id] \
  --http-method POST \
  --patch-operations op=replace,path=/timeoutInMillis,value=29000

Default is 29,000 milliseconds. Leave it there unless you have a very specific, documented reason to go lower.

Common Mistakes That Make API Gateway Timeouts Worse

Three gotchas I see constantly — and have personally committed at least two of:

Not enabling X-Ray tracing. Without it, you’re debugging with one hand tied behind your back. Enable it immediately. It costs pennies per million traced requests. Genuinely no excuse to skip it.

Setting a custom integration timeout too low. I watched someone set it to 5,000 milliseconds to “force Lambda to be fast.” That was 2022. It didn’t make Lambda faster. It just broke every request taking more than 5 seconds — which was most of them. The hard limit is 29 seconds. Use it.

Ignoring Lambda cold starts as a timeout factor. A cold start can add 5–10 seconds to the first invocation after a deployment or idle period. If your function barely squeaks under 29 seconds on warm invocations, cold starts will push you over. Consider provisioned concurrency if cold starts are eating your timeout budget — at roughly $0.015 per GB-hour, it’s cheaper than the incident response time you’ll spend debugging it at midnight.
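If you decide provisioned concurrency is worth the cost, enabling it is one call. Note that it must target a published version or alias, never $LATEST; the alias name and instance count below are placeholders:

```shell
# Keep 5 execution environments warm for the "live" alias.
aws lambda put-provisioned-concurrency-config \
  --function-name your-function-name \
  --qualifier live \
  --provisioned-concurrent-executions 5
```

Provisioning takes a minute or two to reach READY status; check progress with aws lambda get-provisioned-concurrency-config.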

Next time you see a 504: check CloudWatch Logs for IntegrationLatency, check Lambda Duration or your backend’s response time, then fix either the slowness or the architecture. Fifteen minutes. Done.

Marcus Chen
