AWS ECS Task Failing to Start? How to Fix It

Why ECS Tasks Fail to Start

Debugging ECS task failures has gotten complicated with all the misleading advice flying around: “just redeploy it,” “check CloudWatch,” “increase your memory limits.” As someone who has spent an embarrassing number of late nights staring at the ECS console, I’ve learned which failure patterns actually matter and where each one hides. Today, I’ll share it all with you.

Most failures cluster around four specific root causes: IAM execution role problems, container image pull failures, insufficient cluster resources, and task definition misconfiguration. Each one leaves a distinct fingerprint if you know where to look, and that’s the part most engineers skip entirely. They jump straight to restarting things or tweaking random settings. Backwards. The error message is your roadmap, and ECS logs one every single time a task dies.

So, without further ado, let’s dive in. We’ll cover how to pull the actual stopped reason from the console and CLI, how to fix IAM permission failures (and stop confusing execution roles with task roles), how to diagnose image pull errors and URI misconfigurations, and how to handle resource constraints and networking issues. By the end, you’ll have a five-point checklist to use every time a task goes sideways.

Check the Stopped Task Error in the Console First

This step should be automatic. Every time an ECS task fails, it logs a reason. Every single time. Yet I watch engineers skip it almost habitually — jumping straight to redeploying or hunting through CloudWatch logs that may not even exist if the container never actually started.

Open the ECS console. Navigate to your cluster, select the service or task, hit the Tasks tab. Find the task sitting in STOPPED or PENDING status and click into it. Scroll down to the “Stopped reason” field. That’s where the real error lives. Not in CloudWatch. Not in your application logs. Right there.

You’ll see messages like CannotPullContainerError: Error response from daemon: manifest for 123456789.dkr.ecr.us-east-1.amazonaws.com/my-app:latest not found, or CannotStartContainerError: Error response from daemon: OCI runtime create failed, or Task failed to run: insufficient CPU available. These are not vague. They tell you exactly what broke. Write them down. Seriously.

If you prefer the CLI, use this command:

aws ecs describe-tasks \
  --cluster your-cluster-name \
  --tasks arn:aws:ecs:us-east-1:123456789:task/your-cluster-name/abc123def456 \
  --region us-east-1

Swap in your actual cluster name, task ARN, and region. The response includes a stoppedReason field with the same error message you’d see in the console. Pipe it through jq to pull just that field:

aws ecs describe-tasks \
  --cluster your-cluster-name \
  --tasks arn:aws:ecs:us-east-1:123456789:task/your-cluster-name/abc123def456 \
  --region us-east-1 | jq '.tasks[0].stoppedReason'

That error message is your diagnostic anchor. Everything else we do flows from it. Don’t skip this step and don’t paraphrase what it says — copy the exact text.
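
If you don’t have the task ARN handy, you can list recently stopped tasks and dump all of their reasons in one pass. A quick sketch using the same placeholder cluster and region as above (ECS only retains stopped tasks for a short window, roughly an hour, so run this soon after the failure; it errors if there are no stopped tasks at all):

aws ecs describe-tasks \
  --cluster your-cluster-name \
  --tasks $(aws ecs list-tasks \
    --cluster your-cluster-name \
    --desired-status STOPPED \
    --region us-east-1 | jq -r '.taskArns[]') \
  --region us-east-1 | jq -r '.tasks[] | "\(.taskArn): \(.stoppedReason)"'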

Fix IAM Execution Role and Task Role Errors

Early on, I spent an entire afternoon confused about execution roles versus task roles, and that confusion cost me more than I’d like to admit. Don’t make my mistake. These are two completely different things.

So what is the execution role? It’s the IAM role that ECS itself uses, before your container even boots, to pull your container image, fetch secrets, and write logs. The task role, by contrast, is what your running application uses to call AWS services like S3 or DynamoDB. Mix them up and you’ll waste your afternoon.

The execution role must have at least the AmazonECSTaskExecutionRolePolicy managed policy attached. Without it, you’ll hit CannotPullContainerError or CannotStartContainerError the moment ECS tries to pull the image or fetch a secret.

Go check your execution role in IAM. Search for the role name — usually something like ecsTaskExecutionRole, though teams name these all kinds of things. Verify the trust relationship looks exactly like this:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "ecs-tasks.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

Missing that trust policy entirely? ECS can’t assume the role at all. Add it immediately and move on.
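
If you’d rather fix it from the CLI, write the trust policy to a file and update the role in place. A sketch, assuming the role carries the common name ecsTaskExecutionRole (substitute whatever your team actually named it):

cat > trust-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "ecs-tasks.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF

aws iam update-assume-role-policy \
  --role-name ecsTaskExecutionRole \
  --policy-document file://trust-policy.json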

Next, confirm these policies are attached:

  • AmazonECSTaskExecutionRolePolicy (managed policy) — allows ECS to pull images and write logs
  • If you’re pulling from a private ECR repository, add AmazonEC2ContainerRegistryReadOnly or a custom policy granting ecr:GetAuthorizationToken and ecr:BatchGetImage
  • If your task definition references secrets in Secrets Manager or Parameter Store, add secretsmanager:GetSecretValue and ssm:GetParameters
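
Attaching the managed policies from the CLI is one command each. Again assuming the role is named ecsTaskExecutionRole:

aws iam attach-role-policy \
  --role-name ecsTaskExecutionRole \
  --policy-arn arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy

# Only needed if you pull from a private ECR repository
aws iam attach-role-policy \
  --role-name ecsTaskExecutionRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly

# Verify what's actually attached
aws iam list-attached-role-policies --role-name ecsTaskExecutionRole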

The minimum policy statement for ECR access looks like this:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ecr:GetAuthorizationToken",
        "ecr:BatchGetImage",
        "ecr:GetDownloadUrlForLayer"
      ],
      "Resource": "*"
    }
  ]
}

After updating the role, new tasks pick up the permissions immediately. Existing failed tasks won’t restart themselves — you’ll need to update the service or manually trigger a new task run.
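
For a service, the quickest way to trigger that is to force a new deployment, which launches fresh tasks that pick up the updated role:

aws ecs update-service \
  --cluster your-cluster-name \
  --service your-service-name \
  --force-new-deployment \
  --region us-east-1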

Fix Image Pull and Container Definition Errors

Once IAM is cleared, the next failure point is usually the image itself. Wrong URI, missing tag, unreachable registry. Each one looks slightly different in the error output.

CannotPullContainerError: Error response from daemon: manifest for 123456789.dkr.ecr.us-east-1.amazonaws.com/my-app:latest not found means the tag doesn’t exist in that repository. The repository itself might be fine; the specific tag you specified isn’t there. That’s a meaningful distinction. (A pull access denied message, on the other hand, means permissions, which puts you back in the IAM section above.)

Verify the image URI in your task definition. Go to ECS, open the task definition, check the container image field. It should match one of these formats:

  • ECR: 123456789.dkr.ecr.us-east-1.amazonaws.com/my-app:latest
  • Docker Hub: library/nginx:1.21 or your-username/your-image:v1.0
  • Private registry: registry.example.com:5000/my-app:v1.0

Then verify the image actually exists. For ECR, run this:

aws ecr describe-images \
  --repository-name my-app \
  --region us-east-1 | jq '.imageDetails[].imageTags'

That lists every tag in the repository. If yours isn’t there, the build failed or you’re referencing an old task definition revision. Rebuild, push with the correct tag, and redeploy.
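
For reference, the full rebuild-and-push cycle for ECR looks roughly like this, using the same placeholder account and repository as above:

# Authenticate Docker to your ECR registry
aws ecr get-login-password --region us-east-1 | \
  docker login --username AWS --password-stdin 123456789.dkr.ecr.us-east-1.amazonaws.com

# Build, tag, and push with an explicit version tag
docker build -t my-app:v1.2.3 .
docker tag my-app:v1.2.3 123456789.dkr.ecr.us-east-1.amazonaws.com/my-app:v1.2.3
docker push 123456789.dkr.ecr.us-east-1.amazonaws.com/my-app:v1.2.3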

Here’s a gotcha I’ve personally hit at least four times. You tag an image as latest, push it, deploy it, and everything’s fine. Later you rebuild and push a new latest. Now the old task definition still says latest, but latest points somewhere completely different. I kept doing this with a Node.js service and spent 45 minutes confused each time. Use version tags like v1.2.3 instead of relying on latest, at least if you want predictable deployments.
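
You can also have ECR enforce this for you. Turning on tag immutability makes a second push to an existing tag fail loudly instead of silently replacing the image:

aws ecr put-image-tag-mutability \
  --repository-name my-app \
  --image-tag-mutability IMMUTABLE \
  --region us-east-1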

One more thing worth checking: networking. If your task runs in a private subnet with no NAT gateway and no VPC endpoint for ECR, it simply cannot reach the registry. The pull will time out and fail. Check the subnet’s route table; it should have either a NAT gateway route or a VPC endpoint for ECR configured. Neither exists? Add one, or move the task to a public subnet with an internet gateway attached (and, for Fargate, make sure the task is assigned a public IP).
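
Both checks are scriptable. A sketch with placeholder subnet and VPC IDs (note the first query returns nothing if the subnet is implicitly associated with the main route table; check that table instead):

# Routes for the subnet the task runs in
aws ec2 describe-route-tables \
  --filters Name=association.subnet-id,Values=subnet-0abc123 \
  --region us-east-1 | jq '.RouteTables[].Routes'

# Any VPC endpoints already configured in the VPC
aws ec2 describe-vpc-endpoints \
  --filters Name=vpc-id,Values=vpc-0abc123 \
  --region us-east-1 | jq '.VpcEndpoints[].ServiceName'

For ECR specifically, you need both the ecr.api and ecr.dkr interface endpoints plus an S3 gateway endpoint, since image layers are actually served from S3.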

Fix Resource and Networking Failures

Probably should have opened with this section, honestly. A surprising number of teams configure their clusters with undersized instances or overly tight security groups and then stare blankly at PENDING tasks wondering what went wrong.

For the EC2 launch type, ECS has to reserve CPU and memory on a real instance to place your task. If your task definition requests 512 CPU units and 1024 MB of memory — and every instance in your cluster is already at capacity — you’ll see Task failed to run: insufficient CPU available or no container instances with enough resources available. Scale up the Auto Scaling Group or swap in larger instance types. That’s the fix.
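
To see exactly how much headroom each instance has left before you scale anything, ask ECS for the remaining resources. A sketch using the same cluster placeholder:

aws ecs describe-container-instances \
  --cluster your-cluster-name \
  --container-instances $(aws ecs list-container-instances \
    --cluster your-cluster-name \
    --region us-east-1 | jq -r '.containerInstanceArns[]') \
  --region us-east-1 | \
  jq '.containerInstances[] | {instance: .ec2InstanceId, remaining: .remainingResources}'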

For Fargate, CPU and memory combinations are fixed. Your task definition must use one of the supported pairings, like 256 CPU with 512, 1024, or 2048 MB of memory, or 1024 CPU with 2048–8192 MB of memory. Specify an invalid combination and the task simply won’t start. Check the Fargate documentation for the exact valid combinations in your region. That’s what makes Fargate both simple and occasionally maddening to those of us who use it daily.
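
For reference, here’s a minimal Fargate task definition fragment using one of the valid pairings (1024 CPU with 2048 MB). The family, role, and image names are placeholders; note that at the task level Fargate takes cpu and memory as strings and requires the awsvpc network mode:

{
  "family": "my-app",
  "requiresCompatibilities": ["FARGATE"],
  "networkMode": "awsvpc",
  "cpu": "1024",
  "memory": "2048",
  "executionRoleArn": "arn:aws:iam::123456789:role/ecsTaskExecutionRole",
  "containerDefinitions": [
    {
      "name": "my-app",
      "image": "123456789.dkr.ecr.us-east-1.amazonaws.com/my-app:v1.2.3",
      "essential": true
    }
  ]
}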

Networking failures are subtler. For Fargate tasks, verify:

  • The subnets are correct and have available IP addresses remaining
  • The security group allows outbound traffic on port 443 so the task can pull images and reach AWS APIs (check the rules in the EC2 console; if you route through a VPC endpoint, its security group must allow 443 too)

Port conflicts are a separate problem, and they only apply to the EC2 launch type. If you mapped host port 8080 in your task definition and another task on the same instance already claimed port 8080, the new task can’t bind to it and will fail to start. The fix is straightforward: change the host port mapping, let ECS assign a dynamic host port, or reschedule onto a different instance.
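
Dynamic host ports are the cleanest fix when many copies of a task share an instance. With the bridge network mode on the EC2 launch type, setting hostPort to 0 in the container definition lets ECS pick a free ephemeral port for each task, and a load balancer target group tracks the assigned ports automatically:

"portMappings": [
  {
    "containerPort": 8080,
    "hostPort": 0
  }
]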

Here’s your diagnostic checklist. Bookmark this and work through it in order every time a task fails:

  1. Pull the stopped reason from the ECS console or the aws ecs describe-tasks CLI command
  2. Verify the IAM execution role has the correct trust policy and AmazonECSTaskExecutionRolePolicy attached
  3. Check the image URI in the task definition and confirm the tag actually exists in your registry
  4. Confirm resource availability — CPU, memory, and port mappings aren’t conflicting
  5. Validate networking — subnets, security groups, and VPC endpoints are all configured correctly

Most task failures resolve somewhere in the first three steps. Use the error message as your guide — it’s rarely lying to you — and you’ll rarely need to dig into steps four and five.
