
I am using a CloudFormation template to build the infrastructure (an ECS Fargate cluster). The template executed successfully and the stack was created, but the task failed with the following error:

Task failed ELB health checks in (target-group arn:aws:elasticloadbalancing:eu-central-1:890543041640:targetgroup/prc-service-devTargetGroup/97e3566c8b307abf)

I am not sure what or where to look to troubleshoot this issue. Since it is a Fargate cluster, I also don't see how to log in to the container and run some health check queries to debug further.

Can someone please guide me further? Due to this error I am not even able to access my web app, since the ALB won't route traffic to an unhealthy target.

What I did

After some googling, I found this post: https://aws.amazon.com/premiumsupport/knowledge-center/troubleshoot-unhealthy-checks-ecs/

However, I think that post relates to the EC2 launch type; in my case there is no EC2 instance.

If it helps, I can paste the entire template as well.

please help

MLavoie
user2315104
  • please suggest. stuck up here – user2315104 Feb 04 '19 at 08:58
    These types of questions are actually awesome, because the undocumented (which most of the times are not documentable to begin with) aspects of the services are being very well documented... – Romeo Sierra Nov 17 '21 at 11:34
  • "I am not getting how to login to container and execute some health check queries to debug further." Curious, what checks would you do if you could? I am running on EC2 atm and getting the same error. I was left a CF template that has the app set up on one AWS account, but I'm trying to split the production app into its own AWS account and the service just keeps rebooting. The IAM execution role, task role, and task definition container image are all the same. "Task failed ELB health checks in (target-group arn:aws:elasticloadbalancing:eu-west-2:***:targetgroup/stage-quotation/***)" – Sigex Jan 22 '22 at 14:57

16 Answers

34

This is resolved. It was an issue with the following two points:

  • The Docker container port mapping to the host port was incorrect.
  • The ALB health check interval was too short. Because of that, the ALB gave up immediately instead of waiting for the Docker container to be up and running properly.

After making these changes, it worked properly.
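A quick way to reason about the interval problem: the ALB marks a target unhealthy after a number of consecutive failed checks, one per interval. A minimal sketch (field names mirror the target-group settings; the numeric values are illustrative, not from any real stack):

```typescript
// Sketch: how long a slow-starting container has before the ALB marks it
// unhealthy. Names mirror the target-group settings; values are examples.
interface HealthCheckConfig {
  intervalSeconds: number;    // HealthCheckIntervalSeconds
  unhealthyThreshold: number; // UnhealthyThresholdCount
}

// Worst case: a target that never answers is failed after this many seconds.
function failureWindowSeconds(hc: HealthCheckConfig): number {
  return hc.intervalSeconds * hc.unhealthyThreshold;
}

const strict = failureWindowSeconds({ intervalSeconds: 5, unhealthyThreshold: 2 });
const relaxed = failureWindowSeconds({ intervalSeconds: 30, unhealthyThreshold: 5 });
console.log(strict, relaxed); // 10 vs 150 seconds of grace for startup
```

If the container needs longer than that window to boot, the service will keep recycling tasks no matter how correct everything else is.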

user2315104
    Glad to know that your issue was resolved; but, your own question was how to debug when a health check fails and there's not much to go by in the link. Have you by any chance found any way of accessing the docker logs like aws has on EB for example? Would be great if you updated your answer with any new info you have. Thanks. – AlexanderF Apr 05 '19 at 17:34
    If you are deploying through ECS, in AWS console there are some information in Cluster > Tasks. Choose the stopped tasks you may see the error message. May be something like "service ...-service (instance 10.0.0.29) (port 8080) is unhealthy in target-group ...-service due to (reason Request timed out)". – ozooxo May 21 '19 at 11:04
    Can you please elaborate on the "Docker container port mapping with host port were incorrect"? What exactly was wrong, where and how did you fix it? – E. Muuli Jun 04 '19 at 14:27
    +1 ***ALB health check interval time was very short*** was my takeaway. I have seen something that had only 60 seconds set, which was taking a lot longer to complete startup due to a number of Kafka topics etc. to be set up. – Romeo Sierra Nov 17 '21 at 11:31
  • I face the same issue and was able to solve the issue by increasing the interval time. – Oshada Aug 27 '22 at 15:41
  • I spent hours on this error and my issue was 'Docker container port mapping with host port were incorrect'. When you deploy your image in container make sure port is same as present in Task definition and ECS service. – Ali Hasan Nov 20 '22 at 13:19
19

There are quite a few different possible reasons for this issue, not only open ports:

  • Improper IAM permissions for the ecsServiceRole IAM role
  • Container instance security group
  • Elastic Load Balancing load balancer not configured for all Availability Zones
  • Elastic Load Balancing load balancer health check misconfigured
  • Unable to update the service servicename: Load balancer container name or port changed in task definition

AWS therefore has a dedicated page addressing the possible causes of this error:

https://docs.aws.amazon.com/en_en/AmazonECS/latest/developerguide/troubleshoot-service-load-balancers.html

Edit: in my case, the status code returned by my application's health endpoint was different. The expected success code defaults to 200, but you can also configure a range such as 200-499.
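The success-code matching can be sketched like this (a hypothetical helper, not an AWS API; the matcher string format follows the target group's "Success codes" field: a single code, a range, or a comma-separated mix):

```typescript
// Hypothetical helper mirroring how a target group evaluates its
// "Success codes" matcher: "200", "200-299", or "200,302".
function matchesHealthCode(status: number, matcher: string): boolean {
  return matcher.split(",").some((part) => {
    const [lo, hi] = part.trim().split("-").map(Number);
    // A lone code must match exactly; a range matches inclusively.
    return hi === undefined || Number.isNaN(hi)
      ? status === lo
      : status >= lo && status <= hi;
  });
}

// An endpoint that answers with a redirect fails the default matcher
// but passes a widened range:
console.log(matchesHealthCode(302, "200"));     // false
console.log(matchesHealthCode(302, "200-499")); // true
```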

Rene B.
8

Let me share my experience.

In my case everything was correct except the host the server listened on: it was localhost, which makes the server unreachable from the outside world, so the health check failed. It should be 0.0.0.0 (or left empty in some libraries).
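A minimal Node sketch of the difference (the port and path are illustrative; only the bind-host argument matters here):

```typescript
import http from "node:http";

// Respond 200 on the health-check path, 404 elsewhere.
const server = http.createServer((req, res) => {
  if (req.url === "/__health") {
    res.writeHead(200, { "Content-Type": "application/json" });
    res.end(JSON.stringify({ ok: true }));
  } else {
    res.writeHead(404);
    res.end();
  }
});

// Binding to "127.0.0.1" would make the server reachable only from inside
// the task, so the ALB health check times out. "0.0.0.0" (also the default
// when no host is given) listens on all interfaces.
server.listen(8080, "0.0.0.0", () => {
  console.log("listening on 0.0.0.0:8080");
});
```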

Dharman
nacholibre
4

I got this error message because the security group between the ECS service and the load balancer target group was only allowing HTTP and HTTPS traffic.

Apparently the health check happens over some other port and/or protocol, as updating the security group to allow all traffic on all ports (as suggested at https://docs.aws.amazon.com/AmazonECS/latest/userguide/create-application-load-balancer.html) made the health check work.

tschumann
  • I had to add the port that the app was running on in ECS to the security group. – jones-chris Jan 07 '20 at 15:38
  • I had configured the ELB health check on HTTPS only, so I had to allocate more resources to run my app. When I changed the configuration to HTTP, I could run with fewer resources. – Meraj al Maksud Jul 12 '21 at 19:32
3

I had this exact same problem. I was able to get around the issue by:

  1. navigate to EC2 service
  2. then select Target Group in the side panel
  3. select your target group for your load balancer
  4. select the health check tab
  5. make sure the health check path in the target group matches an endpoint your application actually serves. This tells your ELB which endpoint to hit when conducting its health check. In my case my health check path was /health.
Ryan Forte
1

In my case, the ECS Fargate task runs a Docker container that works as a background service, not as a web app or API; it does not listen on any port (e.g. a cron scheduler or an ActiveMQ message consumer).

In other words, it is a client, not a server node. So I made it listen on localhost for the health check only.

All I did was add the health check path /__health in the Target Group, and the code below in index.ts:

import express from 'express';

const app = express();
const port = process.env.PORT || 8080;

// Health check endpoint for the ALB target group
app.get('/__health', (_, res) => res.send({ ok: 'yes' }));

app.listen(port, () => {
  console.log(`Health Check: Listening at http://localhost:${port}`);
});
iravinandan
1

My case was a React application running on FARGATE mode.

The first issue was that the Docker image was built on NodeJS, "serving" the app with:

CMD npm run start # react-scripts start

Besides not being good practice at all, this requires a lot of resources (4GB & 2vCPU were not enough), and because of that the checks were failing. (this article mentions this as a probable cause)

To solve the previous issue, we modify the image as a multistage build with NodeJS for the building phase + NGINX for serving the content. Locally that was working great, but we haven't realized that the default port for NGINX is 80, and you can not use a different host and container port on FARGATE with awsvpc network mode.

To troubleshoot it, I launched an EC2 instance with the right Security Groups to connect to the FARGATE targets on the same port on which the Load Balancer was failing to perform the Health Check. I was able to execute curl commands against other targets, but with this unhealthy target (constantly being recycled) I received an instant Connection refused response. It wasn't a timeout, which told me that the target was not able to handle the request because it was not listening on that port. Then I realized that my container was expecting traffic on port 80 while my application was configured to listen on a 3xxx port.

The solution here was to modify the default configuration of NGINX to listen to the port we wanted, re-build the image and re-launch the service.
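For reference, the NGINX side of that fix is a one-line change in the server block. A hypothetical default.conf, assuming the app should be served on port 3000 to match the task definition and target group:

```nginx
# default.conf (hypothetical) — the only important line is `listen`.
server {
    # NGINX defaults to port 80; change it to the container port declared
    # in the task definition (on awsvpc, host port must equal container port).
    listen 3000;

    location / {
        root  /usr/share/nginx/html;
        index index.html;
        try_files $uri $uri/ /index.html;
    }
}
```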

M. Gleria
1

Some possible solutions for ECS

  1. Verify that the security group inbound rules allow traffic to the ECS instance.
  2. Verify container networking and port mapping.
  3. Verify the target group health check endpoint; it should be correct and return a 200 status.
Satish Mali
  • Also verify if the load balancer security group allows outbound traffic, otherwise it won't be able to hit the target instances for the health check – Rafael Matsumoto May 03 '23 at 18:22
0

As mentioned by tschumann above, check the security group around the ECS cluster. If using Terraform, allow ingress to all docker ephemeral ports with something like below:

resource "aws_security_group" "ecs_sg" {
  name    = "ecs_security_group"
  vpc_id  = "${data.aws_vpc.vpc.id}"

}

resource "aws_security_group_rule" "ingress_docker_ports" {
  type              = "ingress"
  from_port         = 32768
  to_port           = 61000
  protocol          = "tcp" # "-1" (all protocols) requires from_port/to_port = 0, so name the protocol when giving a port range
  cidr_blocks       = ["${data.aws_vpc.vpc.cidr_block}"]
  security_group_id = "${aws_security_group.ecs_sg.id}"
}
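Note that the ephemeral range above applies to the EC2 launch type with dynamic host-port mapping. On Fargate (awsvpc mode) each task gets its own ENI and the host port equals the container port, so a sketch of the equivalent rule would open just that port — here assuming the container listens on 8080 and `aws_security_group.alb_sg` (a hypothetical name) is the load balancer's security group:

```hcl
# Hypothetical Fargate variant: allow only the container port, and only
# from the load balancer's security group.
resource "aws_security_group_rule" "ingress_container_port" {
  type                     = "ingress"
  from_port                = 8080
  to_port                  = 8080
  protocol                 = "tcp"
  source_security_group_id = aws_security_group.alb_sg.id
  security_group_id        = aws_security_group.ecs_sg.id
}
```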
bruce szalwinski
0

Possibly helpful for someone.. our target group health check path was set to /, which for our services pointed to Swagger and worked well. After updating to use Springfox instead of manually generating swagger.json, / now performs a 302 redirect to /swagger-ui.html, which caused the health check to fail. Since this was for a Spring Boot service we simply pointed the health check path in the target group to /health instead (OOTB Spring status page).
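This failure mode is easy to reproduce outside Spring: a root path that answers with a 302 fails the default matcher (200), while a dedicated health path passes. A minimal Node sketch of the same situation (paths and port are illustrative):

```typescript
import http from "node:http";

// "/" redirects (as Springfox does to /swagger-ui.html); "/health" returns 200.
const server = http.createServer((req, res) => {
  if (req.url === "/") {
    res.writeHead(302, { Location: "/swagger-ui.html" });
    res.end();
  } else if (req.url === "/health") {
    res.writeHead(200);
    res.end("UP");
  } else {
    res.writeHead(404);
    res.end();
  }
});

server.listen(3000, () => {
  // A health check pointed at "/" sees the 302 and fails the default 200
  // matcher; pointed at "/health" it sees 200 and passes.
  console.log("listening on :3000");
});
```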

James
0

The solution in iravinandan's answer is partially correct; in the last part of your Node.js route handler, simply return an explicit 200 status and that's it. Alternatively, you can set the expected status code under the advanced settings tab at the end of the target group's health check page.

app.get('/__health', (request, response) => response.status(200).end(""));

Regards

0

I had the same issue deploying a Java Spring Boot app on ECS running on Fargate. There were 3 issues I had to address to fix the problem; hopefully this helps others in future.

  1. The container was running on port 8080 (because of Tomcat), so the ELB, the target group and the two security groups (one for the ELB and one for ECS) had to allow 8080 in their inbound rules. The task definition also had to be revised to map the container to 8080.

  2. The port in the target group's health check section (advanced settings) had to be explicitly changed to 8080 instead of the default 80.

  3. I had to create a dummy health check path in the application, because pinging the root of the app at "/" resulted in a 302 status code.

Hope this helps.

M. Osama
0

I also faced the same issue while using AWS Fargate.

Here are some possible solutions to try:

  1. First check that the security group attached to the service has the inbound and outbound rules in place.
  2. If you are using a load balancer pointing at a target group, you must open the Docker container port on the security group, and allow inbound traffic only from the ALB security group.
  3. Also check the health check endpoint assigned to the target group: does it have any dependencies? It should return only a 200 status response (or whatever is specified in the target group).
NHol
0

In my case it was a security group rule which allowed connections only from a certain IP, and this was blocking health checks from the LB. I added the VPC's CIDR as another rule to the security group and then it worked.

Kirill G.
0

I followed AWS's guides below; my fix was that the ping path configured in the LB was incorrect with respect to the application.

https://docs.aws.amazon.com/AmazonECS/latest/userguide/troubleshoot-service-load-balancers.html

https://aws.amazon.com/premiumsupport/knowledge-center/ecs-fargate-health-check-failures/

muku
-1

In my case, my ECS Fargate service does not need a load balancer, so I removed the "Load Balancer" and "Security Group" and then it worked.

Keval Gangani