
We have been running ECS as our container orchestration layer for more than two years, but there is one problem we have not been able to figure out the reason for. In a few of our (Node.js) services we have started observing errors in ECS events such as:

service example-service (instance i-016b0a460d9974567) (port 1047) is unhealthy in target-group example-service due to (reason Request timed out)

This causes our dependent services to start experiencing 504 Gateway Timeout errors, which impacts them in a big way. Here is what we have tried so far:

  1. Upgraded the Docker storage driver from devicemapper to overlay2

  2. Increased the resources (CPU, RAM and EBS storage) for all ECS instances, since we saw resource pressure in a few containers

  3. Increased the health check grace period for the service from 0 to 240 seconds (a CLI sketch of this change is shown after this list)

  4. Increased KeepAliveTimeout and SocketTimeout to 180 seconds

  5. Enabled awslogs on containers instead of stdout, but there was no unusual behaviour

  6. Enabled ECSMetaData at the container level and pipelined all of that information into our application logs. This helped us look at the logs for the problematic container only.

  7. Enabled Container Insights for better container-level debugging
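
For reference, the grace period change from step 3 can be applied with the AWS CLI roughly like this (the cluster and service names match the ones in the logs below; adjust them for your setup):

# Sketch: give new tasks 240 seconds before load balancer health checks count against them
aws ecs update-service \
  --cluster example-service-cluster \
  --service example-service \
  --health-check-grace-period-seconds 240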

Of these, the changes that helped the most were upgrading from devicemapper to the overlay2 storage driver and increasing the health check grace period.

The number of errors has come down dramatically with these two changes, but we still get this issue once in a while.

We have looked at all the graphs related to the instance and container which went down; below are the logs for it:

ECS Container Insights logs for the victim container:

Query :

fields CpuUtilized, MemoryUtilized, @message
| filter Type = "Container" and EC2InstanceId = "i-016b0a460d9974567" and TaskId = "dac7a872-5536-482f-a2f8-d2234f9db6df"

Example log returned:

{
"Version":"0",
"Type":"Container",
"ContainerName":"example-service",
"TaskId":"dac7a872-5536-482f-a2f8-d2234f9db6df",
"TaskDefinitionFamily":"example-service",
"TaskDefinitionRevision":"2048",
"ContainerInstanceId":"74306e00-e32a-4287-a201-72084d3364f6",
"EC2InstanceId":"i-016b0a460d9974567",
"ServiceName":"example-service",
"ClusterName":"example-service-cluster",
"Timestamp":1569227760000,
"CpuUtilized":1024.144923245614,
"CpuReserved":1347.0,
"MemoryUtilized":871,
"MemoryReserved":1857,
"StorageReadBytes":0,
"StorageWriteBytes":577536,
"NetworkRxBytes":14441583,
"NetworkRxDropped":0,
"NetworkRxErrors":0,
"NetworkRxPackets":17324,
"NetworkTxBytes":6136916,
"NetworkTxDropped":0,
"NetworkTxErrors":0,
"NetworkTxPackets":16989
}

None of the logs showed CPU or memory utilisation that was unusually high.

We stopped getting responses from the victim container at, say, t1; we got errors in dependent services at t1+2 min, and the container was taken away by ECS at t1+3 min.

Our health check configuration is below:

Protocol            HTTP
Path                /healthcheck
Port                traffic port
Healthy threshold   10
Unhealthy threshold 2
Timeout             5 seconds
Interval            10 seconds
Success codes       200
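
For completeness, these target group settings correspond to an AWS CLI call along these lines (the target group ARN is a placeholder):

# Sketch: apply the health check settings above to the target group (placeholder ARN)
aws elbv2 modify-target-group \
  --target-group-arn <example-service-target-group-arn> \
  --health-check-protocol HTTP \
  --health-check-path /healthcheck \
  --health-check-port traffic-port \
  --healthy-threshold-count 10 \
  --unhealthy-threshold-count 2 \
  --health-check-timeout-seconds 5 \
  --health-check-interval-seconds 10 \
  --matcher HttpCode=200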

Let me know if you need any more information; I will be happy to provide it. The configuration we are running is:

docker info
Containers: 11
 Running: 11
 Paused: 0
 Stopped: 0
Images: 6
Server Version: 18.06.1-ce
Storage Driver: overlay2
 Backing Filesystem: xfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 468a545b9edcd5932818eb9de8e72413e616e86e
runc version: 69663f0bd4b60df09991c08812a60108003fa340
init version: fec3683
Security Options:
 seccomp
  Profile: default
Kernel Version: 4.14.138-89.102.amzn1.x86_64
Operating System: Amazon Linux AMI 2018.03
OSType: linux
Architecture: x86_64
CPUs: 16
Total Memory: 30.41GiB
Name: ip-172-32-6-105
ID: IV65:3LKL:JESM:UFA4:X5RZ:M4NZ:O3BY:IZ2T:UDFW:XCGW:55PW:D7JH
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

There should be some indication of resource contention, a service crash, or a genuine network failure to explain all this, but as mentioned, nothing we found pointed to the cause.

mohit3081989
  • Seems like an issue with the ECS agent, apart from my answer: https://github.com/aws/amazon-ecs-agent/issues/1872 – Adiii Sep 23 '19 at 11:18

2 Answers


Your steps 1 to 7 have almost nothing to do with the error.

service example-service (instance i-016b0a460d9974567) (port 1047) is unhealthy in target-group example-service due to (reason Request timed out)

The error is very clear: your ECS service is not reachable by the load balancer health check.

Target Group Unhealthy

When this is the case, go straight to checking the container security group, port, application status and health check status code.

Possible reasons:

  • There may be no route for the path /healthcheck in the backend service
  • The status code returned from /healthcheck is not 200
  • The target port may be configured incorrectly; if the application runs on port 8080 or 3000, the target port should be 8080 or 3000 accordingly
  • The security group is not allowing traffic on the target port
  • The application is not running in the container

These are the possible reasons when there is a timeout from the health check; a few quick checks for them are sketched below.
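
If you want to rule these out quickly, you can check them from the container instance itself, roughly along these lines (the instance ID and host port are the ones from the question; substitute your own values):

# Hit the health check route on the host port the target group registered (1047 in the error above)
curl -v http://localhost:1047/healthcheck

# Confirm the container is running and see which host port it is mapped to
docker ps --format '{{.Names}} {{.Ports}}'

# Inspect the security groups attached to the instance
aws ec2 describe-instances --instance-ids i-016b0a460d9974567 \
  --query 'Reservations[].Instances[].SecurityGroups'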

Adiii
  • Hi @adiii, these are all standard things to check and try. I have made it clear it's happening once in a while, which means it's working 99.9% of the time; that rules out all the possible reasons you have mentioned. I am very much aware of these, hence the question. Overlay2 does solve the problem to a large extent, because new containers sometimes get stuck as devicemapper is not performant in managing Docker container storage. Check https://github.com/moby/moby/issues/20401 for more detail. – mohit3081989 Sep 23 '19 at 10:23
  • Yes, I agree, but the error is from the health check. Maybe the process is crashing and consuming the CPU – Adiii Sep 23 '19 at 10:24
  • Are you able to SSH into your EC2 instance? – Adiii Sep 23 '19 at 10:26
  • As I said, I have seen all of that and shared the Container Insights data too; there is nothing unusual. I looked at my own application logs, container logs via awslogs on CloudWatch, and Container Insights, and none of them pointed to anything that pinpoints the problem. This is very unusual behaviour. All of my application and container logs showed that all the APIs were served within a short period of time before the container went down. If the health check itself were the issue, the ALB and ECS scheduler shouldn't have let the container run for 30 mins, which you can check from my health check settings. – mohit3081989 Sep 23 '19 at 10:28
  • Yes, I am very much able to SSH into my instance. Before the switch to the overlay2 storage driver I wasn't able to, because multiple containers were going down at once. After the overlay2 change only one container at a time goes down, and I am able to log in to my EC2 instance. – mohit3081989 Sep 23 '19 at 10:30
  • Do one more thing: increase the `Unhealthy threshold` to 10. Sometimes the application throws a bad gateway due to load or some other reason, so the threshold should be enough to handle those cases. Also decrease the `Healthy threshold` to 2 – Adiii Sep 23 '19 at 10:32
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/199835/discussion-between-adiii-and-mohit3081989). – Adiii Sep 23 '19 at 10:32
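
For reference, the threshold change suggested in the comment above maps to something like the following call (the target group ARN is a placeholder):

# Sketch: tolerate more failed checks before marking the target unhealthy (placeholder ARN)
aws elbv2 modify-target-group \
  --target-group-arn <example-service-target-group-arn> \
  --unhealthy-threshold-count 10 \
  --healthy-threshold-count 2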

I faced the same issue (reason: Request timed out). I managed to solve it by updating my security group inbound rules. There was no rule defined in the inbound rules, so I added a general allow-all IPv4 traffic rule for the time being, since I was in development at that time.

(screenshot: security group inbound rules)
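
A rule like the one described can be added with the AWS CLI roughly as follows (the security group ID is a placeholder; an allow-all rule is only reasonable during development and should be tightened afterwards):

# Sketch: temporary allow-all IPv4 inbound rule for development only (placeholder security group ID)
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --ip-permissions 'IpProtocol=-1,IpRanges=[{CidrIp=0.0.0.0/0}]'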

Shahzaib Butt