The EC2 instances in my AWS autoscaling group all terminate after 1-4 hours of running. The exact time varies, but when it happens, all of the instances in the group go down within minutes of each other.
The scaling history description for each is simply:
At 2016-08-26T05:21:04Z an instance was taken out of service in response to a EC2 health check indicating it has been terminated or stopped.
But I haven't added any health checks. And the EC2 status checks all pass for the life of the instance.
How do I determine what this "health check" failure actually means?
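For reference, this is roughly how I'm reading the scaling history (a boto3 sketch; the group name is a placeholder), including the Cause and Details fields in case they carry more context than the description above:

```python
# Sketch: dump the scaling activity history for the group
# (group name is a placeholder).
import boto3

asg = boto3.client("autoscaling")

resp = asg.describe_scaling_activities(AutoScalingGroupName="my-batch-asg")
for activity in resp["Activities"]:
    # 'Cause' and 'Details' sometimes add context beyond 'Description'
    print(activity["StartTime"], activity["StatusCode"])
    print(activity["Description"])
    print(activity["Cause"])
    print(activity.get("Details"))
```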
Most questions about ASG terminations lead back to the load balancer, but I have no load balancer. This cluster processes batch jobs, and the min/max/desired values are set by software based on the workload backlog elsewhere in the system.
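To be concrete about what "controlled by software" means, the controller does something roughly like this (names and the sizing logic are placeholders, not the real implementation):

```python
# Sketch: backlog-driven controller that resizes the group
# (group name, limits, and sizing formula are placeholders).
import boto3

asg = boto3.client("autoscaling")

def resize_group(group_name: str, backlog_size: int) -> None:
    desired = min(max(backlog_size // 10, 0), 50)  # illustrative sizing only
    asg.update_auto_scaling_group(
        AutoScalingGroupName=group_name,
        MinSize=0,
        MaxSize=50,
        DesiredCapacity=desired,
    )
```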
The ASG history does not show a scale-in event, AND the instances are all explicitly protected from scale-in.
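The scale-in protection is applied per instance, along these lines (group name and instance IDs are placeholders):

```python
# Sketch: mark instances as protected from scale-in
# (group name and instance IDs are placeholders).
import boto3

asg = boto3.client("autoscaling")

asg.set_instance_protection(
    AutoScalingGroupName="my-batch-asg",
    InstanceIds=["i-0123456789abcdef0"],
    ProtectedFromScaleIn=True,
)
```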
I tried setting the health check grace period to 20 hours to see if that would at least leave the instances up long enough for me to inspect them, but they still terminate.
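The grace-period change was made roughly like this (group name is a placeholder; 20 hours expressed in seconds):

```python
# Sketch: raise the health check grace period to 20 hours
# (group name is a placeholder).
import boto3

asg = boto3.client("autoscaling")

asg.update_auto_scaling_group(
    AutoScalingGroupName="my-batch-asg",
    HealthCheckGracePeriod=20 * 60 * 60,  # 20 hours in seconds
)
```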
The instances are running an ECS AMI, and ECS is running a single task, started at bootup, in a container. The logs from that task look normal, and things seem to be running happily until a few minutes before the instance vanishes.
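In case it's relevant, this is the kind of check I can run on the ECS side to see whether the agent still reports the container instances as connected (a sketch; the cluster name is a placeholder):

```python
# Sketch: check whether the ECS agent still reports container instances
# as ACTIVE and connected (cluster name is a placeholder).
import boto3

ecs = boto3.client("ecs")

cluster = "my-batch-cluster"
instance_arns = ecs.list_container_instances(cluster=cluster)["containerInstanceArns"]
if instance_arns:
    detail = ecs.describe_container_instances(
        cluster=cluster, containerInstances=instance_arns
    )
    for ci in detail["containerInstances"]:
        print(ci["ec2InstanceId"], ci["status"], ci["agentConnected"])
```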
The task is CPU-intensive, but the terminations still occur when I have it just sleep for six hours instead.
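The sleep experiment used a task definition roughly like this (family, image, and memory are placeholders; only the command matters):

```python
# Sketch: no-op task definition used for the sleep test
# (family, image, and memory size are placeholders).
import boto3

ecs = boto3.client("ecs")

ecs.register_task_definition(
    family="sleep-test",
    containerDefinitions=[
        {
            "name": "sleeper",
            "image": "busybox",
            "memory": 128,
            "command": ["sleep", str(6 * 60 * 60)],  # sleep for six hours
        }
    ],
)
```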