The EC2 instances in my AWS autoscaling group all terminate after 1-4 hours of running. The exact time varies, but when it happens, all of the instances in the group go down within minutes of each other.
The scaling history description for each is simply:
At 2016-08-26T05:21:04Z an instance was taken out of service in response to a EC2 health check indicating it has been terminated or stopped.
But I haven't added any health checks. And the EC2 status checks all pass for the life of the instance.
How do I determine what this "health check" failure actually means?
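For reference, this is roughly how I'm reading the scaling history (a boto3 sketch; the group name is a placeholder), including the Cause and Details fields in case they carry more context than the description above:

```python
# Sketch: dump the scaling activity history for the group
# (group name is a placeholder).
import boto3

asg = boto3.client("autoscaling")

resp = asg.describe_scaling_activities(AutoScalingGroupName="my-batch-asg")
for activity in resp["Activities"]:
    # 'Cause' and 'Details' sometimes add context beyond 'Description'
    print(activity["StartTime"], activity["StatusCode"])
    print(activity["Description"])
    print(activity["Cause"])
    print(activity.get("Details"))
```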
Most questions about ASG terminations lead back to the load balancer, but I have no load balancer. This cluster processes batch jobs, and the min/max/desired values are set by software based on the workload backlog elsewhere in the system.
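To be concrete about what "controlled by software" means, the controller does something roughly like this (names and the sizing logic are placeholders, not the real implementation):

```python
# Sketch: backlog-driven controller that resizes the group
# (group name, limits, and sizing formula are placeholders).
import boto3

asg = boto3.client("autoscaling")

def resize_group(group_name: str, backlog_size: int) -> None:
    desired = min(max(backlog_size // 10, 0), 50)  # illustrative sizing only
    asg.update_auto_scaling_group(
        AutoScalingGroupName=group_name,
        MinSize=0,
        MaxSize=50,
        DesiredCapacity=desired,
    )
```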
The ASG history does not show a scale-in event, AND the instances are all explicitly protected from scale-in.
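The scale-in protection is applied per instance, along these lines (group name and instance IDs are placeholders):

```python
# Sketch: mark instances as protected from scale-in
# (group name and instance IDs are placeholders).
import boto3

asg = boto3.client("autoscaling")

asg.set_instance_protection(
    AutoScalingGroupName="my-batch-asg",
    InstanceIds=["i-0123456789abcdef0"],
    ProtectedFromScaleIn=True,
)
```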
I tried setting the health check grace period to 20 hours to see if that would at least leave the instances up long enough for me to inspect them, but they still terminate.
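The grace-period change was made roughly like this (group name is a placeholder; 20 hours expressed in seconds):

```python
# Sketch: raise the health check grace period to 20 hours
# (group name is a placeholder).
import boto3

asg = boto3.client("autoscaling")

asg.update_auto_scaling_group(
    AutoScalingGroupName="my-batch-asg",
    HealthCheckGracePeriod=20 * 60 * 60,  # 20 hours in seconds
)
```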
The instances are running an ECS AMI, and ECS is running a single task, started at bootup, in a container. The logs from that task look normal, and things seem to be running happily until a few minutes before the instance vanishes.
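In case it's relevant, this is the kind of check I can run on the ECS side to see whether the agent still reports the container instances as connected (a sketch; the cluster name is a placeholder):

```python
# Sketch: check whether the ECS agent still reports container instances
# as ACTIVE and connected (cluster name is a placeholder).
import boto3

ecs = boto3.client("ecs")

cluster = "my-batch-cluster"
instance_arns = ecs.list_container_instances(cluster=cluster)["containerInstanceArns"]
if instance_arns:
    detail = ecs.describe_container_instances(
        cluster=cluster, containerInstances=instance_arns
    )
    for ci in detail["containerInstances"]:
        print(ci["ec2InstanceId"], ci["status"], ci["agentConnected"])
```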
The task is CPU-intensive, but the terminations still occur when I have it just sleep for six hours instead.
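The sleep experiment used a task definition roughly like this (family, image, and memory are placeholders; only the command matters):

```python
# Sketch: no-op task definition used for the sleep test
# (family, image, and memory size are placeholders).
import boto3

ecs = boto3.client("ecs")

ecs.register_task_definition(
    family="sleep-test",
    containerDefinitions=[
        {
            "name": "sleeper",
            "image": "busybox",
            "memory": 128,
            "command": ["sleep", str(6 * 60 * 60)],  # sleep for six hours
        }
    ],
)
```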