Challenge
Scale up of spot node group fail with AsgInstanceLaunchFailures, as it "Could not launch Spot Instances. UnfulfillableCapacity - Unable to fulfill capacity due to your request configuration. Please adjust your request and try again. Launching EC2 instance failed."
After that error, the node group is degraded and does not schedule new instances any longer.
How can I solve this, so the node group is working, once instances are available once again?
Setup
I made use of the terraform-aws-eks-blueprints-repo and build myself an EKS cluster. The cluster has the following managed node groups.
- spot - eu-central-1 - a
- spot - eu-central-1 - b
- spot - eu-central-1 - c
- ondemand - eu-central-1 - a
- ondemand - eu-central-1 - b
- ondemand - eu-central-1 - c
On top, I configured the cluster-autoscaler-priority-expander
to first use spot and then on demand.
Update 2022-05-13: I used just m5.large
and now added more types, to work around the problem. With this extended set, there seems no issue so far. I still would really much love to know how to solve this problem, as if SPOT is not available at all, my cluster would fail... which is not a good prospect.
Update 2022-05-19: I had a chat with AWS, and they claimed it is an issue which there is no solution so far. As the auto-scaling group is not "degraded" the cluster auto scaler just thinks it is. For me, this sounds like wanted barrier of entry .. so still, if someone has a solution, I would be open.