5

Challenge

Scaling up a spot node group fails with AsgInstanceLaunchFailures: "Could not launch Spot Instances. UnfulfillableCapacity - Unable to fulfill capacity due to your request configuration. Please adjust your request and try again. Launching EC2 instance failed."

After that error, the node group is marked as degraded and no longer launches new instances.

How can I fix this so that the node group works again once spot instances become available?

Image of error in EKS web interface

Setup

I used the terraform-aws-eks-blueprints repo to build an EKS cluster. The cluster has the following managed node groups:

  • spot - eu-central-1 - a
  • spot - eu-central-1 - b
  • spot - eu-central-1 - c
  • ondemand - eu-central-1 - a
  • ondemand - eu-central-1 - b
  • ondemand - eu-central-1 - c

On top of that, I configured the cluster-autoscaler-priority-expander to use spot first and then on-demand.

Update 2022-05-13: I was using only m5.large and have now added more instance types to work around the problem. With this extended set there have been no issues so far. I would still very much like to know how to solve this, because if no spot capacity is available at all, my cluster would fail, which is not a good prospect.

Update 2022-05-19: I had a chat with AWS, and they claimed this is an issue for which there is no solution so far. The auto scaling group itself is not "degraded"; the cluster autoscaler just thinks it is. To me this sounds like a deliberate barrier to entry, so if someone has a solution, I am still open to it.

  • I'm also getting the exact same "Unable to fulfill capacity" error in case of one of my ASGs for spot instances, also in eu-central-1. It's not clear what to change to fix this. Maybe an AWS service degradation? – nichoio May 11 '22 at 16:56
  • And also you might want to add the EC2 instance type(s) which fail for you. – nichoio May 11 '22 at 17:09
  • I updated the message regarding instance type. –  May 13 '22 at 11:01
  • Hitting a similar issue in around the same time frame. The AWS Console should say whether it's an AWS-side or a customer-side issue, and give some guidance on how to address it. – wxh Jun 09 '22 at 14:45
  • Is there any way to monitor the degraded status of Node groups in cloud watch? – sachin_ur Oct 21 '22 at 07:14
  • Not that I know but maybe you could trigger events with CloudWatch Insights –  Oct 21 '22 at 14:20
  • @griddev I get the same error message in the AWS console although I added quite a lot of instance types. It seems like the status is not updated anymore and stays in "Degraded". Is this just a visualization bug and the node group is still working? Did you find a solution for this? – PeteMac88 Feb 05 '23 at 10:29
  • It stays in this way forever and never takes a fresh spot instance ... shi... –  Feb 07 '23 at 13:23
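
Regarding the monitoring question in the comments: one option is to poll the node group yourself and alert on the result. The sketch below assumes the response shape of the EKS `DescribeNodegroup` API (`nodegroup.status` and `nodegroup.health.issues`). The helper names and the sample response are hypothetical, made up for illustration, and `fetch_nodegroup` needs boto3 and AWS credentials, so it is defined but not called here.

```python
def fetch_nodegroup(cluster_name, nodegroup_name):
    """Call EKS DescribeNodegroup (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the parsing below runs without it

    client = boto3.client("eks")
    return client.describe_nodegroup(
        clusterName=cluster_name, nodegroupName=nodegroup_name
    )


def degraded_issues(response):
    """Return (status, [(code, message), ...]) from a DescribeNodegroup response."""
    nodegroup = response["nodegroup"]
    issues = nodegroup.get("health", {}).get("issues", [])
    return nodegroup.get("status"), [
        (issue.get("code"), issue.get("message")) for issue in issues
    ]


# Hypothetical sample response, trimmed to the fields used above.
SAMPLE = {
    "nodegroup": {
        "status": "DEGRADED",
        "health": {
            "issues": [
                {
                    "code": "AsgInstanceLaunchFailures",
                    "message": "Could not launch Spot Instances. UnfulfillableCapacity",
                    "resourceIds": ["example-asg-name"],
                }
            ]
        },
    }
}

status, issues = degraded_issues(SAMPLE)
print(status, issues)
```

Running this on a schedule and publishing the result as a custom metric or notification would be one way to get an alert; whether that fits depends on your setup.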

2 Answers

0

According to the AWS documentation:

https://docs.aws.amazon.com/eks/latest/userguide/managed-node-groups.html

To maximize the availability of your applications while using Spot Instances, we recommend that you configure a Spot managed node group to use multiple instance types. We recommend applying the following rules when using multiple instance types:

Within a managed node group, if you're using the Cluster Autoscaler, we recommend using a flexible set of instance types with the same amount of vCPU and memory resources.

And

https://aws.amazon.com/premiumsupport/knowledge-center/eks-spot-instance-best-practices/

For example, for a m5.large (2 vCPU/8 GiB RAM) instance type, add ones with the same vCPU and RAM values, such as m5a.large, m5n.large, and m4.large.

The selected instance types should have the same vCPU and RAM values.
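
As a rough illustration of that rule, here is a minimal sketch that picks the instance types matching a reference type's vCPU and memory. The catalog is a small hand-entered list for illustration only (the `InstanceSpec` type and `same_shape` helper are hypothetical names); check sizes against the EC2 documentation before relying on them.

```python
from typing import NamedTuple


class InstanceSpec(NamedTuple):
    name: str
    vcpus: int
    memory_gib: float


# Hand-entered, illustrative catalog; not fetched from AWS.
CATALOG = [
    InstanceSpec("m5.large", 2, 8),
    InstanceSpec("m5a.large", 2, 8),
    InstanceSpec("m5n.large", 2, 8),
    InstanceSpec("m4.large", 2, 8),
    InstanceSpec("c5.large", 2, 4),
    InstanceSpec("m5.xlarge", 4, 16),
]


def same_shape(reference, catalog):
    """Return names of types with the same vCPU and memory as `reference`."""
    ref = next(spec for spec in catalog if spec.name == reference)
    return [
        spec.name
        for spec in catalog
        if spec.name != reference
        and spec.vcpus == ref.vcpus
        and spec.memory_gib == ref.memory_gib
    ]


print(same_shape("m5.large", CATALOG))
# → ['m5a.large', 'm5n.large', 'm4.large']
```

Keeping the shapes identical matters because the Cluster Autoscaler sizes a node group from a single node template, so mixed shapes can lead to wrong scheduling estimates.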

ob_dev
  • Sorry, but how does this help me solve the issue? The problem is that if no spot instances are available, the node groups stay degraded forever, even with a pool of instance types. –  Jul 22 '22 at 13:42
0

In the EC2 Auto Scaling group for the spot node group, edit the "Instance type requirements" section and add secondary instance types with the same CPU and RAM. Also set the allocation strategy to "Prioritize instance types" and enable "Capacity rebalancing".
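
For reference, the same console settings could be expressed as an API request. This is only a sketch: it builds the keyword arguments for boto3's `update_auto_scaling_group` and does not call AWS. I am assuming that the console's "Prioritize instance types" option corresponds to the `capacity-optimized-prioritized` Spot allocation strategy, and the function name, ASG name, and launch template ID are hypothetical placeholders. Since EKS owns the ASG behind a managed node group, direct edits like this may not persist.

```python
def build_spot_asg_update(asg_name, launch_template_id, instance_types):
    """Build kwargs for autoscaling.update_auto_scaling_group (not sent here)."""
    return {
        "AutoScalingGroupName": asg_name,
        # Console: "Capacity rebalancing"
        "CapacityRebalance": True,
        "MixedInstancesPolicy": {
            "LaunchTemplate": {
                "LaunchTemplateSpecification": {
                    "LaunchTemplateId": launch_template_id,
                    "Version": "$Latest",
                },
                # Order matters: earlier entries get higher priority.
                "Overrides": [{"InstanceType": t} for t in instance_types],
            },
            "InstancesDistribution": {
                # Assumed equivalent of "Prioritize instance types".
                "SpotAllocationStrategy": "capacity-optimized-prioritized",
            },
        },
    }


# Placeholder identifiers, for illustration only.
request = build_spot_asg_update(
    "example-spot-asg",
    "lt-0123456789abcdef0",
    ["m5.large", "m5a.large", "m5n.large", "m4.large"],
)
print(request["MixedInstancesPolicy"]["LaunchTemplate"]["Overrides"])
```

With credentials in place, the request would be sent with `boto3.client("autoscaling").update_auto_scaling_group(**request)`.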