
I am pretty new to Terraform and am trying to create a new EKS cluster with a node group and a launch template. The EKS cluster, node group, launch template, and nodes were all created successfully. However, when I changed the desired size of the node group (using Terraform or the AWS management console), the scale-up would fail. No error was reported in the node group's Health issues tab. I dug further and found that new instances were launched by the Auto Scaling group, but the new ones were not able to join the cluster.

Looking into the troubled instances, I found the following in the kubelet log by running "sudo journalctl -f -u kubelet":

Jan 27 19:32:32 ip-10-102-21-129.us-east-2.compute.internal kubelet[3168]: E0127 19:32:32.612322 3168 eviction_manager.go:254] "Eviction manager: failed to get summary stats" err="failed to get node info: node "ip-10-102-21-129.us-east-2.compute.internal" not found"

Jan 27 19:32:32 ip-10-102-21-129.us-east-2.compute.internal kubelet[3168]: E0127 19:32:32.654501 3168 kubelet.go:2427] "Error getting node" err="node "ip-10-102-21-129.us-east-2.compute.internal" not found"

Jan 27 19:32:32 ip-10-102-21-129.us-east-2.compute.internal kubelet[3168]: E0127 19:32:32.755473 3168 kubelet.go:2427] "Error getting node" err="node "ip-10-102-21-129.us-east-2.compute.internal" not found"

Jan 27 19:32:32 ip-10-102-21-129.us-east-2.compute.internal kubelet[3168]: E0127 19:32:32.776238 3168 kubelet.go:2352] "Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized"

Jan 27 19:32:32 ip-10-102-21-129.us-east-2.compute.internal kubelet[3168]: E0127 19:32:32.856199 3168 kubelet.go:2427] "Error getting node" err="node "ip-10-102-21-129.us-east-2.compute.internal" not found"

It looked like the issue had something to do with the CNI add-on. I googled it, and others suggested checking the logs inside the /var/log/aws-routed-eni directory. I could find that directory and its logs on the working nodes (the ones created initially when the EKS cluster was created), but the same directory and log files do not exist on the newly launched instances (the ones created after the cluster was up, by changing the desired node size).
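
To illustrate the comparison, these are the kinds of checks (run on the node over SSM/SSH, plus kubectl from a machine with cluster access) that show the difference between a working node and a broken one:

# On the node itself (via SSM or SSH):
ls /var/log/aws-routed-eni/      # ipamd.log and plugin.log exist only on the working nodes
ls /etc/cni/net.d/               # 10-aws.conflist is written by the CNI once it is running
sudo journalctl -u kubelet --no-pager | tail -n 50

# From a machine with cluster access:
kubectl get nodes -o wide        # the newly launched instances never show up here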

The image I used for the node group is ami-0af5eb518f7616978 (amazon/amazon-eks-node-1.24-v20230105).

Here is what my script looks like:

resource "aws_eks_cluster" "eks-cluster" {
  name = var.mod_cluster_name 
  role_arn = var.mod_eks_nodes_role
  version = "1.24"
  
  vpc_config {
    security_group_ids = [var.mod_cluster_security_group_id]
    subnet_ids = var.mod_private_subnets
    endpoint_private_access = "true"
    endpoint_public_access = "true"
  }
}
resource "aws_eks_node_group" "eks-cluster-ng" {
  cluster_name = aws_eks_cluster.eks-cluster.name
  node_group_name = "eks-cluster-ng"  
  node_role_arn = var.mod_eks_nodes_role
  subnet_ids = var.mod_private_subnets
  #instance_types = ["t3a.medium"]
   
   
  scaling_config {
    desired_size = var.mod_asg_desired_size
    max_size = var.mod_asg_max_size
    min_size = var.mod_asg_min_size
  }
  
  
  launch_template {
    #name   = aws_launch_template.eks_launch_template.name
    id          = aws_launch_template.eks_launch_template.id
    version     = aws_launch_template.eks_launch_template.latest_version
  }
  
  lifecycle {
    create_before_destroy = true
  }
}
resource "aws_launch_template" "eks_launch_template" {
  
  name = join("", [aws_eks_cluster.eks-cluster.name, "-launch-template"])

  vpc_security_group_ids = [var.mod_node_security_group_id]

  block_device_mappings {
    device_name = "/dev/xvda"

    ebs {
      volume_size = var.mod_ebs_volume_size 
      volume_type = "gp2"
      #encrypted   = false
    }
  }
  
  lifecycle {
    create_before_destroy = true
  }
  
  image_id = var.mod_ami_id
  instance_type = var.mod_eks_node_instance_type
  
  metadata_options {
    http_endpoint               = "enabled"
    http_tokens                 = "required"  
    http_put_response_hop_limit = 2 
  }

  user_data = base64encode(<<-EOF
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="==MYBOUNDARY=="

--==MYBOUNDARY==
Content-Type: text/x-shellscript; charset="us-ascii"

#!/bin/bash
set -ex

exec > >(tee /var/log/user-data.log|logger -t user-data -s 2>/dev/console) 2>&1

B64_CLUSTER_CA=${aws_eks_cluster.eks-cluster.certificate_authority[0].data}

API_SERVER_URL=${aws_eks_cluster.eks-cluster.endpoint}

K8S_CLUSTER_DNS_IP=172.20.0.10


/etc/eks/bootstrap.sh ${aws_eks_cluster.eks-cluster.name} --apiserver-endpoint $API_SERVER_URL --b64-cluster-ca $B64_CLUSTER_CA 

--==MYBOUNDARY==--\
  EOF
  )

  tag_specifications {
    resource_type = "instance"

    tags = {
      Name = "EKS-MANAGED-NODE"
    }
  }
}

Another thing I noticed is that I tagged the instance Name as "EKS-MANAGED-NODE". That tag showed up correctly on the nodes created when the EKS cluster was created. However, on any new nodes created afterward, the Name changed to "EKS-MANAGED-NODEGROUP-NODE".

I wonder if that indicates an issue?
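
Since that tag difference made me suspect the scaled-out instances might not come from the same launch template version, the node group and its Auto Scaling group can be inspected with something like the following (cluster name and ASG name are placeholders):

aws eks describe-nodegroup \
  --cluster-name <cluster-name> \
  --nodegroup-name eks-cluster-ng \
  --query 'nodegroup.{launchTemplate:launchTemplate,asg:resources.autoScalingGroups}'

aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names <asg-name-from-above> \
  --query 'AutoScalingGroups[].{launchTemplate:LaunchTemplate,mixedInstancesPolicy:MixedInstancesPolicy}'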

I checked the log and confirmed that the user data was picked up and ran when the instances started up.

sh-4.2$ more user-data.log

B64_CLUSTER_CA=LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUMvakNDQWVhZ0F3SUJBZ0lCQURBTkJna3Foa2lHOXcwQkFRc0ZBREFWTVJNd0VRWURWUVFERXdwcmRXSmwKY201bGRHVnpNQjRYRFRJek1ERXlOekU0TlRrMU1Wb1hEVE16TURFeU5E (deleted the rest)
API_SERVER_URL=https://EC283069E9FF1B33CD6C59F3E3D0A1B9.gr7.us-east-2.eks.amazonaws.com
K8S_CLUSTER_DNS_IP=172.20.0.10
Using kubelet version 1.24.7
true
Using containerd as the container runtime
true
‘/etc/eks/containerd/containerd-config.toml’ -> ‘/etc/containerd/config.toml’
‘/etc/eks/containerd/sandbox-image.service’ -> ‘/etc/systemd/system/sandbox-image.service’
Created symlink from /etc/systemd/system/multi-user.target.wants/containerd.service to /usr/lib/systemd/system/containerd.service.
Created symlink from /etc/systemd/system/multi-user.target.wants/sandbox-image.service to /etc/systemd/system/sandbox-image.service.
‘/etc/eks/containerd/kubelet-containerd.service’ -> ‘/etc/systemd/system/kubelet.service’
Created symlink from /etc/sy

I confirmed that the role being specified has all the required permissions; the same role is being used in another EKS cluster, and I am trying to create this new cluster based on that existing one using Terraform.
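
A related thing worth verifying is that the node role is mapped in the aws-auth ConfigMap, since that is what allows the kubelets to register (the role ARN below is just a placeholder):

kubectl -n kube-system get configmap aws-auth -o yaml

# The node role should show up under mapRoles, roughly like:
#   - rolearn: arn:aws:iam::<account-id>:role/<node-role-name>
#     username: system:node:{{EC2PrivateDNSName}}
#     groups:
#       - system:bootstrappers
#       - system:nodes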

I tried removing the launch template and letting AWS use the default one. Then the new nodes had no issue joining the cluster.
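
To see how the effective user data differs between a node launched from the default template and one launched from my custom template, the attribute can be pulled directly from EC2 and decoded (the instance ID is a placeholder):

aws ec2 describe-instance-attribute \
  --instance-id i-0123456789abcdef0 \
  --attribute userData \
  --query 'UserData.Value' \
  --output text | base64 -d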

I looked at my launch template script and at the registry documentation (https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/launch_template), and nowhere does it mention that I need to manually add or run the CNI plugin.

So I don't understand why the CNI plugin was not installed automatically and why the new instances are not able to join the cluster.
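
From what I can tell, the files under /var/log/aws-routed-eni are written by the aws-node (VPC CNI) DaemonSet pod, which only gets scheduled once a node has registered, so the missing logs may be a symptom of the node never joining rather than the cause. The DaemonSet and its pods can be checked with:

kubectl -n kube-system get daemonset aws-node
kubectl -n kube-system get pods -l k8s-app=aws-node -o wide   # expect one pod per joined node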

Any help is appreciated.

BrianY
  • Why do you want to use a custom launch template? What are you getting out of it vs the default one? In particular, the `user-data` looks problematic. – kenske Jan 27 '23 at 21:09
  • I need to install other monitoring tools on the instance. It is a requirement, so I have to use custom launch template to do so. What do you see as problem in the user-data section? – BrianY Jan 27 '23 at 22:24
  • The EKS bootstrap script doesn't need to be added to `user-data`, I wonder if your launch template is somehow conflicting with the built-in init script. Would it be possible to modify your `user-data` to only include the setup for the monitoring tools? – kenske Jan 27 '23 at 23:07
  • I will definitely give that a try and see how it goes. Thanks Kenske. – BrianY Jan 30 '23 at 16:07
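
For reference, a trimmed-down user_data along the lines kenske suggests (keeping the same MIME wrapper, with only the monitoring setup in the shell part) might look like this; the CloudWatch agent install is just a hypothetical placeholder for whatever monitoring tools are required:

#!/bin/bash
set -ex
exec > >(tee /var/log/user-data.log|logger -t user-data -s 2>/dev/console) 2>&1

# Monitoring tooling only; no call to /etc/eks/bootstrap.sh here.
yum install -y amazon-cloudwatch-agent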
