I was able to get a ManagedNodeGroup working with a custom LaunchTemplate that sets up swap, in Python. Below is what's working for me.
I was able to set up a swap file on each EC2 instance and start the kubelet so that it allows swap usage. However, I wasn't able to set the swapBehavior config option: the kubelet on EKS doesn't seem to recognize that key (nor the NodeSwap feature gate), despite documentation saying it should.
$ pulumi about
CLI
Version 3.46.1
Go Version go1.19.2
Go Compiler gc
Plugins
NAME VERSION
aws 5.7.2
eks 0.42.7
honeycomb 0.0.11
kubernetes 3.23.1
python 3.10.8
import base64
import json
from typing import Tuple

import pulumi
import pulumi_aws as aws
import pulumi_eks as eks
import pulumi_kubernetes as k8s

# EKS_CLUSTER_NAME, TEAM_MEMBERS, _CLUSTER_VPC, _CLUSTER_SUBNETS, and the
# _define_node_role helper are defined elsewhere in this stack.
_aws_account_id = aws.get_caller_identity().account_id
_K8S_VERSION = "1.23" # latest visible in above version of pulumi-eks
_NODE_ROOT_VOLUME_SIZE_GIB = 60
# Script to run on EKS nodes as root before EKS bootstrapping (which starts the kubelet)
#
# Make a 40GB swap file. This is a guess at allowing a few pods to overrun their
# requested RAM significantly.
# https://stackoverflow.com/questions/17173972/how-do-you-add-swap-to-an-ec2-instance
#
# Enable swap usage in the kubelet config, following the editing commands used in the
# bootstrap script.
# https://github.com/awslabs/amazon-eks-ami/blob/master/files/bootstrap.sh
# https://aws.amazon.com/premiumsupport/knowledge-center/eks-worker-nodes-image-cache/
# https://kubernetes.io/docs/reference/config-api/kubelet-config.v1beta1/
# This user data must be in MIME format when passed to a launch template.
# https://docs.aws.amazon.com/eks/latest/userguide/launch-templates.html
#
# From MNG launch template docs:
# "your user data is merged with Amazon EKS user data required for nodes to join the
# cluster. Don't specify any commands in your user data that starts or modifies kubelet."
# Inspecting the instance user data shows this script and the EKS-provided user data as
# separate MIME parts, with this one first.
#
# The swapBehavior key isn't recognized by the kubelet on EKS. Docs say it requires
# featureGates.NodeSwap=true, but the kubelet doesn't recognize that feature gate either.
# jq adds quotes around the "swapBehavior" key.
# It seems like the behavior defaults to limited swap: pods are killed at their
# resource limit, regardless of swap availability/usage.
# TODO set UnlimitedSwap if/when possible on AWS, using:
# echo "$(jq ".memorySwap={swapBehavior:\"UnlimitedSwap\"}" $KUBELET_CONFIG)" > $KUBELET_CONFIG
_NODE_USER_DATA_ADD_SWAP_AND_ENABLE_IN_KUBELET_CONFIG = r"""#!/bin/bash
set -e
# Use fallocate which is much faster than dd (essentially instant) since we do not
# care about the initial contents of the file.
fallocate -l 40G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
echo "/swapfile swap swap defaults 0 0" >> /etc/fstab
KUBELET_CONFIG=/etc/kubernetes/kubelet/kubelet-config.json
cp $KUBELET_CONFIG $KUBELET_CONFIG.orig
echo "$(jq ".failSwapOn=false" $KUBELET_CONFIG)" > $KUBELET_CONFIG
"""
_USER_DATA_MIME_HEADER = """MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="//"

--//
Content-Type: text/x-shellscript; charset="us-ascii"

"""
_USER_DATA_MIME_FOOTER = """
--//--
"""
def _wrap_and_encode_user_data(script_text: str) -> str:
    mime_encapsulated = _USER_DATA_MIME_HEADER + script_text + _USER_DATA_MIME_FOOTER
    encoded_bytes = base64.b64encode(mime_encapsulated.encode())
    return encoded_bytes.decode("latin1")
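# Optional sanity check (safe to delete): decode the wrapped user data and confirm
# the MIME structure and the script survive the round trip.
_decoded_check = base64.b64decode(_wrap_and_encode_user_data("#!/bin/bash\necho hello")).decode()
assert _decoded_check.startswith("MIME-Version: 1.0")
assert "#!/bin/bash" in _decoded_check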
def _define_cluster_and_get_provider() -> Tuple[eks.Cluster, k8s.Provider]:
    # https://www.pulumi.com/docs/guides/crosswalk/aws/eks/
    # https://www.pulumi.com/registry/packages/eks/api-docs/cluster/#cluster

    # Map AWS IAM users to the Kubernetes internal RBAC admin group. Mapping individual
    # users avoids having to go from a group to a role with assume-role policies.
    # Kubernetes has its own permissions (RBAC) system, with predefined groups for
    # common permission levels. AWS EKS provides translation from IAM to that, but we
    # must explicitly map particular users or roles that should be granted permissions
    # within the cluster.
    #
    # AWS docs: https://docs.aws.amazon.com/eks/latest/userguide/add-user-role.html
    # Detailed example: https://apperati.io/articles/managing_eks_access-bs/
    # IAM groups are not supported, only users or roles:
    # https://github.com/kubernetes-sigs/aws-iam-authenticator/issues/176
    user_mappings = []
    for username in TEAM_MEMBERS:
        user_mappings.append(
            eks.UserMappingArgs(
                # AWS IAM user to set permissions for
                user_arn=f"arn:aws:iam::{_aws_account_id}:user/{username}",
                # k8s RBAC group from which this IAM user will get permissions
                groups=["system:masters"],
                # k8s RBAC username to create for the user
                username=username,
            )
        )
    node_role = _define_node_role(EKS_CLUSTER_NAME)

    cluster = eks.Cluster(
        EKS_CLUSTER_NAME,
        name=EKS_CLUSTER_NAME,
        version=_K8S_VERSION,
        # Details of VPC usage for EKS:
        # https://docs.aws.amazon.com/eks/latest/userguide/network_reqs.html
        vpc_id=_CLUSTER_VPC,
        subnet_ids=_CLUSTER_SUBNETS,
        # The OpenID Connect provider maps from k8s to AWS identities.
        # Get the OIDC provider's ID with:
        #   aws eks describe-cluster --name <CLUSTER_NAME> --query "cluster.identity.oidc.issuer" --output text
        create_oidc_provider=True,
        user_mappings=user_mappings,
        skip_default_node_group=True,
        instance_role=node_role,
    )

    # Export the kubeconfig to allow kubectl to access the cluster. For example:
    #   pulumi stack output my-kubeconfig > kubeconfig.yml
    #   KUBECONFIG=./kubeconfig.yml kubectl get pods -A
    pulumi.export("my-kubeconfig", cluster.kubeconfig)
    # Work around cluster.provider being the wrong type for Namespace to use.
    # https://github.com/pulumi/pulumi-eks/issues/662
    provider = k8s.Provider(
        "my-cluster-provider",
        kubeconfig=cluster.kubeconfig.apply(lambda k: json.dumps(k)),
    )
    # Configure the startup script and root volume size to allow for swap.
    #
    # Changing the launch template (or the included user data script) will cause the
    # ManagedNodeGroup to replace nodes, which takes 10-15 minutes.
    launch_template = aws.ec2.LaunchTemplate(
        f"{EKS_CLUSTER_NAME}-launch-template",
        # Set the default device's size to allow for swap.
        block_device_mappings=[
            aws.ec2.LaunchTemplateBlockDeviceMappingArgs(
                device_name="/dev/xvda",
                ebs=aws.ec2.LaunchTemplateBlockDeviceMappingEbsArgs(
                    volume_size=_NODE_ROOT_VOLUME_SIZE_GIB,
                ),
            ),
        ],
        user_data=_wrap_and_encode_user_data(
            _NODE_USER_DATA_ADD_SWAP_AND_ENABLE_IN_KUBELET_CONFIG
        ),
        # The default version shows up first in the UI, so update it anyway even though
        # it isn't strictly needed since latest_version is used below.
        update_default_version=True,
        # Other settings, such as tags required for the node to join the group/cluster,
        # are filled in by default.
    )
    # The EC2 instances that the cluster will use to execute pods.
    # https://www.pulumi.com/registry/packages/eks/api-docs/managednodegroup/
    eks.ManagedNodeGroup(
        f"{EKS_CLUSTER_NAME}-managed-node-group",
        node_group_name=f"{EKS_CLUSTER_NAME}-managed-node-group",
        cluster=cluster.core,
        version=_K8S_VERSION,
        subnet_ids=_CLUSTER_SUBNETS,
        node_role=node_role,
        instance_types=["r6i.2xlarge"],
        scaling_config=aws.eks.NodeGroupScalingConfigArgs(
            min_size=1,
            desired_size=2,
            max_size=4,
        ),
        launch_template={
            "id": launch_template.id,
            "version": launch_template.latest_version,
        },
    )

    return cluster, provider
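For reference, a minimal sketch of how the returned provider can be used to create resources in the cluster (the namespace name below is just an example, not something from the stack above). This is the reason for creating the explicit k8s.Provider instead of using cluster.provider:

cluster, provider = _define_cluster_and_get_provider()
# Example resource created through the explicit provider (name is illustrative).
k8s.core.v1.Namespace(
    "example-namespace",
    metadata=k8s.meta.v1.ObjectMetaArgs(name="example-namespace"),
    opts=pulumi.ResourceOptions(provider=provider),
)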