
When running a pipeline job in a GitLab Runner Kubernetes pod, the job completes successfully only on a small instance such as m5*.large, which offers 2 vCPUs and 8 GB of RAM. We set limits for the build, helper, and service containers as shown below. Still, when running on a far more powerful instance such as m5d*.2xlarge, which offers 8 vCPUs and 32 GB of RAM, the job fails with an Out Of Memory (OOM) error: the node process is killed by the cgroup.

Note that we tried dedicating more resources to the containers, especially the build container, of which the node process is a child process, and nothing changed on the powerful instances: the node process still got OOM-killed. Each time we gave it more memory, the node process simply consumed more.

Also, regarding CPU usage: on the powerful instances, the more vCPUs we gave it, the more were consumed, and we noticed CPU throttling at ~100% almost all the time. On the small instances like m5*.large, CPU throttling never exceeded 3%.

Note that we specified a maximum amount of memory to be used by the node process, but it does not seem to take any effect. We tried setting it to 1 GB, 1.5 GB, and 3 GB.

NODE_OPTIONS: "--max-old-space-size=1536"
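
For reference, this option is passed to the job as an environment variable; a minimal sketch of a .gitlab-ci.yml variables block (the job name and entry script are placeholders, not our actual pipeline):

build:
  variables:
    # Caps only the V8 old-generation heap (~1.5 GB here), not the total memory (RSS) of the node process
    NODE_OPTIONS: "--max-old-space-size=1536"
  script:
    - node build.js   # placeholder entry point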

Node Version

v16.19.0

Platform

amzn2.x86_64


Logs of the host where the job runs

"message": "oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=....
....
"message": "Memory cgroup out of memory: Killed process 16828 (node) total-vm:1667604kB

Resource requests/limits configuration

memory_request = "1Gi"
memory_limit = "4Gi"
service_cpu_request = "100m"
service_cpu_limit = "500m"
service_memory_request = "250Mi"
service_memory_limit = "2Gi"
helper_cpu_request = "100m"
helper_cpu_limit = "250m"
helper_memory_request = "250Mi"
helper_memory_limit = "1Gi"
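
For context, these keys sit in the runner's config.toml under the Kubernetes executor section; a minimal sketch assuming the standard [runners.kubernetes] layout (values copied from above, list abbreviated):

[runners.kubernetes]
  # build container
  memory_request = "1Gi"
  memory_limit = "4Gi"
  # service and helper containers follow the same pattern, e.g.:
  service_memory_limit = "2Gi"
  helper_memory_limit = "1Gi"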

Resource consumption of a successful job running on m5d.large

(screenshot not reproduced)


Resource consumption of a failing job running on m5d.2xlarge

(screenshot not reproduced)

Rshad Zhran

1 Answer


When a process in a container tries to consume more than the allowed amount of memory, the kernel terminates the process that attempted the allocation with an out-of-memory (OOM) error.

Check whether you have enabled persistent journaling in your container(s).

One way: mkdir /var/log/journal && systemctl restart systemd-journald

Another way: configure it in journald.conf (see the systemd journald.conf documentation).

If not, and your container uses systemd, it will log to memory with limits derived from the host's RAM, which can lead to unexpected OOM situations.
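
A minimal sketch of the journald.conf approach mentioned above, assuming the standard /etc/systemd/journald.conf location (size caps are illustrative):

# /etc/systemd/journald.conf
[Journal]
Storage=persistent      # write logs to /var/log/journal instead of RAM-backed /run/log/journal
SystemMaxUse=200M       # illustrative cap on disk usage of the persistent journal
RuntimeMaxUse=64M       # illustrative cap on memory usage if storage falls back to volatile

Restart systemd-journald afterwards, as in the first option above.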

Also, if possible, increase the amount of RAM (clamav does use quite a bit).

If the node experiences an out of memory (OOM) event prior to the kubelet being able to reclaim memory, the node depends on the oom_killer to respond.

Node out-of-memory behavior is well described in Kubernetes best practices: Resource requests and limits. Adjust the memory requests (minimal threshold) and memory limits (maximal threshold) of your containers.
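
As an illustration of those two thresholds, a generic container-level block in a Pod spec (values mirror the question's build container, not the runner's actual generated Pod):

resources:
  requests:
    memory: "1Gi"   # minimal threshold, used for scheduling
  limits:
    memory: "4Gi"   # maximal threshold, enforced via cgroups and the OOM killer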

When Pods crash and the OS syslog shows the OOM killer terminating the container process, see Pod memory limit and cgroup memory settings. Kubernetes manages the Pod memory limit with cgroups and the OOM killer, so we need to be careful to separate OS-level OOM from Pod-level OOM.
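
To tell the two apart, you can check the effective cgroup limit from inside the build container; a hedged sketch covering both cgroup v1 and v2 paths:

# cgroup v1
cat /sys/fs/cgroup/memory/memory.limit_in_bytes
# cgroup v2
cat /sys/fs/cgroup/memory.max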

Try the --oom-score-adj option of docker run, or even --oom-kill-disable. Refer to Runtime constraints on resources for more info.
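
A minimal sketch of those flags (the image name is a placeholder; --oom-kill-disable should only be combined with a memory limit):

# make the container less likely to be OOM-killed than other processes
docker run --memory=4g --oom-score-adj=-500 my-node-image
# or opt it out of the OOM killer entirely (use with care)
docker run --memory=4g --oom-kill-disable my-node-image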

Also refer to the similar SO question for more related information.

Veera Nagireddy