I'm running a pod in Kubernetes with hugepages allocated on the host and hugepages requested in the pod spec. The Kubernetes worker is a VM, and that VM has hugepages allocated. The pod nevertheless fails to allocate hugepages: the application gets SIGBUS when it writes to the first hugepage allocation.

The pod definition includes hugepages:

    securityContext:
      allowPrivilegeEscalation: true
      privileged: true
      runAsUser: 0
      capabilities:
        add: ["SYS_ADMIN", "IPC_LOCK"]
    resources:
      requests:
        intel.com/intel_sriov_netdevice : 2
        memory: 2Gi
        hugepages-2Mi: 4Gi
      limits:
        intel.com/intel_sriov_netdevice : 2
        memory: 2Gi
        hugepages-2Mi: 4Gi
    volumeMounts:
    - mountPath: /sys
      name: sysfs
    - mountPath: /dev/hugepages
      name: hugepage
      readOnly: false
  volumes:
  - name: hugepage
    emptyDir:
      medium: HugePages
  - name: sysfs
    hostPath:
      path: /sys
    

The VM hosting the pod has hugepages allocated:

cat /proc/meminfo | grep -i hug
AnonHugePages:         0 kB
HugePages_Total:    4096
HugePages_Free:     4096
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB

The following piece of code runs fine in the VM hosting the pod: I can see the hugepage files being created in /dev/hugepages, and the HugePages_Free counter decreases while the process is running.

#include <stdio.h>
#include <sys/mman.h>
#include <errno.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>

#define LENGTH (2UL*1024*1024)                  /* one 2 MiB hugepage */
#define FILE_NAME "/dev/hugepages/hugepagefile"

/* Touch every byte so the hugepage backing the mapping is faulted in. */
static void write_bytes(char *addr)
{
        unsigned long i;

        for (i = 0; i < LENGTH; i++)
                *(addr + i) = (char)i;
}

int main()
{
   void *addr;
   int i;
   char buf[32];
   int fd;

   /* Map 16 hugepages, one file each on the hugetlbfs mount. */
   for (i = 0; i < 16; i++) {
           sprintf(buf, "%s_%d", FILE_NAME, i);
           fd = open(buf, O_CREAT | O_RDWR, 0755);
           addr = mmap((void *)(0x0UL), LENGTH, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_HUGETLB, fd, 0);

           printf("address returned %p \n", addr);

           if (addr == MAP_FAILED) {
                   perror("mmap ");
           } else {
                /* In the pod, this first write raises SIGBUS. */
                write_bytes(addr);
                //munmap(addr, LENGTH);
                //unlink(FILE_NAME);
           }
           close(fd);
   }
   /* Keep the process alive so the hugepages stay in use. */
   while (1) {}
   return 0;
}

But if I run the same code in the pod, I get a SIGBUS while trying to write to the first hugepage allocated.

Results on the VM (hosting the pod):

root@k8s-1:~# cat /proc/meminfo | grep -i hug
AnonHugePages:         0 kB
HugePages_Total:    4096
HugePages_Free:     4096
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
root@k8s-1:~# ./mmap  &
[1] 19428
root@k8s-1:~# address returned 0x7ffff7800000
address returned 0x7ffff7600000
address returned 0x7ffff7400000
address returned 0x7ffff7200000
address returned 0x7ffff7000000
address returned 0x7ffff6e00000
address returned 0x7ffff6c00000
address returned 0x7ffff6a00000
address returned 0x7ffff6800000
address returned 0x7ffff6600000
address returned 0x7ffff6400000
address returned 0x7ffff6200000
address returned 0x7ffff6000000
address returned 0x7ffff5e00000
address returned 0x7ffff5c00000
address returned 0x7ffff5a00000

root@k8s-1:~# cat /proc/meminfo | grep -i hug
AnonHugePages:         0 kB
HugePages_Total:    4096
HugePages_Free:     4080
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB

Results in the pod:

Program received signal SIGBUS, Bus error.
0x00005555555547cb in write_bytes ()
(gdb) where
#0  0x00005555555547cb in write_bytes ()
#1  0x00005555555548a6 in main ()
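
Not part of the original program, but useful for diagnosing this kind of failure: the sketch below probes the first hugepage with a SIGBUS handler installed, so an unsatisfiable hugepage fault is reported instead of killing the process. It assumes the same /dev/hugepages mount and 2 MiB page size used above; the probe file name is made up.

/* Minimal sketch: detect a failing hugetlb fault instead of crashing. */
#include <stdio.h>
#include <string.h>
#include <setjmp.h>
#include <signal.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

#define LENGTH (2UL * 1024 * 1024)              /* one 2 MiB hugepage */
#define PROBE_FILE "/dev/hugepages/hugepage_probe"

static sigjmp_buf probe_env;

/* The kernel delivers SIGBUS when a hugetlb page cannot be faulted in
 * (pool exhausted, or hugetlb cgroup limit exceeded as in this case). */
static void sigbus_handler(int sig)
{
        (void)sig;
        siglongjmp(probe_env, 1);
}

int main(void)
{
        struct sigaction sa;
        char *addr;
        int fd;

        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = sigbus_handler;
        sigaction(SIGBUS, &sa, NULL);

        fd = open(PROBE_FILE, O_CREAT | O_RDWR, 0600);
        if (fd < 0) {
                perror("open");
                return 1;
        }
        addr = mmap(NULL, LENGTH, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_HUGETLB, fd, 0);
        if (addr == MAP_FAILED) {
                perror("mmap");
                return 1;
        }

        if (sigsetjmp(probe_env, 1) == 0) {
                addr[0] = 1;            /* first write faults in the hugepage */
                printf("hugepage fault succeeded\n");
        } else {
                printf("SIGBUS: hugepage could not be faulted in\n");
        }

        munmap(addr, LENGTH);
        close(fd);
        unlink(PROBE_FILE);
        return 0;
}

Since mmap() succeeds in this scenario and the failure only shows up at fault time, catching the signal is a practical way to report the problem rather than dying with a bus error.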

  • This looks like a Kubernetes issue, or perhaps I'm not configuring things right. But the limit in /sys/fs/cgroup/hugetlb/kubepods/hugetlb.2MB.limit_in_bytes is set to 0. The following makes allocations work in the pod: `echo 9223372036854771712 | sudo tee hugetlb.2MB.limit_in_bytes` – emartin Jul 29 '19 at 16:27
  • Have you tried switching the OS on the node? It seems this is supported in k8s, but memory allocation can also depend on the underlying host. – yyyyahir Aug 14 '19 at 15:12
  • Got the exact same issue; thanks for your code sample, by the way. Dug around for a while without finding anything, and finally got rid of the behavior by restarting the kubelet service... Weird. – sknat Feb 27 '20 at 14:10

3 Answers


This is a known problem in K8s.

The culprit is that the kubelet doesn't update /sys/fs/cgroup/hugetlb/kubepods/hugetlb.2MB.limit_in_bytes on a Node Status Update, which happens every 5 minutes by default, even though it does report the node's hugepage resources correctly after hugepages are enabled on the host. This makes it possible to schedule a workload that uses hugepages onto a node with misconfigured limits in the root cgroup.
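
A quick way to confirm this on an affected node is to read that limit file, the same one referenced in the question's comments. The sketch below is an illustration of mine, not part of this answer; it assumes cgroup v1 with the hugetlb controller mounted at /sys/fs/cgroup/hugetlb and 2 MiB hugepages, so adjust the path for your setup.

/* Minimal sketch: print the 2 MiB hugetlb limit of the kubepods cgroup. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const char *path =
        "/sys/fs/cgroup/hugetlb/kubepods/hugetlb.2MB.limit_in_bytes";
    char line[64];
    FILE *f = fopen(path, "r");

    if (!f) {
        perror(path);
        return 1;
    }
    if (fgets(line, sizeof(line), f)) {
        unsigned long long limit = strtoull(line, NULL, 10);
        printf("%s = %llu\n", path, limit);
        /* A limit of 0 means the first hugepage fault in any pod under
         * kubepods fails, which surfaces as SIGBUS in the application. */
        if (limit == 0)
            printf("limit is 0: hugepage faults in pods will raise SIGBUS\n");
    }
    fclose(f);
    return 0;
}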

Some time ago I made this patch to K8s, but it never got accepted. You can try applying it to your K8s build if it is still applicable; if it isn't, I'd appreciate it if somebody else rebased it and submitted it again. I spent too much time trying to get it merged and have since switched to another project.

– versale

Try restarting the kubelet service.

– Mahadev

After configuring hugepages on the worker node, a kubelet restart is needed to inform Kubernetes of the hugepage resource. (I had the same issue when I tried this with the K8s 1.16 release; perhaps it works better now.)