
We run a small on-premise K8s cluster (based on the RKE stack): 1x etcd/control-plane node, 2x worker nodes. Components are:

  • OS: CentOS 7
  • Docker version: 19.3.9
  • K8s: 1.17.2

Another important fact: we're running a Rook-Ceph storage cluster on both worker nodes (Rook v1.2.4, Ceph 14.2.7).

When one of the OS mounts exceeds 90% usage (for example /var), K8s reports DiskPressure and cordons the node, which is expected. But when this happens, the CPU load grows to dozens (for example 30+, 40+ on a machine with 4 vCPUs), many container processes (children of containerd-shim) go into the zombie (defunct) state, and the whole K8s cluster collapses.
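When this recurs, it may help to confirm how many defunct processes accumulate and who their parents are; a minimal sketch using plain `ps`, with no k8s-specific assumptions:

```shell
# Count zombie (defunct) processes; on a healthy node this should be
# at or near 0, while during the incident it reportedly explodes:
zombies=$(ps -eo stat= | awk '$1 ~ /^Z/' | wc -l)
echo "zombie processes: $zombies"
```

During the incident, pairing this with `ps -eo stat,ppid,comm | awk '$1 ~ /^Z/'` shows which parents hold the zombies; if they are all unreaped containerd-shim children, the kubelet and containerd logs around the DiskPressure transition are the next place to look.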

At first we thought it was a Rook-Ceph problem with XFS storage (described at https://github.com/rook/rook/issues/3132#issuecomment-580508760), so we switched to EXT4 (because we cannot upgrade the kernel to 5.6+), but last weekend it happened again, and we are sure this case is related to the DiskPressure event. Last contact with the (already) dead node was on Jan 21 at 13:50, but the load started growing at 13:07 and quickly reached 30.5:

load 1m

/var usage crossed from 89.97% to 90%+ at exactly 13:07 that day:

/var usage

Can you point out what we should check in the k8s configuration, logs, or elsewhere to find out what is going on? Why does k8s collapse during such a routine event?

(For clarification: we know we're using quite old versions, but we'll do a complete upgrade of the environment within a few weeks.)
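For reference, the kubelet's default hard-eviction threshold for node storage is `nodefs.available<10%`, i.e. DiskPressure fires at roughly 90% used, which matches the 13:07 crossing described above. A quick sketch to spot mounts approaching that line (the 85% warning margin is our own choice, not a kubelet setting):

```shell
# List mount points at or above 85% usage; the kubelet's default hard
# eviction threshold (nodefs.available<10%) fires at ~90% used, so
# anything shown here is close to triggering DiskPressure:
near_full=$(df -P | awk 'NR > 1 { gsub("%", "", $5); if ($5 + 0 >= 85) print $6 " " $5 "%" }')
echo "mounts at/above 85%: ${near_full:-none}"
```

It is also worth confirming which eviction thresholds the kubelet actually runs with (RKE passes them as kubelet extra args), since custom thresholds would move the trigger point away from 90%.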

  • Any info on `dmesg` on the collapsing node of the cluster? – AndD Jan 24 '22 at 08:51
  • Are you running SSDs or spinny disks? Is it possible that the container, whatever it is doing, is thrashing spinny disks around and throwing the seek times through the roof? This seems to me like the CPU is spending most of its time waiting for the disks to respond. – Matt Clark Jan 24 '22 at 10:18
  • You mentioned vCPU - is this a machine running in AWS? Are you pushing yourself past your credit limits on any of your resources (EBS/CPU) and getting IO rate limited? – Matt Clark Jan 24 '22 at 10:20
  • And lastly, is there plenty of free space available on all disks? Ideally `>10%` – Matt Clark Jan 24 '22 at 10:22
  • @MattClark a lot of questions ;) You are right, the CPU spends a lot of time in I/O wait, but this happens only when the DiskPressure event occurs. Before that, the 1m CPU load at idle (when applications are started but do nothing) is much less than 1.0 (typically 0.3-0.5). Disks: SSD or spinny? Currently I'm not sure, but the external disk array we are using probably has spinning disks. There's nothing wrong with them when we run without a DiskPressure event. The cluster is ours, on-premise, built on a virtual environment on vSphere. All disks except /var are quite empty, utilization around 30% on every mount point. – Mariusz Jędrzejewski Jan 24 '22 at 11:38
  • @AndD there's a lot of logs from Ceph:
    Jan 21 12:08:19 s130l0167 kernel: libceph: mon1 10.43.41.170:6789 socket closed (con state OPEN)
    Jan 21 12:08:19 s130l0167 kernel: libceph: mon1 10.43.41.170:6789 session lost, hunting for new mon
    Jan 21 12:08:24 s130l0167 kernel: [1383434.847094] libceph: mon0 10.43.41.170:6789 socket closed (con state OPEN)
    Jan 21 12:08:24 s130l0167 kernel: [1383434.847094] libceph: mon0 10.43.41.170:6789 socket closed (con state OPEN)
    Jan 21 12:08:26 s130l0167 kernel: [1383436.961484] libceph: mon1 10.43.53.225:6789 socket closed (con state CONNECTING)
    – Mariusz Jędrzejewski Jan 24 '22 at 11:48
  • Do you have any system monitoring or logging enabled? You mentioned CentOS, so you have _sysfs_. I wrote [this post](https://stackoverflow.com/questions/38703598/how-would-i-check-how-busy-the-hdd-is-with-php/44379285#44379285) a while ago, maybe it can help you! – Matt Clark Jan 24 '22 at 20:05
  • I have a feeling that whatever the application is doing, it is better suited for a RAM disk or SSD. Seems to me like it's beating up your array, but you'll know for sure once you check some stats and see what it's doing. You can also use utilities like `iotop` to see instantaneous values. – Matt Clark Jan 24 '22 at 20:07
  • Are OSD Pods limited in memory or CPU usage? It may depend on the fact that when the node goes down, the other OSDs start working like crazy to reach the desired replica count of the data (and if they are not limited in resources, that could be a problem I think). – AndD Jan 25 '22 at 06:59
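Following up on the sysfs suggestion in the comments, disk busy time can be sampled directly from `/sys/block/<dev>/stat` (field 10, io_ticks, is milliseconds spent doing I/O); a rough sketch that picks the first block device purely for illustration:

```shell
# Sample io_ticks (field 10 of /sys/block/<dev>/stat, in ms) twice;
# the delta over the 1 s interval approximates how busy the disk was:
dev=$(ls /sys/block | head -n 1)
t1=$(awk '{ print $10 }' "/sys/block/$dev/stat")
sleep 1
t2=$(awk '{ print $10 }' "/sys/block/$dev/stat")
echo "$dev busy: $(( (t2 - t1) / 10 ))%"
```

`iostat -x 1` from the sysstat package reports the same figure as `%util`; a sustained 100% there coinciding with the 13:07 load spike would support the I/O-wait theory from the comments.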

0 Answers