
I'm using GCP's Cloud Notebook VMs. I have a VM with 200+ GB of RAM running and am attempting to download about 70 GB of data from BigQuery into memory using the BigQuery Storage API.
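Roughly, the notebook code looks something like this (the query is a placeholder, and the exact flag for enabling the Storage API read path may differ by client library version):

from google.cloud import bigquery

client = bigquery.Client()

# Pull the full result set into a pandas DataFrame in memory.
# create_bqstorage_client=True routes the download through the
# BigQuery Storage API instead of the REST API.
df = client.query(
    "SELECT * FROM `my_project.my_dataset.my_table`"  # placeholder query, ~70 GB of rows
).to_dataframe(create_bqstorage_client=True)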

Once it gets to around 50 GB, the kernel crashes.

Tailing the logs with sudo tail -20 /var/log/syslog, here's what I find:

Dec  2 13:35:57 pytorch-20200908-152245 kernel: [60783.550367] Task in /system.slice/jupyter.service killed as a result of limit of /system.slice/jupyter.service
Dec  2 13:35:57 pytorch-20200908-152245 kernel: [60783.563843] memory: usage 53350876kB, limit 53350964kB, failcnt 1708893
Dec  2 13:35:57 pytorch-20200908-152245 kernel: [60783.570582] memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0
Dec  2 13:35:57 pytorch-20200908-152245 kernel: [60783.578694] kmem: usage 110900kB, limit 9007199254740988kB, failcnt 0
Dec  2 13:35:57 pytorch-20200908-152245 kernel: [60783.585267] Memory cgroup stats for /system.slice/jupyter.service: cache:752KB rss:53239292KB rss_huge:0KB mapped_file:60KB dirty:0KB writeback:0KB inactive_anon:0KB active_anon:53239292KB inactive_file:400KB active_file:248KB unevictable:0KB
Dec  2 13:35:57 pytorch-20200908-152245 kernel: [60783.612963] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
Dec  2 13:35:57 pytorch-20200908-152245 kernel: [60783.621645] [  787]  1003   787    99396    17005      63       3        0             0 jupyter-lab
Dec  2 13:35:57 pytorch-20200908-152245 kernel: [60783.632295] [ 2290]  1003  2290     4996      966      14       3        0             0 bash
Dec  2 13:35:57 pytorch-20200908-152245 kernel: [60783.642309] [13143]  1003 13143  1272679    26639     156       6        0             0 python
Dec  2 13:35:58 pytorch-20200908-152245 kernel: [60783.652528] [ 5833]  1003  5833 16000467 13268794   26214      61        0             0 python
Dec  2 13:35:58 pytorch-20200908-152245 kernel: [60783.661384] [ 6813]  1003  6813     4996      936      14       3        0             0 bash
Dec  2 13:35:58 pytorch-20200908-152245 kernel: [60783.670033] Memory cgroup out of memory: Kill process 5833 (python) score 996 or sacrifice child
Dec  2 13:35:58 pytorch-20200908-152245 kernel: [60783.680823] Killed process 5833 (python) total-vm:64001868kB, anon-rss:53072876kB, file-rss:4632kB, shmem-rss:0kB
Dec  2 13:38:07 pytorch-20200908-152245 sync_gcs_service.sh[806]: GCS bucket is not specified in GCE metadata, skip GCS sync
Dec  2 13:39:03 pytorch-20200908-152245 bash[787]: [I 13:39:03.463 LabApp] Saving file at /outlog.txt

I followed the guidance in How to increase Jupyter notebook Memory limit? and allocated 100 GB of RAM, but it's still crashing at around 55 GB (53350964kB is the limit shown in the logs).
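(For reference, that guidance boils down to raising Jupyter's buffer size in ~/.jupyter/jupyter_notebook_config.py, roughly as below; note it is a Jupyter-level setting, not the systemd/cgroup limit that shows up in the logs.)

# ~/.jupyter/jupyter_notebook_config.py
c = get_config()

# ~100 GB, in bytes. Raises Jupyter's internal buffer limit only; it does not
# change the memory limit systemd places on jupyter.service.
c.NotebookApp.max_buffer_size = 100 * 1024 * 1024 * 1024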

How can I utilize the available memory of my machine? Thanks!

Tacking on what worked: raising this setting to a higher number:

/sys/fs/cgroup/memory/system.slice/jupyter.service/memory.limit_in_bytes
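You can verify the effective limit from inside the notebook itself; a quick sketch, assuming the cgroup v1 layout shown in the logs:

# Read the memory limit imposed on the jupyter.service cgroup (cgroup v1).
limit_path = "/sys/fs/cgroup/memory/system.slice/jupyter.service/memory.limit_in_bytes"
with open(limit_path) as f:
    limit_bytes = int(f.read())
print(f"jupyter.service memory limit: {limit_bytes / 1024**3:.1f} GiB")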

Jeff James

1 Answer


"Memory cgroup out of memory" here means the instance itself has sufficient memory, but the process is being killed because it hit the memory limit of its cgroup. This is common for virtualized and containerized workloads, as Docker containers can impose such limits; in this case the limit comes from the jupyter.service cgroup shown in the logs.

a) Identify the cgroup:

systemd-cgtop

b) Check the limit of the cgroup:

cat /sys/fs/cgroup/memory/[CGROUP_NAME]/memory.limit_in_bytes

c) Adjust the limit. For a Kubernetes Pod, edit the Pod's resource configuration; for a Docker container, adjust the container's memory limit. For a raw cgroup, update the limit directly:

echo [NUMBER_OF_BYTES] > /sys/fs/cgroup/memory/[CGROUP_NAME]/memory.limit_in_bytes

Mahboob