
My Python script is getting killed, with the following message in dmesg:

[Sat Dec  3 11:25:59 2022] Out of memory: Killed process 1126 (python) total-vm:17534768kB, anon-rss:14299092kB, file-rss:2752kB, shmem-rss:0kB, UID:1000 pgtables:28200kB oom_score_adj:0

It is a long-running script, taking a few hours to complete. I have two of these scripts running; I run them in a virtual environment, inside tmux.

Both scripts read CSV files into memory (around 2 GB maximum) and write them back out as smaller files, using pandas. As a script iterates through the files, I don't 'copy' or 'store' the data in memory once a loop iteration has completed.

The script largely follows this pattern:

import os
import pandas as pd

for root, dirs, files in os.walk(input_dir):
    for filename in files:
        df = pd.read_csv(os.path.join(root, filename))  # input_dir, group_col are placeholders
        smaller_dfs = df.groupby(group_col)
        for key, small_df in smaller_dfs:
            small_df.to_csv(f'{key}.csv')

With each iteration of the outer for loop, the previous CSV's data should not be retained in memory... I think.

The VM has 16GB of memory.

I also ran top a few times during script execution and saw total memory usage of around 2-3 GB, with around 7-8 GB used by 'buff/cache'. So it is surprising that the program suddenly runs out of memory.

Here is the table from dmesg where you can see the two python processes:

[Sat Dec  3 11:25:59 2022] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[Sat Dec  3 11:25:59 2022] [    184]     0   184    17344      979   147456        0          -250 systemd-journal
[Sat Dec  3 11:25:59 2022] [    217]     0   217     5054      988    65536        0         -1000 systemd-udevd
[Sat Dec  3 11:25:59 2022] [    342]     0   342    70036     4488    90112        0         -1000 multipathd
[Sat Dec  3 11:25:59 2022] [    413]   102   413    22665      554    77824        0             0 systemd-timesyn
[Sat Dec  3 11:25:59 2022] [    484]   100   484     6850      921    77824        0             0 systemd-network
[Sat Dec  3 11:25:59 2022] [    487]   101   487     6136     1444    94208        0             0 systemd-resolve
[Sat Dec  3 11:25:59 2022] [    522]     0   522    60290      929   102400        0             0 accounts-daemon
[Sat Dec  3 11:25:59 2022] [    523]     0   523      637      183    49152        0             0 acpid
[Sat Dec  3 11:25:59 2022] [    527]     0   527     2137      567    53248        0             0 cron
[Sat Dec  3 11:25:59 2022] [    529]   103   529     1894      915    49152        0          -900 dbus-daemon
[Sat Dec  3 11:25:59 2022] [    538]     0   538    20475      740    61440        0             0 irqbalance
[Sat Dec  3 11:25:59 2022] [    540]     0   540     7407     2833    90112        0             0 networkd-dispat
[Sat Dec  3 11:25:59 2022] [    542]     0   542    59108      902    98304        0             0 polkitd
[Sat Dec  3 11:25:59 2022] [    545]   104   545    56125      958    81920        0             0 rsyslogd
[Sat Dec  3 11:25:59 2022] [    547]     0   547   363271     1292   204800        0             0 amazon-ssm-agen
[Sat Dec  3 11:25:59 2022] [    555]     0   555   274152     4234   290816        0          -900 snapd
[Sat Dec  3 11:25:59 2022] [    557]     0   557     4336      998    69632        0             0 systemd-logind
[Sat Dec  3 11:25:59 2022] [    565]     0   565    98885     1209   135168        0             0 udisksd
[Sat Dec  3 11:25:59 2022] [    566]     0   566      951      560    49152        0             0 atd
[Sat Dec  3 11:25:59 2022] [    599]     0   599    78585      931   106496        0             0 ModemManager
[Sat Dec  3 11:25:59 2022] [    606]     0   606     1840      436    53248        0             0 agetty
[Sat Dec  3 11:25:59 2022] [    611]     0   611    13313      362    94208        0             0 nginx
[Sat Dec  3 11:25:59 2022] [    613]    33   613    13454      821    94208        0             0 nginx
[Sat Dec  3 11:25:59 2022] [    614]    33   614    13454      821    94208        0             0 nginx
[Sat Dec  3 11:25:59 2022] [    628]     0   628     1459      362    53248        0             0 agetty
[Sat Dec  3 11:25:59 2022] [    657]     0   657    27034     2733   114688        0             0 unattended-upgr
[Sat Dec  3 11:25:59 2022] [    750]     0   750     3046      938    61440        0         -1000 sshd
[Sat Dec  3 11:25:59 2022] [    780]     0   780     3452     1050    73728        0             0 sshd
[Sat Dec  3 11:25:59 2022] [    789]  1000   789     4731     1118    73728        0             0 systemd
[Sat Dec  3 11:25:59 2022] [    793]  1000   793    25976      822    98304        0             0 (sd-pam)
[Sat Dec  3 11:25:59 2022] [    919]  1000   919     3486      809    73728        0             0 sshd
[Sat Dec  3 11:25:59 2022] [    920]  1000   920     2543      976    61440        0             0 bash
[Sat Dec  3 11:25:59 2022] [   1050]     0  1050   365631     1936   225280        0             0 ssm-agent-worke
[Sat Dec  3 11:25:59 2022] [   1074]  1000  1074     2534     1281    65536        0             0 tmux: server
[Sat Dec  3 11:25:59 2022] [   1075]  1000  1075     2565      914    57344        0             0 bash
[Sat Dec  3 11:25:59 2022] [   1105]  1000  1105     2564      967    53248        0             0 bash
[Sat Dec  3 11:25:59 2022] [   1126]  1000  1126  4383692  3575461 28876800        0             0 python
[Sat Dec  3 11:25:59 2022] [   1131]  1000  1131     2752      785    69632        0             0 top
[Sat Dec  3 11:25:59 2022] [   1174]  1000  1174     2757      754    57344        0             0 top
[Sat Dec  3 11:25:59 2022] [   1278]     0  1278     3452     1040    61440        0             0 sshd
[Sat Dec  3 11:25:59 2022] [   1372]  1000  1372     3486      811    61440        0             0 sshd
[Sat Dec  3 11:25:59 2022] [   1373]  1000  1373     1473      564    45056        0             0 sftp-server
[Sat Dec  3 11:25:59 2022] [   1382]  1000  1382     2760      793    61440        0             0 top
[Sat Dec  3 11:25:59 2022] [   1404]  1000  1404     2760      794    65536        0             0 top
[Sat Dec  3 11:25:59 2022] [   1569]  1000  1569   436810   384689  3260416        0             0 python
[Sat Dec  3 11:25:59 2022] [   1578]  1000  1578     2760      817    57344        0             0 top

When I run free during execution, I see something like this:

              total        used        free      shared  buff/cache   available
Mem:       16240152     6257732     6280208         868     3702212     9695756
Swap:             0           0           0

  1. What steps can I take to work out why this is happening? In the table, it looks like the two python processes are nowhere near using the 16 GB of memory available.

  2. Is there some function I can call from the Python script that would make this less likely to happen, e.g. explicit garbage collection (along the lines of the sketch after this list)?

  3. Is there something I can do with swap? (free shows the VM currently has no swap configured.)
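
To make question 2 (and the measuring part of question 1) concrete, this is roughly the kind of cleanup and instrumentation I have in mind at the end of each outer iteration. It is only a sketch: input_dir and group_col are placeholders, and the ru_maxrss print is just there to track peak memory.

import gc
import os
import resource

import pandas as pd

for root, dirs, files in os.walk(input_dir):
    for filename in files:
        df = pd.read_csv(os.path.join(root, filename))
        smaller_dfs = df.groupby(group_col)
        for key, small_df in smaller_dfs:
            small_df.to_csv(f'{key}.csv')
        # drop references and ask the collector to free what it can
        del df, smaller_dfs
        gc.collect()
        # peak resident set size of this process so far (kB on Linux)
        print(filename, resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)

If the peak RSS keeps climbing across iterations even after the del/gc.collect(), that would at least narrow down where the memory is going.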

Thanks!

BYZZav
  • Looking at `total_vm` and `rss`, the second python process is much smaller, about 10x. So you might look at the differences in input first, and then analyze how the input data is processed. Maybe switch from a reading, processing, writing model to some sort of stream processing. – Olaf Dietsche Dec 03 '22 at 13:11
  • Yes, that's true - the second script generally deals with 70 MB files, whereas the script being killed deals with 500 MB-2 GB files. It's definitely more prone to termination, but the numbers from free/top and the dmesg output don't suggest memory usage anywhere near the 16 GB available. – BYZZav Dec 03 '22 at 13:14
  • Maybe do a `del smaller_dfs` after the inner loop? See also https://stackoverflow.com/q/8237647/1741542 or https://stackoverflow.com/q/26545051/1741542 – Olaf Dietsche Dec 03 '22 at 13:17
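
For reference, the stream-processing idea from the first comment could look roughly like this with pandas's chunked reader. This is only a sketch: chunksize, group_col and the header bookkeeping are illustrative, and it assumes appending to per-group output files is acceptable.

import os
import pandas as pd

written = set()
for root, dirs, files in os.walk(input_dir):
    for filename in files:
        # read each large CSV in pieces instead of loading it whole
        for chunk in pd.read_csv(os.path.join(root, filename), chunksize=100_000):
            for key, small_df in chunk.groupby(group_col):
                out = f'{key}.csv'
                # append so a group split across chunks still ends up in one file
                small_df.to_csv(out, mode='a', header=out not in written, index=False)
                written.add(out)

This keeps only one chunk of a file in memory at a time instead of a whole 500 MB-2 GB DataFrame.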

0 Answers