My Python script is getting Killed, with the following error message in dmesg:
[Sat Dec 3 11:25:59 2022] Out of memory: Killed process 1126 (python) total-vm:17534768kB, anon-rss:14299092kB, file-rss:2752kB, shmem-rss:0kB, UID:1000 pgtables:28200kB oom_score_adj:0
It is a long-running script, taking a few hours to complete. I have two of these scripts running; I run them in a virtual environment, inside tmux.
Both scripts read CSV files into memory (around 2 GB at most) and write them back out as smaller files, using pandas. As each script iterates through the files, I don't copy or store the data anywhere once an iteration of the loop completes.
The scripts largely follow this pattern:
import os
import pandas as pd

for root, dirs, files in os.walk(input_dir):         # input_dir stands in for the real path
    for name in files:
        df = pd.read_csv(os.path.join(root, name))
        # split the big frame and write each group out as a smaller CSV
        for key, small_df in df.groupby(group_col):   # group_col stands in for the real column
            small_df.to_csv(f"{key}.csv")
With each iteration of the outer loop, the previous CSV's data should not be retained in memory... I think.
The VM has 16GB of memory.
I also ran top a few times during execution: total memory usage was around 2-3 GB, with roughly 7-8 GB used by buff/cache. So it is unexpected that the program suddenly runs out of memory.
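To narrow this down, I could log the process's peak resident set size at the end of each iteration, e.g. with the standard-library resource module. A rough sketch (the helper name and where it gets called are just illustrative):

import resource

def log_peak_rss(label):
    # ru_maxrss is reported in kilobytes on Linux
    peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print(f"{label}: peak RSS {peak_kb / 1024:.0f} MiB", flush=True)

# e.g. call log_peak_rss(name) at the end of each iteration of the outer loop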
Here is the process table from dmesg, where you can see the two Python processes:
[Sat Dec 3 11:25:59 2022] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[Sat Dec 3 11:25:59 2022] [ 184] 0 184 17344 979 147456 0 -250 systemd-journal
[Sat Dec 3 11:25:59 2022] [ 217] 0 217 5054 988 65536 0 -1000 systemd-udevd
[Sat Dec 3 11:25:59 2022] [ 342] 0 342 70036 4488 90112 0 -1000 multipathd
[Sat Dec 3 11:25:59 2022] [ 413] 102 413 22665 554 77824 0 0 systemd-timesyn
[Sat Dec 3 11:25:59 2022] [ 484] 100 484 6850 921 77824 0 0 systemd-network
[Sat Dec 3 11:25:59 2022] [ 487] 101 487 6136 1444 94208 0 0 systemd-resolve
[Sat Dec 3 11:25:59 2022] [ 522] 0 522 60290 929 102400 0 0 accounts-daemon
[Sat Dec 3 11:25:59 2022] [ 523] 0 523 637 183 49152 0 0 acpid
[Sat Dec 3 11:25:59 2022] [ 527] 0 527 2137 567 53248 0 0 cron
[Sat Dec 3 11:25:59 2022] [ 529] 103 529 1894 915 49152 0 -900 dbus-daemon
[Sat Dec 3 11:25:59 2022] [ 538] 0 538 20475 740 61440 0 0 irqbalance
[Sat Dec 3 11:25:59 2022] [ 540] 0 540 7407 2833 90112 0 0 networkd-dispat
[Sat Dec 3 11:25:59 2022] [ 542] 0 542 59108 902 98304 0 0 polkitd
[Sat Dec 3 11:25:59 2022] [ 545] 104 545 56125 958 81920 0 0 rsyslogd
[Sat Dec 3 11:25:59 2022] [ 547] 0 547 363271 1292 204800 0 0 amazon-ssm-agen
[Sat Dec 3 11:25:59 2022] [ 555] 0 555 274152 4234 290816 0 -900 snapd
[Sat Dec 3 11:25:59 2022] [ 557] 0 557 4336 998 69632 0 0 systemd-logind
[Sat Dec 3 11:25:59 2022] [ 565] 0 565 98885 1209 135168 0 0 udisksd
[Sat Dec 3 11:25:59 2022] [ 566] 0 566 951 560 49152 0 0 atd
[Sat Dec 3 11:25:59 2022] [ 599] 0 599 78585 931 106496 0 0 ModemManager
[Sat Dec 3 11:25:59 2022] [ 606] 0 606 1840 436 53248 0 0 agetty
[Sat Dec 3 11:25:59 2022] [ 611] 0 611 13313 362 94208 0 0 nginx
[Sat Dec 3 11:25:59 2022] [ 613] 33 613 13454 821 94208 0 0 nginx
[Sat Dec 3 11:25:59 2022] [ 614] 33 614 13454 821 94208 0 0 nginx
[Sat Dec 3 11:25:59 2022] [ 628] 0 628 1459 362 53248 0 0 agetty
[Sat Dec 3 11:25:59 2022] [ 657] 0 657 27034 2733 114688 0 0 unattended-upgr
[Sat Dec 3 11:25:59 2022] [ 750] 0 750 3046 938 61440 0 -1000 sshd
[Sat Dec 3 11:25:59 2022] [ 780] 0 780 3452 1050 73728 0 0 sshd
[Sat Dec 3 11:25:59 2022] [ 789] 1000 789 4731 1118 73728 0 0 systemd
[Sat Dec 3 11:25:59 2022] [ 793] 1000 793 25976 822 98304 0 0 (sd-pam)
[Sat Dec 3 11:25:59 2022] [ 919] 1000 919 3486 809 73728 0 0 sshd
[Sat Dec 3 11:25:59 2022] [ 920] 1000 920 2543 976 61440 0 0 bash
[Sat Dec 3 11:25:59 2022] [ 1050] 0 1050 365631 1936 225280 0 0 ssm-agent-worke
[Sat Dec 3 11:25:59 2022] [ 1074] 1000 1074 2534 1281 65536 0 0 tmux: server
[Sat Dec 3 11:25:59 2022] [ 1075] 1000 1075 2565 914 57344 0 0 bash
[Sat Dec 3 11:25:59 2022] [ 1105] 1000 1105 2564 967 53248 0 0 bash
[Sat Dec 3 11:25:59 2022] [ 1126] 1000 1126 4383692 3575461 28876800 0 0 python
[Sat Dec 3 11:25:59 2022] [ 1131] 1000 1131 2752 785 69632 0 0 top
[Sat Dec 3 11:25:59 2022] [ 1174] 1000 1174 2757 754 57344 0 0 top
[Sat Dec 3 11:25:59 2022] [ 1278] 0 1278 3452 1040 61440 0 0 sshd
[Sat Dec 3 11:25:59 2022] [ 1372] 1000 1372 3486 811 61440 0 0 sshd
[Sat Dec 3 11:25:59 2022] [ 1373] 1000 1373 1473 564 45056 0 0 sftp-server
[Sat Dec 3 11:25:59 2022] [ 1382] 1000 1382 2760 793 61440 0 0 top
[Sat Dec 3 11:25:59 2022] [ 1404] 1000 1404 2760 794 65536 0 0 top
[Sat Dec 3 11:25:59 2022] [ 1569] 1000 1569 436810 384689 3260416 0 0 python
[Sat Dec 3 11:25:59 2022] [ 1578] 1000 1578 2760 817 57344 0 0 top
When I run free during execution, I see something like this:
total used free shared buff/cache available
Mem: 16240152 6257732 6280208 868 3702212 9695756
Swap: 0 0 0
What steps can I take to work out why this is happening? From the table, it would appear that the two Python processes are nowhere near using the 16 GB of memory available.
Is there some function I can call in the Python script that would make this less likely to happen, e.g. garbage collection?
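For example, would explicitly dropping the frame and forcing a collection at the end of every iteration help? Something like this (just a sketch of what I have in mind; input_dir and group_col are the same placeholders as above):

import gc
import os
import pandas as pd

for root, dirs, files in os.walk(input_dir):
    for name in files:
        df = pd.read_csv(os.path.join(root, name))
        for key, small_df in df.groupby(group_col):
            small_df.to_csv(f"{key}.csv")
        del df        # drop the reference to the large frame before reading the next file
        gc.collect()  # force an immediate collection instead of waiting for the collector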
Is there something I can do with swap? (The free output above shows none is configured.)
Thanks!