
I am running a simple Python script through the SLURM scheduler on an HPC cluster. It reads in a data set (approximately 6 GB) and plots and saves images of parts of the data. There are several of these data files, so I use a loop to iterate until I have plotted data from each file.

For some reason, however, memory usage increases with each loop iteration. I've checked my variables with getsizeof(), but their sizes don't change over iterations, so I'm not sure where this memory "leak" could be coming from.

Here's my script:

import os, psutil
import sdf_helper as sh
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as plticker
plt.rcParams['figure.figsize'] = [6, 4]
plt.rcParams['figure.dpi'] = 120 # 200 e.g. is really fine, but slower
from sys import getsizeof


for i in range(5,372):
    plt.clf()
    fig, ax = plt.subplots()
    # dd gets data using the EPOCH-specific SDF file reader sh.getdata
    dd = sh.getdata(i, '/dfs6/pub/user')
    # extract density data as 2D array
    den = dd.Derived_Number_Density_electron.data.T
    nmin = np.min(dd.Derived_Number_Density_electron.data[np.nonzero(dd.Derived_Number_Density_electron.data)])
    #extract grid points as 2D array
    xy = dd.Derived_Number_Density_electron.grid.data
    #extract single number time
    time = dd.Header.get('time')
    #free up memory from dd
    dd = None
    #plotting
    plt.pcolormesh(xy[0], xy[1],np.log10(den), vmin = 20, vmax = 30)
    cbar = plt.colorbar()
    cbar.set_label('Density in log10($m^{-3}$)')
    plt.title("time:   %1.3e s \n Min e- density:   %1.2e $m^{-3}$" %(time,nmin))
    ax.set_facecolor('black')
    plt.savefig('D00%i.png'%i, bbox_inches='tight')
    print("dd: ", getsizeof(dd))
    print("den: ",getsizeof(den))
    print("nmin: ",getsizeof(nmin))
    print("xy: ",getsizeof(xy))
    print("time: ",getsizeof(time))
    print("fig: ",getsizeof(fig))
    print("ax: ",getsizeof(ax))
    process = psutil.Process(os.getpid())
    print(process.memory_info().rss)

output

Reading file /dfs6/pub/user/0005.sdf
dd:  16
den:  112
nmin:  32
xy:  56
time:  24
fig:  48
ax:  48
8991707136

Reading file /dfs6/pub/user/0006.sdf
dd:  16
den:  112
nmin:  32
xy:  56
time:  24
fig:  48
ax:  48
13814497280

Reading file /dfs6/pub/user/0007.sdf
dd:  16
den:  112
nmin:  32
xy:  56
time:  24
fig:  48
ax:  48
18648313856

SLURM input script

#!/bin/bash

#SBATCH -p free
#SBATCH --job-name=epochpyd1
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --mem-per-cpu=20000


#SBATCH --mail-type=begin,end
#SBATCH --mail-user=**

module purge
module load python/3.8.0

python3 -u /data/homezvol0/user/CNTDensity.py > density.out

SLURM output

/data/homezvol0/user/CNTDensity.py:21: RuntimeWarning: divide by zero encountered in log10
  plt.pcolormesh(xy[0], xy[1],np.log10(den), vmin = 20, vmax = 30)
/export/spool/slurm/slurmd.spool/job1910549/slurm_script: line 16:  8004 Killed                  python3 -u /data/homezvol0/user/CNTDensity.py > density.out
slurmstepd: error: Detected 1 oom-kill event(s) in step 1910549.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.

As far as I can tell everything seems to be working, so I'm not sure what could be taking up more than 20 GB of memory.

EDIT: I began commenting out sections of the loop from the bottom up, and it's now clear that pcolormesh is the culprit.

I've added the following (from Closing pyplot windows) to the end of the loop:

fig.clear()
plt.clf()
plt.close('all')
fig = None
ax = None
del fig
del ax

But the memory keeps climbing no matter what. I'm at a total loss as to what's happening.


1 Answer


You're on the right track, having made it visible how much memory accumulates on each iteration. The next step in debugging is to think of hypotheses for where that memory could be accumulating and ways to test those hypotheses.

One hypothesis is that variables like den hold onto memory after each iteration. You can rule that out (and thus narrow in on the problem) by clearing those variables as the code already does via dd = None, deleting them via del den, or moving portions of the loop body into subroutines so their locals go away when those subroutines return. (Factoring out subroutines can also make those parts more reusable and easier to test.) This will rule out some possible causes, but I don't expect these variable assignments to accumulate memory over iterations; they would only do so if the code added data to a dict or a list on each iteration.
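
For example, a minimal sketch of that factoring, reusing your variable names (the helper name plot_one_file is just illustrative):

def plot_one_file(i):
    # all locals here are released when the function returns
    dd = sh.getdata(i, '/dfs6/pub/user')
    den = dd.Derived_Number_Density_electron.data.T
    xy = dd.Derived_Number_Density_electron.grid.data
    time = dd.Header.get('time')
    fig, ax = plt.subplots()
    ax.pcolormesh(xy[0], xy[1], np.log10(den), vmin=20, vmax=30)
    ax.set_title('time:   %1.3e s' % time)
    fig.savefig('D00%i.png' % i, bbox_inches='tight')
    plt.close(fig)  # make sure pyplot also drops its reference

for i in range(5, 372):
    plot_one_file(i)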

Another hypothesis is that state could be accumulating in matplotlib that doesn't get cleared by plt.clf(), or state could be accumulating in sdf_helper. I don't know enough about these libraries to offer direct insight, but their documentation should say how to clear out state. Even without knowing how to clear their state, we can think of ways to test these hypotheses, e.g. comment out the plt calls, or at least the data-intensive ones, then see if the memory still accumulates.
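
For instance, a stripped-down test loop along those lines (a sketch only; it reuses the paths and range from your script) would tell you whether the accumulation happens without any plotting at all:

for i in range(5, 372):
    dd = sh.getdata(i, '/dfs6/pub/user')  # read data only, no plotting
    den = dd.Derived_Number_Density_electron.data.T
    dd = None
    # if this still climbs, the accumulation is in sdf_helper, not matplotlib
    print(psutil.Process(os.getpid()).memory_info().rss)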

You might think of more hypotheses than I did. Brainstorming hypotheses first is a good approach, since one of them might be an obvious best candidate, or one might be much easier to test than the others.

Beware that there could be multiple causes of the accumulating memory, in which case fixing one cause will reduce the accumulation but won't eliminate it. Since you're measuring the memory accumulation, you'll be able to detect this. In many debugging situations we can't see the incremental contributions of multiple causes to a problem such as flaky results, so an alternate technique is to cut out everything that might be causing the problem, then add the pieces back one at a time.

Additions

Now that you've narrowed the problem to pcolormesh, the next step is reading the docs or tutorials on how matplotlib and pcolormesh use memory. Also, a web search for pcolormesh memory leak finds specific tips on this.

The easiest thing to try is to add a call to ax.cla() to clear the axes, as in this example.
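
Something along these lines at the end of each iteration (a sketch only; plt.close(fig) is the call that removes the figure from pyplot's registry):

ax.cla()        # clear the axes so the old QuadMesh is released
fig.clf()       # clear the whole figure
plt.close(fig)  # remove the figure from pyplot's registry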

You could switch from pyplot to matplotlib's object-oriented interface, which doesn't retain as much (if any) global state. In contrast, I think pyplot holds onto the fig and ax, in which case releasing your variables isn't enough to release those objects.
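
A rough sketch of the loop body in that style, using the same variables as your script (the key differences are calling methods on fig/ax and closing the figure explicitly):

fig, ax = plt.subplots()
mesh = ax.pcolormesh(xy[0], xy[1], np.log10(den), vmin=20, vmax=30)
cbar = fig.colorbar(mesh, ax=ax)
cbar.set_label('Density in log10($m^{-3}$)')
ax.set_facecolor('black')
ax.set_title("time:   %1.3e s \n Min e- density:   %1.2e $m^{-3}$" % (time, nmin))
fig.savefig('D00%i.png' % i, bbox_inches='tight')
plt.close(fig)  # explicitly release the figure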

Apparently imshow uses less memory and time than pcolormesh, assuming your data fits on a rectangular grid.
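
If your grid is uniform, an imshow version might look roughly like this (the extent values are an assumption, taken from the min/max of your xy grid arrays):

extent = [xy[0].min(), xy[0].max(), xy[1].min(), xy[1].max()]
im = ax.imshow(np.log10(den), origin='lower', extent=extent,
               aspect='auto', vmin=20, vmax=30)
fig.colorbar(im, ax=ax)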

Note Issue #1741, which recommends creating a pcolormesh just once and then setting its data on each loop iteration -- can you do mesh = plt.pcolormesh(...) once, then something like mesh.set_array(np.log10(den)) in each iteration? It also recommends calling cla().
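
In rough code that might look like this (untested; depending on your Matplotlib version and grid shape you may need to trim den to match the number of mesh cells):

fig, ax = plt.subplots()
mesh = None
for i in range(5, 372):
    dd = sh.getdata(i, '/dfs6/pub/user')
    den = dd.Derived_Number_Density_electron.data.T
    xy = dd.Derived_Number_Density_electron.grid.data
    dd = None
    if mesh is None:
        # create the QuadMesh (and colorbar) only once
        mesh = ax.pcolormesh(xy[0], xy[1], np.log10(den), vmin=20, vmax=30)
        fig.colorbar(mesh, ax=ax)
    else:
        # reuse it afterwards; set_array wants the flattened cell values
        mesh.set_array(np.log10(den).ravel())
    fig.savefig('D00%i.png' % i, bbox_inches='tight')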

  • Thanks for replying! I've made some edits. Looks like pcolormesh is the culprit. I've tried other ways to delete or free up the figure (https://stackoverflow.com/questions/11140787/closing-pyplot-windows) but the memory keeps climbing – ddwong Dec 14 '20 at 06:09
  • @ddwong I added some notes on pcolormesh memory retention from a web search and peeking into the Matplotlib docs. You might have to dive deeper into those docs than I did. – Jerry101 Dec 14 '20 at 22:39