
As my computer memory (in a digital sandbox environment) was only 8 GB (now 14 GB), I am trying to make my script more efficient. It is a script to analyze pictures and it works perfectly; however, at some point I ran into the infamous MemoryError. Basically I run a big loop to analyze some pictures, and at the end of each iteration I perform the following steps to clear out the memory:

del img, mask_naive  # ...plus many other large variables
plt.close()
gc.collect()
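
(A side note on the plt.close() call: without an argument it only closes the current figure, so if an iteration can open several figures, the others stay in memory. Closing everything is safer:)

plt.close('all')  # closes every open figure, not just the current one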

Each of these bits of code improved my script. Without them I could analyze ~2 pictures; with them I can process around 15 pictures, but then I still run into a MemoryError. Why this is still happening is, from what I understand, a much more complicated question, but at this point I am more focused on the solution.

After much troubleshooting, I found that one function in my script in particular is very memory-expensive. According to some pages on Stack Overflow, I should be able to make the script perform better if I run this piece of code in a subprocess. Unfortunately I am not too familiar with programming, and I have reached the point where I see no more progress and am forced to ask for help.

I have tried to pass data between the two scripts, which seems to be a problem as it is an array and not a string. Furthermore, I was able to put the problematic line in a second script and import it directly from the first script; however, the idea of a subprocess is that the code runs and exits, making it less memory-intensive.
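
For the array-versus-string problem specifically, one approach would be to hand the array to the second process through a file instead of through stdin. A minimal sketch, assuming img is a NumPy array and using temp_img.npy as an illustrative scratch file name:

# first.py: serialize the array to disk and point the worker at it
import numpy as np
import subprocess

np.save("temp_img.npy", img)
subprocess.run(["python", "second.py", "temp_img.npy"], check=True)

# second.py: load the array back
import sys
import numpy as np

img = np.load(sys.argv[1])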

I will not share the entire script, as there are many pre-processing, processing and data-harvesting lines that I think are not necessary to solve the problem. The major problem is at the cell marked In[4]:

from plantcv import plantcv as pcv
import os
from os import listdir
import csv
import matplotlib.pyplot as plt
import gc
import subprocess
from subprocess import Popen, PIPE

# In[2]:

files = os.listdir("directory with pictures")

# In[3]:
for x in files:
    img, path, filename = pcv.readimage(os.path.join("directory with pictures", x))

# In[4] HEAVY LINE IN CODE!:
    mask_naive = pcv.naive_bayes_classifier(img, pdf_file="classifier model")

# many processing steps on mask_naive variable

# In[78]:
    with open('csv file being updated at end of loop', mode='a') as employee_file:
        employee_writer = csv.writer(employee_file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
        # ...rows written here at the end of each loop...

# In[79]:
    pcv.print_image(wanted_img_variable, os.path.join("output directory", x + ".png"))

    del img, mask_naive  # ...plus many other large variables
    plt.close()
    gc.collect()

To avoid this cell In[4] in my code, I have tried to write a second .py script in a few ways, of which these are two examples:

from plantcv import plantcv as pcv
from first_script_name import img

mask_naive = pcv.naive_bayes_classifier(img, pdf_file="classifier model")

and

from plantcv import plantcv as pcv
from __main__ import img

mask_naive = pcv.naive_bayes_classifier(img, pdf_file="classifier model")

In first.py, instead of cell In[4], I have tried the following (among many other, probably less successful, attempts):

p1 = subprocess.run("second.py", shell=True, input=img, stdout=subprocess.PIPE, text=True, check=True)
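
For clarity, the direction I am trying to go (based on what I have read) is to hand each subprocess a file name rather than the array itself, so the worker process loads the image, does the heavy classification, writes its results to disk, and then exits, releasing all of its memory. A minimal sketch of that idea; the script and directory names are placeholders:

# first.py: start one fresh Python process per picture
import os
import subprocess

for x in os.listdir("directory with pictures"):
    subprocess.run(["python", "second.py", os.path.join("directory with pictures", x)], check=True)

# second.py: does the heavy work for a single picture, then exits
import sys
from plantcv import plantcv as pcv

img, path, filename = pcv.readimage(sys.argv[1])
mask_naive = pcv.naive_bayes_classifier(img, pdf_file="classifier model")
# ...the same processing and saving steps as in the big loop...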

It has to be noted that this script is normally run in a sandbox environment. When I run it there and look at the memory usage, I see the following:

(screenshot: memory usage in the sandbox)

The MemoryError also arises only in the sandbox environment. Each loop the memory usage slowly increases by approximately 0.15 GB until it reaches the maximum.

When I run the script in my home environment I get the following memory usage:

(screenshot: memory usage in the home environment)

There it does seem to increase, but overall it stays stable, and the script can loop indefinitely without problems.

I am not too familiar with memory management in a sandbox environment, but I think this could be playing a role as well. The desired outcome is to no longer run into the MemoryError. Who can guide me in the right direction?

Many thanks.

  • Why do you believe splitting into more processes will help the problem? Could you point to the specific thing you read that leads you to think that? – Charles Duffy Aug 05 '19 at 15:08
  • ...insofar as you've got memory that isn't being reaped by the garbage collector, yes, you can ensure that *everything* is reaped by forking off a different process per file (so the process holding the memory can just exit), but it'd be a lot easier to just find whatever's holding references and fix it. – Charles Duffy Aug 05 '19 at 15:10
  • Sure I can, I found this solution on this page: https://stackoverflow.com/questions/56126062/how-to-destroy-python-objects-and-free-up-memory – Rivered Aug 05 '19 at 15:10
  • Okay, yes -- so what that's basically doing is having a separate Python program per N images, so the program can completely exit (and thus free 100% of its RAM). Honestly, you don't need to use the subprocess module to get that effect at all -- start your program from a shell script that runs the Python program against one file at a time, and there you are. – Charles Duffy Aug 05 '19 at 15:11
  • ...so, to do that, remove the `os.listdir()`, and instead use `sys.argv` as your list of files to iterate over; then, you can at your shell do something like `for f in "directory with pictures"/*; do ./yourPythonScript "$f"; done`. – Charles Duffy Aug 05 '19 at 15:13
  • ...but if you *really* want to do that in Python, then `for x in list: subprocess.run(['second.py', os.path.join("directory with pictures", x)])` -- no `shell=True`, you'll note -- and have `second.py` read from `sys.argv[1]` to get the filename to process. – Charles Duffy Aug 05 '19 at 15:15
  • ...btw, one advantage to using all of `sys.argv[1:]` as your list to iterate over is that it gives the calling process control of how many pictures it hands to each subprocess, so you can do appropriate tuning. In shell, this might mean you do something like: `printf '%s\0' "$PWD/directory with pictures"/* | xargs -n 10 -0 ./your-python-script`, if it can go 10 without running out of memory; tune to fit. – Charles Duffy Aug 05 '19 at 15:18
  • Dear Charles, thank you for your help. I have been focusing on your first solution. Please note that I am working on a Windows machine, in the Anaconda prompt. Perhaps that is why it is not working properly. If I execute your code exactly as you described it, `for f in "D:/folder1/folder2"/*; do "D:/folder1/script_to_apply.py" "$f"; done`, I get the following error: `SyntaxError: invalid syntax`, with a `^` pointing at the `*` in `/*`. – Rivered Aug 05 '19 at 19:40
  • That happens when I execute the piece of code in Python; when I execute it in the Anaconda shell, I get the following error: `not expecting f at this moment`. – Rivered Aug 05 '19 at 19:47
  • Right -- I don't know Windows cmd, so the only shell advice I can give is for POSIX-compliant shells like bash. – Charles Duffy Aug 05 '19 at 20:07
  • Thank you, Charles Duffy, for suggesting running it outside of the Python script. I have posted my non-POSIX-compliant shell (Windows) answer below. – Rivered Aug 23 '19 at 17:01

1 Answer


This solution worked for me in the cmd environment of Windows:

for /F %i in ('dir "Directory_containing_files" /b /s') do (python Executed_script.py -i %i -o "Folder_to_write_output_to")
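
For completeness, Executed_script.py then needs to read the -i/-o flags that this loop passes it. A minimal sketch of that side, assuming argparse; the flag names match the command above, and everything else is illustrative:

# Executed_script.py: process a single image per invocation, then exit
import argparse
from plantcv import plantcv as pcv

parser = argparse.ArgumentParser()
parser.add_argument("-i", "--image", required=True, help="path to the input image")
parser.add_argument("-o", "--outdir", required=True, help="folder to write output to")
args = parser.parse_args()

img, path, filename = pcv.readimage(args.image)
mask_naive = pcv.naive_bayes_classifier(img, pdf_file="classifier model")
# ...the same processing steps as before, writing results into args.outdir...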