
I have a complex Python script. Inside a loop I call a function with multiprocessing, and inside that function I call an external program (pdfinfo) with subprocess.Popen.

My program runs for a while and I can see the VIRT memory steadily increasing (with the top command), until after some time the system runs out of memory and shows this message:

Traceback (most recent call last):
  File "classify_pdf.py", line 603, in <module>
    preprocessing_list[loop] = da.get_preprocessing_data(batch_files, metadata, cores)
  File "/home/student/.../src/data.py", line 87, in get_preprocessing_data
    properties = fp.pre_extract_pdf_properties(batch_files, cores)
  File "/home/student/.../src/features/pdf_properties.py", line 73, in pre_extract_pdf_properties
    pool = Pool(num_cores)
  File "/usr/lib/python3.5/multiprocessing/context.py", line 118, in Pool
    context=self.get_context())
  File "/usr/lib/python3.5/multiprocessing/pool.py", line 168, in __init__
    self._repopulate_pool()
  File "/usr/lib/python3.5/multiprocessing/pool.py", line 233, in _repopulate_pool
    w.start()
  File "/usr/lib/python3.5/multiprocessing/process.py", line 105, in start
    self._popen = self._Popen(self)
  File "/usr/lib/python3.5/multiprocessing/context.py", line 267, in _Popen
    return Popen(process_obj)
  File "/usr/lib/python3.5/multiprocessing/popen_fork.py", line 20, in __init__
    self._launch(process_obj)
  File "/usr/lib/python3.5/multiprocessing/popen_fork.py", line 67, in _launch
    self.pid = os.fork()
OSError: [Errno 12] Cannot allocate memory

After interrupting the process with Ctrl-C, many of the python processes are still running (I can see them with ps aux | grep python). There are thousands of them, and they even remain after I close the session with the server and log back in.

user1+  53872  0.0  0.0 5444552    0 ?        S    Aug29   0:00 python classify_pdf.py -fp /data/allfiles/ -repo
user1+  53873  0.0  0.0 5444552    0 ?        S    Aug29   0:00 python classify_pdf.py -fp /data/allfiles/ -repo
user1+  53876  0.0  0.0 5444552    0 ?        S    Aug29   0:00 python classify_pdf.py -fp /data/allfiles/ -repo

But how come there are still so many processes alive even after I interrupt my script? Does it have something to do with using multiprocessing and a subprocess inside a loop? Does the fork for Popen create additional processes? And if so, why don't they end?

BTW, the part of the code where this happens is

pool = Pool(num_cores)
res = pool.map(pdfinfo_get_pdf_properties, files)
pool.close()
pool.join()
res_fix = {}
for x in res:
    res_fix[splitext(basename(x[1]))[0]] = x[0]
return res_fix

and inside pdfinfo_get_pdf_properties this is called

output = subprocess.Popen(["pdfinfo", file_path],
                          stdout=subprocess.PIPE,
                          stderr=subprocess.PIPE).communicate()[0].decode(errors='ignore')
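For what it's worth, my understanding is that communicate() waits for the child and reaps it, so this call by itself shouldn't leave zombies. A stand-alone version of the same pattern (with echo substituted for pdfinfo so it runs anywhere) behaves as I'd expect:

```python
import subprocess

def get_output(cmd_args):
    # Same pattern as above: spawn the child, capture stdout/stderr,
    # and wait for it with communicate(), which also reaps the process.
    proc = subprocess.Popen(cmd_args,
                            stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE)
    out = proc.communicate()[0].decode(errors="ignore")
    return out, proc.returncode

out, rc = get_output(["echo", "hello"])   # out == "hello\n", rc == 0
```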
  • Possible duplicate of [When to call .join() on a process?](https://stackoverflow.com/questions/14429703/when-to-call-join-on-a-process) – stovfl Aug 30 '19 at 06:33
