I'm using Python to do some processing on text files and am running into MemoryErrors. Sometimes the file being processed is quite large, which means that a multiprocessing Process ends up using too much RAM.
Here is a snippet of my code:
import multiprocessing as mp
import os


def preprocess_file(file_path):
    with open(file_path, "r+") as f:
        file_contents = f.read()
        # modify the file_contents
        # ...
        # overwrite file
        f.seek(0)
        f.write(file_contents)
        f.truncate()


if __name__ == "__main__":
    with mp.Pool(mp.cpu_count()) as pool:
        pool_processes = []
        # for all files in dir
        for root, dirs, files in os.walk(some_path):
            for f in files:
                pool_processes.append(os.path.join(root, f))
        # start the processes
        pool.map(preprocess_file, pool_processes)
I have tried using the resource package to set a limit on how much RAM each process can use, as shown below, but this hasn't fixed the issue. I still get MemoryErrors being raised, which leads me to believe it's the pool.map call that is causing issues. I was hoping to have each process deal with the exception individually so that the file could be skipped rather than crashing the whole program.
import multiprocessing as mp
import os
import resource


def preprocess_file(file_path):
    try:
        hard = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")  # total bytes of RAM in machine
        soft = (hard - 512 * 1024 * 1024) // mp.cpu_count()  # split between each cpu and save 512MB for the system
        resource.setrlimit(resource.RLIMIT_AS, (soft, hard))  # apply limit
        with open(file_path, "r+") as f:
            # ...
    except Exception as e:  # bad practice - should be more specific but just a placeholder
        # ...
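For reference, a minimal check along these lines (the _report_limit helper is just an illustrative name I'd call at the top of preprocess_file, not part of my real code) should show what limit each worker actually ends up with:

import os
import resource


def _report_limit():
    # Debug-only helper: read back the address-space limit after setrlimit
    # so I can see what each worker process actually got.
    soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    print(f"pid={os.getpid()} RLIMIT_AS soft={soft} hard={hard}")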
How can I let an individual process run out of memory while letting the other processes continue unaffected? Ideally I want to catch the exception within the preprocess_file function so that I can log exactly which file caused the error.
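To make it concrete, the behaviour I'm after would look roughly like this (only a sketch; the logging call and the (path, flag) return values are placeholders I've made up, not working code):

import logging


def preprocess_file(file_path):
    try:
        with open(file_path, "r+") as f:
            file_contents = f.read()
            # modify the file_contents
            # ...
            f.seek(0)
            f.write(file_contents)
            f.truncate()
        return file_path, True
    except MemoryError:
        # Skip this file but record which one failed, without killing the worker.
        logging.exception("Ran out of memory while processing %s", file_path)
        return file_path, False

The results from pool.map would then tell me which files were skipped.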
Edit: The preprocess_file function does not share data with any other processes, so there is no need for shared memory. The function also needs to read the entire file at once, as the file is reformatted in a way that cannot be done line by line.
Edit 2: The traceback from the program is below. As you can see, the error doesn't actually point to the file being run, and instead comes from the multiprocessing package's own files.
Process ForkPoolWorker-2:
Traceback (most recent call last):
File "/usr/lib64/python3.6/multiprocessing/pool.py", line 125, in worker
File "/usr/lib64/python3.6/multiprocessing/queues.py", line 341, in put
File "/usr/lib64/python3.6/multiprocessing/reduction.py", line 51, in dumps
File "/usr/lib64/python3.6/multiprocessing/reduction.py", line 39, in __init__
MemoryError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib64/python3.6/multiprocessing/process.py", line 258, in _bootstrap
File "/usr/lib64/python3.6/multiprocessing/process.py", line 93, in run
File "/usr/lib64/python3.6/multiprocessing/pool.py", line 130, in worker
File "/usr/lib64/python3.6/multiprocessing/queues.py", line 341, in put
File "/usr/lib64/python3.6/multiprocessing/reduction.py", line 51, in dumps
File "/usr/lib64/python3.6/multiprocessing/reduction.py", line 39, in __init__
MemoryError