Summary:
I'm trying to use multiprocess and multiprocessing to parallelise work with the following attributes:
- A shared data structure
- Multiple arguments passed to a function
- Setting the number of processes based on the current system (see the sketch after this list)
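For the last point, a more direct way to count physical cores (an alternative I've seen suggested, assuming the third-party psutil package is acceptable) would be:

import psutil

# Hypothetical alternative to cpu_count()//2: ask psutil for the physical
# core count directly instead of assuming hyperthreading doubles it.
n_procs = psutil.cpu_count(logical=False)  # e.g. 4 on a quad-core machine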
Errors:
My approach works for a small amount of work but fails with the following on larger tasks:
OSError: [Errno 24] Too many open files
Solutions tried:
Running on a macOS Catalina system, ulimit -n gives 1024 within PyCharm. Is there a way to avoid having to change ulimit? I want to avoid this because the code should ideally work out of the box on various systems.
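For what it's worth, the limit can at least be read (and, up to the hard limit, raised) from within Python via the standard resource module on Unix; this only shows where the 1024 comes from, it is not the portable fix I am after:

import resource

# Unix only: inspect the per-process file-descriptor limits.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f'soft limit={soft}, hard limit={hard}')  # soft limit=1024 here

# The soft limit could be raised as far as the hard limit without root,
# but that is exactly the per-system tweak I would like to avoid:
# resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))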
I've seen related questions like this thread whose comments recommend using .join and gc.collect; other threads recommend closing any opened files, but I do not open files in my code.
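My reading of the .join advice is a pruning step like the one below inside the spawning loop (my untested paraphrase of those comments; Process.close() is the stdlib 3.7+ call, which I assume multiprocess mirrors):

# Reap finished workers while the loop runs so their pipe file descriptors
# are released early, instead of joining everything only at the end.
for p in [w for w in job if not w.is_alive()]:
    p.join()
    p.close()  # Python >= 3.7; frees the Process's resources explicitly
    job.remove(p)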
import gc
import time
import numpy as np
from math import pi
from multiprocess import Process, Manager
from multiprocessing import Semaphore, cpu_count
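
# Worker: append one computed value to the shared list, then free a semaphore slot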
def do_work(element, shared_array, sema):
    shared_array.append(pi*element)
    gc.collect()
    sema.release()
# example_ar = np.arange(1, 1000) # works
example_ar = np.arange(1, 10000) # fails
# Parallel code
start = time.time()
# Instantiate a manager object and share a data structure
manager = Manager()
shared_ar = manager.list()
# Create a semaphore with one slot per physical core (1/2 of reported cpu_count)
sema = Semaphore(cpu_count()//2)
job = []
# Loop over every element and start a job
for e in example_ar:
    sema.acquire()
    p = Process(target=do_work, args=(e, shared_ar, sema))
    job.append(p)
    p.start()
_ = [p.join() for p in job]
end_par = time.time()
# Serial code equivalent
single_ar = []
for e in example_ar:
    single_ar.append(pi*e)
end_single = time.time()
print(f'Parallel work took {end_par-start} seconds, result={sum(list(shared_ar))}')
print(f'Serial work took {end_single-end_par} seconds, result={sum(single_ar)}')
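For completeness, the direction I am currently leaning towards: a fixed-size pool instead of one Process per element, so the number of open pipes stays constant. This is an untested sketch, and it assumes multiprocess.Pool mirrors multiprocessing.Pool and that dill can serialise the lambda:

from multiprocess import Pool

# Untested sketch: a fixed pool of cpu_count()//2 workers replaces the
# per-element Process objects above.
with Pool(cpu_count() // 2) as pool:
    pooled_ar = pool.map(lambda e: pi * e, example_ar)
print(f'Pool result={sum(pooled_ar)}')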