I'm currently using Python's multiprocessing module with a Pool to run a function millions of times in parallel. While multiprocessing works, the function is so lightweight that barely 30% of each core is used, and the workers only max out during lock acquisition. Looking at my script's profile, locking is indeed the most expensive part.
Because each function call is so short, the overhead of locking every time an item is dispatched to the function outweighs the work itself; in fact, I get better performance running it serially (15 mins parallelized vs. 4.5 mins serial).
The function writes to independent files, so the calls are completely independent. Is it possible to 'mimic' running/calling the same parallelized Python script multiple times (with different inputs) to make better use of the CPU?
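Concretely, what I have in mind by 'calling the script multiple times' is something like the rough sketch below (parse_script.py and passing file paths on the command line are hypothetical; the real script would have to accept its inputs that way):

import subprocess
import sys
from multiprocessing import cpu_count

# Hypothetical sketch: launch one independent copy of a (hypothetical) parse_script.py
# per core, each with its own round-robin slice of the input files.
# For millions of inputs the slices would realistically be passed via a file, not argv.
n = cpu_count()
slices = [pubfiles[i::n] for i in range(n)]
procs = [subprocess.Popen([sys.executable, 'parse_script.py', *chunk]) for chunk in slices]
for p in procs:
    p.wait()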
Current Code:
from multiprocessing import Pool, cpu_count, Lock
import tqdm

pool = Pool(cpu_count(), initializer=tqdm.tqdm.set_lock, initargs=(Lock(),))
for _ in tqdm.tqdm(pool.imap_unordered(parallel_process, pubfiles, chunksize=70), total=nfiles, desc='Parsing files'):
    pass
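An equivalent idea that stays inside one script would be to give each worker a single large slice of the file list, so there is one dispatch per process instead of one per file (sketch only; parse_batch is a hypothetical wrapper that just loops the existing parallel_process over its slice):

from multiprocessing import Pool, cpu_count

def parse_batch(file_list):
    # Hypothetical wrapper: run the existing per-file function over a whole slice.
    for f in file_list:
        parallel_process(f)

if __name__ == '__main__':
    n = cpu_count()
    slices = [pubfiles[i::n] for i in range(n)]   # one slice per core
    with Pool(n) as pool:
        pool.map(parse_batch, slices)             # a single task per worker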
EDIT:
To confirm it has nothing to do with tqdm's locking, modifying the code to the following exhibits the same issue:
pool = Pool(cpu_count())
for i in pool.imap_unordered(parallel_process, files, chunksize=70):
    print(i)
I've profiled my code for a while, and the most expensive calls appear to be related to locking (?)/multiprocessing in general; the actual function sits very near the bottom of the cumulative time.
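For reference, a profile like that can be produced roughly as follows (sketch; main() is a hypothetical wrapper around the pool loop above, and note that cProfile only measures the parent process, i.e. the dispatch/locking side):

import cProfile
import pstats

cProfile.run('main()', 'dispatch.prof')   # main() = hypothetical wrapper around the pool loop
pstats.Stats('dispatch.prof').sort_stats('cumulative').print_stats(20)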