I'm working on an analysis that requires fitting a model separately to each of multiple data sources (on the order of 10-60). The model optimization (done in PyTorch) is independent for each source, I want to save the outputs to a common file (without worrying about locking/race conditions), and I mostly run this on a high-performance computing cluster managed with SLURM. For those reasons, I've been using multiprocessing rather than SLURM batch array jobs.
A recent set of jobs was cancelled for causing high CPU load due to spawning too many threads. The relevant code is as follows:
```python
import torch
import torch.multiprocessing as mp

# Intended to limit each process to a single thread
torch.set_num_threads(1)

with mp.Pool(processes=20) as pool:
    output_to_save = pool.map(myModelFit, sourcesN)
    pool.close()
```
I chose 20 processes at the request of my HPC admin, since most compute nodes on our cluster have 48 cores. Thus, I would like only 20 threads to run at any given time. However, several hundred threads are spawned, causing excessive CPU usage. The same behavior (far more threads than expected) also occurs when I run the analysis on a local server, so I believe the issue is independent of the specifications I give to SLURM (i.e. where I specify '--tasks-per-node 20').
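For concreteness, the kind of check that shows the extra threads looks roughly like this (a sketch: dummy_fit is a trivial stand-in for my real myModelFit, and the /proc-based thread count is Linux-specific):

```python
import os
import torch
import torch.multiprocessing as mp

def dummy_fit(idx):
    # Trivial stand-in for myModelFit: a matmul large enough to wake up
    # torch's intra-op thread pool, then report what this worker sees.
    x = torch.randn(2000, 2000)
    y = x @ x
    n_os_threads = len(os.listdir("/proc/self/task"))  # OS thread count (Linux only)
    return idx, torch.get_num_threads(), n_os_threads

if __name__ == "__main__":
    torch.set_num_threads(1)  # set in the parent, as in my real script
    with mp.Pool(processes=4) as pool:
        print(pool.map(dummy_fit, range(4)))
```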
When I tried the suggestion from this answer on my local server, it seemed to cap CPU usage at 100% (the same was also true on the cluster!). Does that still allow reasonably efficient use of the CPU? If so, why didn't my attempt to keep each process to a single thread work? Furthermore, it's unclear to me why the pool.map call spawns more threads than processes, when running the analysis on a single data source (i.e. without any multiprocessing call) generates just one thread. I realize that last part might require knowledge of what specifically is in myModelFit (primarily torch and np calls), but perhaps it is instead a consequence of the mp.Pool call.
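In case it clarifies what I mean by "one thread per process": the only variant I can think of that applies the limit in the workers themselves uses a pool initializer, sketched below with placeholder names (fit_one_source and sources stand in for my real myModelFit and sourcesN). Part of my question is whether something like this is actually necessary.

```python
import torch
import torch.multiprocessing as mp

def _limit_worker_threads():
    # Runs once in each worker process, so the single-thread cap is set in
    # the process that actually does the fitting, not only in the parent.
    torch.set_num_threads(1)

def fit_one_source(source):
    # Placeholder for myModelFit: just reports the per-worker thread setting.
    return source, torch.get_num_threads()

if __name__ == "__main__":
    sources = list(range(20))  # placeholder for sourcesN
    with mp.Pool(processes=20, initializer=_limit_worker_threads) as pool:
        print(pool.map(fit_one_source, sources))
```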