
I'm working on an analysis that requires fitting a model separately to each of multiple data sources (on the order of 10-60). The model optimization (done in PyTorch) is independent for each source; I want to save the outputs to a common file (without worrying about locking/race conditions); and I mostly run this on a high-performance computing cluster managed with SLURM. For those reasons, I've been using multiprocessing rather than SLURM batch array jobs.

A recent set of jobs was cancelled for causing high CPU load due to spawning too many threads. The relevant code is as follows:

import torch
import torch.multiprocessing as mp

torch.set_num_threads(1)

with mp.Pool(processes=20) as pool:
    output_to_save = pool.map(myModelFit, sourcesN)

I chose 20 processes at the request of my HPC admin, since most compute nodes on our cluster have 48 cores. I would therefore like at most 20 threads running at any given time. However, several hundred threads are spawned, causing excessive CPU usage. The same over-spawning occurs when I run the analysis on a local server, so I believe the issue is independent of the specifications I give to SLURM (i.e. where I specify '--ntasks-per-node 20').

When I tried the suggestion from this answer on my local server, it seemed to cap CPU usage at 100% per process (the same was also true on the cluster!). Is that still a reasonably efficient use of the CPU? If so, why didn't my original attempt to keep one thread per process work? Furthermore, it's unclear to me why the pool.map call spawns more threads than processes, when running the analysis on just one data source (i.e. without a multiprocessing call) generates just one thread. I realize that last part might require knowledge of what is in myModelFit (primarily torch and numpy calls), but perhaps it is a consequence of the mp.Pool call instead.

Paul Levy
    PyTorch should naturally be able to use multiple CPUs, so I'm surprised that you're doing this. Unless your model is very tiny, I'd think you could just do them sequentially (i.e. train a new model after the other one is done). Anyway, maybe use `torch.set_num_threads()` to set how many threads will be used during `myModelFit`? – Farzad Abdolhosseini Nov 22 '22 at 21:26
  • Sorry, I might've been unclear in the initial question. myModelFit is called separately for each data source, in an "embarrassingly parallel" way. I don't need multiple CPUs/threads/processes for a given myModelFit; I just want to use as many processes as permitted (on my institutional cluster, 20 for now) to speed up the model optimizations. Doing it sequentially would make it take longer! The answer I linked above did work, so in a sense, I'm curious why I needed to specify one thread per task there rather than the torch.set_num_threads() call alone being sufficient. – Paul Levy Nov 22 '22 at 23:05
  • Good to know the problem is fixed. To answer your "it's unclear to me why the pool.map call causes more threads than processes": I don't think `pool.map` is the culprit here. The issue is that the libraries (Torch, NumPy, and even underlying math libraries like OpenMP and MKL) can use multiple threads to run parts of each instance of `myModelFit` (e.g. if you do a matrix multiply), and usually the defaults are tuned for a single-process setup. – Farzad Abdolhosseini Nov 22 '22 at 23:29
  • I see - I somewhat suspected that would be the case, and saw a few threads on Stack Overflow that discussed those underlying libraries. I was confused, however, since when I run `myModelFit` on just one data source, it seemingly takes up only one thread. That would suggest that using `pool.map` on N data sources should also generate only N threads, yet that was not the case until I followed the advice in the answer linked in my original question. – Paul Levy Nov 23 '22 at 04:31
  • In SLURM you should specify how many processes your program needs. You have done this by specifying '--ntasks-per-node 20'. You can set the number of threads each process gets with '--cpus-per-task 1'. Have you set this already? – nameiki Nov 23 '22 at 09:27
  • Yes, I'd already specified that in the script. – Paul Levy Nov 25 '22 at 22:04
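A quick way to observe the library-level threading the comments describe (this is a generic illustration, not from the original thread; the first printed number depends on your machine's core count):

```python
import torch

# By default PyTorch sizes its intra-op thread pool to the machine's core
# count, so each of many worker processes can spawn many math threads
# as soon as it hits a large tensor op (e.g. a matrix multiply).
print(torch.get_num_threads())

# Capping the pool in this process:
torch.set_num_threads(1)
print(torch.get_num_threads())  # now 1
```

Crucially, this cap applies only to the process that calls it, which is why setting it in the parent before creating the Pool is not enough: each worker must apply the cap itself.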

0 Answers