
I am a bit confused by the different terms used in dask and dask.distributed when setting up workers on a cluster.

The terms I came across are: thread, process, processor, node, worker, scheduler.

My question is how to set the number of each, and whether there is a strict or recommended relationship between any of them. For example:

  • 1 worker per node with n processes for the n cores on the node
  • Are threads and processes the same concept? In dask-mpi I have to set nthreads, but they show up as processes in the client

Any other suggestions?

kristofarkas

1 Answer


By "node" people typically mean a physical or virtual machine. That node can run several programs or processes at once (much like how my computer can run a web browser and text editor at once). Each process can parallelize within itself with many threads. Processes have isolated memory environments, meaning that sharing data within a process is free, while sharing data between processes is expensive.

Typically things work best on larger nodes (like 36 cores) if you cut them up into a few processes, each of which has several threads. You want the number of processes times the number of threads to equal the number of cores. So for a 36-core machine you might do something like the following (see the sketch after this list):

  • Four processes with nine threads each
  • Twelve processes with three threads each
  • One process with thirty-six threads
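
As a concrete illustration, the first of those layouts could be launched with `dask.distributed`'s `LocalCluster` (the 4 × 9 split is just the example from the list above, not a universal recommendation):

```python
from dask.distributed import Client, LocalCluster

# One of the 36-core layouts above: four worker processes,
# each running nine threads (4 * 9 = 36 cores)
cluster = LocalCluster(n_workers=4, threads_per_worker=9)
client = Client(cluster)
print(client)  # shows the process/thread layout the scheduler sees
```

On a multi-machine cluster the same split is usually expressed on the `dask-worker` command line, e.g. `dask-worker scheduler-address:8786 --nprocs 4 --nthreads 9` (in newer releases the process flag is spelled `--nworkers`).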

Typically one decides between these choices based on the workload. The difference here is due to Python's Global Interpreter Lock (GIL), which limits parallelism for some kinds of data. If you are working mostly with NumPy, Pandas, Scikit-Learn, or other numerical programming libraries in Python then you don't need to worry about the GIL, and you probably want to prefer a few processes with many threads each. This helps because data can move freely between your cores, since it all lives in the same process. However, if you're doing mostly pure-Python programming, like dealing with text data, dictionaries/lists/sets, and doing most of your computation in tight Python for loops, then you'll want to prefer having many processes with few threads each. This incurs extra communication costs, but lets you bypass the GIL.

In short, if you're using mostly numpy/pandas-style data, try to get at least eight threads or so in a process. Otherwise, maybe go for only two threads in a process.
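
A rough sketch of those two regimes, assuming an 8-core machine (the exact numbers are illustrative, not prescriptive):

```python
from dask.distributed import Client, LocalCluster

# Numeric workloads: numpy/pandas release the GIL in their compiled code,
# so one process with many threads keeps all data in shared memory
client = Client(LocalCluster(n_workers=1, threads_per_worker=8))

# Pure-Python workloads hold the GIL, so prefer many processes
# with few threads each (commented out so only one cluster starts):
# client = Client(LocalCluster(n_workers=4, threads_per_worker=2))
```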

MRocklin
  • By node I meant a compute node on an HPC. For my particular case a node has 16 cores and 1 GPU. The program I want to run is OpenMM, a molecular dynamics simulation engine written in C++ with bindings for python (therefore no GIL problems I assume). Each simulation requires a GPU, so 1 simulation per node. Therefore 1 worker with 1 process and 16 threads per node is what I should be using with `--resources "GPU=1"`? – kristofarkas Jun 29 '18 at 12:16
  • Yes, using worker resources seems like a sensible choice – MRocklin Jun 29 '18 at 15:52
  • It would be useful if the answer included the definition of a `LocalClient` and a remote client (e.g. `YarnCluster`). We are using a small cluster (with 6 nodes) where each node has 64 VCores and 300GB RAM, and I'm not able to configure the `YarnCluster` to run each `vcore` as a `worker`. – skibee Apr 23 '19 at 13:09
  • I have 6 "core" machines with 12 "logical processors." Should I be shooting for a magic number of 6 or 12? – sameagol Aug 30 '19 at 06:20
  • That depends entirely on the computations that you want to run. This isn't so much a question about Dask. It's a question about how your code likes to run in parallel. For example, numpy might behave differently here than pandas. – MRocklin Aug 31 '19 at 04:06
  • Hi Matthew, consider explicitly stating in your answer the connection between processes and workers (namely, that each worker is a process). Small detail, but it tripped me up for a while. – jrinker Sep 02 '19 at 15:55
  • @MRocklin I agree with the above comment. Across several Stack Overflow questions you have answered, I was also confused by the concepts of processes and workers. I assume a worker is a process. Given that `multiprocessing.cpu_count() = 8` on a local machine, for numpy/pandas-style data operations, do you mean `cluster = LocalCluster(n_workers=2, memory_limit='auto', threads_per_worker=4)` would be better than more workers with fewer threads? – jerrytim Feb 13 '20 at 18:29
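
Following up on the worker-resources exchange in the comments above, a minimal sketch of the pattern discussed there (the scheduler address and `run_simulation` function are placeholders, not part of the original answer):

```python
from dask.distributed import Client

# Each worker would be started with something like:
#   dask-worker scheduler-address:8786 --nthreads 16 --resources "GPU=1"
client = Client("scheduler-address:8786")  # placeholder address

def run_simulation(config):
    ...  # hypothetical driver for one GPU-bound simulation

# The scheduler places each task on a worker with a free GPU slot
future = client.submit(run_simulation, "config.yaml", resources={"GPU": 1})
result = future.result()
```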