I was inspired by both Victor's and Martin's answers to dig a little deeper, so here's an in-depth summary of my understanding (it wouldn't fit in a comment).
First, note that the scheduler printout in this version of dask isn't quite intuitive: `processes` is actually the number of workers, and `cores` is actually the total number of threads across all workers.
Secondly, Victor's comments about the TCP address and adding/connecting more workers are good points to keep in mind. I'm not sure whether more workers can be added to a cluster created with `processes=False`, but I think the answer is probably yes.
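As a rough check of that guess, here's a minimal sketch of asking a `processes=False` cluster for more workers via `LocalCluster.scale()`. The `scale`/`wait_for_workers` calls are my assumption about the right API for this (and `wait_for_workers` may not exist in older versions); in a threaded cluster the "new" workers are just additional worker objects running as threads inside the same process:

    from dask.distributed import Client, LocalCluster

    if __name__ == '__main__':
        # Start an in-process (threaded) cluster with a single worker.
        cluster = LocalCluster(processes=False, n_workers=1, threads_per_worker=1)
        client = Client(cluster)
        print(client)

        # Ask the cluster for more workers; with processes=False these are
        # additional worker objects running as threads in this same process.
        cluster.scale(4)
        client.wait_for_workers(4)
        print(client)

        client.close()
        cluster.close()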
Now, consider the following script:
    from dask.distributed import Client

    if __name__ == '__main__':
        with Client(processes=False) as client:  # Config 1
            print(client)
        with Client(processes=False, n_workers=4) as client:  # Config 2
            print(client)
        with Client(processes=False, n_workers=3) as client:  # Config 3
            print(client)
        with Client(processes=True) as client:  # Config 4
            print(client)
        with Client(processes=True, n_workers=3) as client:  # Config 5
            print(client)
        with Client(processes=True, n_workers=3,
                    threads_per_worker=1) as client:  # Config 6
            print(client)
This produces the following output with `dask` version 2.3.0 on my laptop (4 cores):
    <Client: scheduler='inproc://90.147.106.86/14980/1' processes=1 cores=4>
    <Client: scheduler='inproc://90.147.106.86/14980/9' processes=4 cores=4>
    <Client: scheduler='inproc://90.147.106.86/14980/26' processes=3 cores=6>
    <Client: scheduler='tcp://127.0.0.1:51744' processes=4 cores=4>
    <Client: scheduler='tcp://127.0.0.1:51788' processes=3 cores=6>
    <Client: scheduler='tcp://127.0.0.1:51818' processes=3 cores=3>
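One way to convince yourself that `processes` and `cores` here really mean workers and total threads is to ask the scheduler directly. A minimal sketch for the Config 3 case (the exact numbers obviously depend on your core count):

    from dask.distributed import Client

    if __name__ == '__main__':
        with Client(processes=False, n_workers=3) as client:  # same as Config 3
            workers = client.scheduler_info()['workers']
            nthreads = client.nthreads()              # {worker address: thread count}
            print(len(workers))                       # 3 -> the "processes" field
            print(sum(nthreads.values()))             # 6 -> the "cores" field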
Here's my understanding of the differences between the configurations:
1. The scheduler and all workers are run as threads within the Client process. (As Martin said, this is useful for introspection.) Because neither the number of workers nor the number of threads per worker is given, `dask` calls its function `nprocesses_nthreads()` to set the defaults (with `processes=False`: 1 process, with threads equal to the available cores).
2. Same as 1, but since `n_workers` was given, the number of threads per worker is chosen by dask so that the total number of threads equals the number of cores (i.e., 1 thread per worker). Again, `processes` in the printout is not exactly correct -- it's actually the number of workers (which in this case are actually threads).
3. Same as 2, but since `n_workers` doesn't divide evenly into the number of cores, dask chooses 2 threads per worker, overcommitting rather than undercommitting (see the sketch after this list).
4. The Client, Scheduler and all workers are separate processes. Dask chooses the default number of workers (equal to the number of cores, because it's <= 4) and the default number of threads per worker (1).
5. Same process layout as 4 (separate processes for the Client, Scheduler and workers), but the total number of threads is overcommitted for the same reason as in 3.
6. This behaves as expected.
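The overcommitting in 3 and 5 looks like a simple ceiling division of cores by workers. Here's a rough illustration of that effect (this is not dask's actual implementation -- that logic lives in `nprocesses_nthreads()` and its callers -- just the arithmetic as I understand it):

    import math

    def threads_per_worker(n_cores, n_workers):
        # Illustration of the apparent behaviour: round up rather than
        # leave cores idle (not dask's actual code).
        return max(1, math.ceil(n_cores / n_workers))

    for n_workers in (4, 3):
        tpw = threads_per_worker(4, n_workers)
        print(n_workers, 'workers x', tpw, 'threads =', n_workers * tpw, 'total')
    # 4 workers x 1 threads = 4 total   (Configs 2 and 4)
    # 3 workers x 2 threads = 6 total   (Configs 3 and 5)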