Dask.distributed is a lightweight library for distributed computing in Python. It extends both the concurrent.futures and dask APIs to moderate sized clusters.
Questions tagged [dask-distributed]
1090 questions
33
votes
2 answers
An attempt has been made to start a new process before the current process has finished its bootstrapping phase
I am new to dask and I found so nice to have a module that makes it easy to get parallelization. I am working on a project where I was able to parallelize in a single machine a loop as you can see here . However, I would like to move over to…

muammar
- 951
- 2
- 13
- 32
31
votes
1 answer
Converting numpy solution into dask (numpy indexing doesn't work in dask)
I'm trying to convert my monte carlo simulation from numpy into dask, because sometimes the arrays are too large, and can't fit into the memory. Therefore I set up a cluster of computers in the cloud: My dask cluster consists of 24 cores and 94 GB…

patex1987
- 407
- 1
- 5
- 10
28
votes
1 answer
Best practices in setting number of dask workers
I am a bit confused by the different terms used in dask and dask.distributed when setting up workers on a cluster.
The terms I came across are: thread, process, processor, node, worker, scheduler.
My question is how to set the number of each, and if…

kristofarkas
- 395
- 3
- 9
28
votes
1 answer
how do we choose --nthreads and --nprocs per worker in dask distributed?
How do we choose --nthreads and --nprocs per worker in Dask distributed? I have 3 workers, with 4 cores and one thread per core on 2 workers and 8 cores on 1 worker (according to the output of lscpu Linux command on each worker).

Harish Rajula
- 699
- 6
- 11
19
votes
2 answers
What is the "right" way to close a Dask LocalCluster?
I am trying to use dask-distributed on my laptop using a LocalCluster, but I have still not found a way to let my application close without raising some warnings or triggering some strange iterations with matplotlib (I am using the tkAgg…

SteP
- 262
- 1
- 2
- 9
19
votes
1 answer
How to use all the cpu cores using Dask?
I have a pandas series with more than 35000 rows. I want to use dask make it more efficient. However, I both the dask code and the pandas code are taking the same time.
Initially "ser" is pandas series and fun1 and fun2 are basic functions…

ANKIT JHA
- 359
- 1
- 3
- 9
15
votes
2 answers
Dask dataframe split partitions based on a column or function
I have recently begun looking at Dask for big data.
I have a question on efficiently applying operations in parallel.
Say I have some sales data like this:
customerKey productKey transactionKey grossSales netSales unitVolume …

Roger Thomas
- 822
- 1
- 7
- 17
13
votes
4 answers
keep getting "distributed.utils_perf - WARNING - full garbage collections took 19% CPU time..."
I keep getting "distributed.utils_perf - WARNING - full garbage collections took 19% CPU time recently" warning message after I finished DASK code. I am using DASK doing a large seismic data computing. After the computing, I will write the computed…

NSJ
- 145
- 1
- 5
13
votes
3 answers
Difference between dask.distributed LocalCluster with threads vs. processes
What is the difference between the following LocalCluster configurations for dask.distributed?
Client(n_workers=4, processes=False, threads_per_worker=1)
versus
Client(n_workers=1, processes=True, threads_per_worker=4)
They both have four threads…

jrinker
- 2,010
- 2
- 14
- 17
13
votes
1 answer
How best to rechunk a NetCDF file collection to Zarr dataset
I'm trying to rechunk a NetCDF file collection and create a Zarr dataset on AWS S3. I have 168 original NetCDF4 classic files with arrays of dimension time: 1, y: 3840, x: 4608 chunked as chunks={'time':1, 'y':768, 'x':922}.
I want to write this…

Rich Signell
- 14,842
- 4
- 49
- 77
12
votes
1 answer
What causes dask job failure with CancelledError exception
I have been seeing below error message for quite some time now but could not figure out what leads to the failure.
Error:
concurrent.futures._base.CancelledError: ('sort_index-f23b0553686b95f2d91d4a3fda85f229', 7)
On restart of dask cluster it runs…

Santosh Kumar
- 761
- 5
- 28
11
votes
1 answer
Sorting in Dask
I want to find an alternative of pandas.dataframe.sort_value function in dask.
I came through set_index, but it would sort on a single column.
How can I sort multiple columns of Dask data frame?

Dhruv Kumar
- 399
- 2
- 13
10
votes
2 answers
dask-worker memory kept between tasks
Intro
I am parallelising some code using dask.distributed (embarrassingly parallel task).
I have a list of Paths pointing to different images that I scatter to workers.
Each worker loads and filters an image (3D stack) and run some filtering. 3D…

s1mc0d3
- 523
- 2
- 15
10
votes
3 answers
Dask: nunique method on Dataframe groupBy
I would like to know if it is possible to have the number of unique items from a given column after a groupBy aggregation with Dask. I don't see anything like this in the documentation. It is available on pandas dataframe and really useful. I've…

Guillaume EB
- 317
- 2
- 12
10
votes
1 answer
ValueError: Not all divisions are known, can't align partitions error on dask dataframe
I have the following pandas dataframe with the following columns
user_id user_agent_id requests
All columns contain integers. I wan't to perform some operations on them and run them using dask dataframe. This is what I do.
user_profile =…

Apostolos
- 7,763
- 17
- 80
- 150