Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, Pandas, and scikit-learn.

Dask is composed of two components:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.
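A minimal sketch of the task-scheduling side, using dask.delayed to build a small task graph and then execute it (the function names are illustrative):

```python
import dask

@dask.delayed
def inc(x):
    # one task in the graph
    return x + 1

@dask.delayed
def add(a, b):
    # depends on two inc tasks, which can therefore run in parallel
    return a + b

total = add(inc(1), inc(2))   # builds a task graph; nothing has run yet
result = total.compute()      # the scheduler executes the graph
# result == 5
```

The collections (dask.array, dask.dataframe, dask.bag) generate graphs like this one automatically from NumPy- and pandas-style operations.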

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

GitHub: https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions
189 votes · 12 answers

Make Pandas DataFrame apply() use all cores?

As of August 2017, Pandas DataFrame.apply() is unfortunately still limited to working with a single core, meaning that a multi-core machine will waste the majority of its compute time when you run df.apply(myfunc, axis=1). How can you use all your…
Roko Mijic · 6,655
101 votes · 1 answer

In which situations can I use Dask instead of Apache Spark?

I am currently using Pandas and Spark for data analysis. I found that Dask provides a parallelized NumPy array and Pandas DataFrame. Pandas is easy and intuitive for doing data analysis in Python, but I have difficulty handling multiple bigger…
Hariprasad · 1,611
75 votes · 5 answers

A comparison between fastparquet and pyarrow?

After some searching I failed to find a thorough comparison of fastparquet and pyarrow. I found this blog post (a basic comparison of speeds) and a GitHub discussion that claims that files created with fastparquet do not support AWS Athena (btw…
moshevi · 4,999
58 votes · 3 answers

How to transform Dask.DataFrame to pd.DataFrame?

How can I transform my resulting dask.DataFrame into a pandas.DataFrame (let's say I am done with the heavy lifting and just want to apply sklearn to my aggregate result)?
Philipp_Kats · 3,872
53 votes · 2 answers

python dask DataFrame, support for (trivially parallelizable) row apply?

I recently found the dask module, which aims to be an easy-to-use Python parallel processing module. A big selling point for me is that it works with pandas. After reading a bit of its manual page, I can't find a way to do this trivially parallelizable…
jf328 · 6,841
48 votes · 1 answer

Convert Pandas dataframe to Dask dataframe

Suppose I have a pandas dataframe: df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]}). When I convert it into a dask dataframe, what should the name and divisions parameters consist of: from dask import dataframe as dd…
rey · 1,213
43 votes · 1 answer

Out-of-core processing of sparse CSR arrays

How can one apply some function in parallel on chunks of a sparse CSR array saved on disk using Python? Sequentially this could be done e.g. by saving the CSR array with joblib.dump, opening it with joblib.load(.., mmap_mode="r"), and processing the…
rth · 10,680
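An in-memory sketch of the chunked-parallel part using dask.delayed (the on-disk memory-mapping via joblib that the question describes is omitted here, and row_sums is a stand-in for the real per-chunk function):

```python
import numpy as np
import scipy.sparse as sp
import dask

X = sp.random(1000, 50, density=0.01, format="csr", random_state=0)

def row_sums(chunk):
    # stand-in for the real per-chunk computation
    return np.asarray(chunk.sum(axis=1)).ravel()

# CSR slices rows cheaply, so chunk along axis 0
tasks = [dask.delayed(row_sums)(X[i:i + 250]) for i in range(0, X.shape[0], 250)]
result = np.concatenate(dask.compute(*tasks))
```

In the on-disk variant, each delayed task would load its own slice via the memory map instead of receiving an in-memory chunk.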
34 votes · 2 answers

Writing Dask partitions into single file

New to dask: I have a 1 GB CSV file; when I read it into a dask dataframe it creates around 50 partitions, and after my changes, when I write the file it creates as many files as there are partitions. Is there a way to write all partitions to a single CSV file, and is…
rey · 1,213
33 votes · 2 answers

An attempt has been made to start a new process before the current process has finished its bootstrapping phase

I am new to dask, and I found it very nice to have a module that makes it easy to get parallelization. I am working on a project where I was able to parallelize a loop on a single machine, as you can see here. However, I would like to move over to…
muammar · 951
32 votes · 4 answers

Strategy for partitioning dask dataframes efficiently

The documentation for Dask talks about repartitioning to reduce overhead here. However, they seem to indicate you need some knowledge of what your dataframe will look like beforehand (i.e. that there will be 1/100th of the data expected). Is there a good…
Samantha Hughes · 593
32 votes · 2 answers

Can dask parallelize reading from a CSV file?

I'm converting a large text file to an HDF store in hopes of faster data access. The conversion works all right; however, reading from the CSV file is not done in parallel. It is really slow (takes about 30 min for a 1 GB text file on an SSD, so my…
Magellan88 · 2,543
32 votes · 2 answers

Read a large csv into a sparse pandas dataframe in a memory efficient way

The pandas read_csv function doesn't seem to have a sparse option. I have csv data with a ton of zeros in it (it compresses very well, and stripping out any 0 value reduces it to almost half the original size). I've tried loading it into a dense…
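One workaround (a sketch, not necessarily the accepted answer) is to read in chunks and convert each chunk to pandas' sparse dtype before concatenating, so the dense form never covers the whole file at once:

```python
import io
import pandas as pd

# stand-in for the large on-disk CSV: mostly zeros
csv_text = "a,b\n" + "\n".join(f"{i},0" if i % 2 == 0 else "0,0" for i in range(1000))

parts = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=200):
    # sparsify each small dense chunk immediately
    parts.append(chunk.astype(pd.SparseDtype("int64", fill_value=0)))

df = pd.concat(parts)   # all-sparse DataFrame; df.sparse.density shows the savings
```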
31 votes · 1 answer

Converting numpy solution into dask (numpy indexing doesn't work in dask)

I'm trying to convert my Monte Carlo simulation from numpy into dask, because sometimes the arrays are too large and can't fit into memory. Therefore I set up a cluster of computers in the cloud: my dask cluster consists of 24 cores and 94 GB…
patex1987 · 407
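Many numpy idioms carry over once the array is chunked with dask.array, though some fancy-indexing patterns do need rewriting; a sketch of what works directly:

```python
import numpy as np
import dask.array as da

x = np.random.RandomState(0).random((1000, 1000))
dx = da.from_array(x, chunks=(250, 250))   # 4 x 4 grid of chunks

# boolean-mask indexing is supported and stays lazy until compute()
masked_sum = dx[dx > 0.5].sum().compute()
expected = x[x > 0.5].sum()
```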
31 votes · 2 answers

How to see progress of Dask compute task?

I would like to see a progress bar in a Jupyter notebook while I'm running a compute task using Dask. I'm counting all values of the id column from a large (4+ GB) CSV file, so any ideas? import dask.dataframe as dd df =…
ambigus9 · 1,417
30 votes · 2 answers

Dask: How would I parallelize my code with dask delayed?

This is my first venture into parallel processing, and I have been looking into Dask, but I am having trouble actually coding it. I have had a look at their examples and documentation, and I think dask.delayed will work best. I attempted to wrap my…
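The usual shape with dask.delayed is to wrap each loop-body call and trigger everything with one compute (process here is a stand-in for the real work):

```python
import dask
from dask import delayed

def process(x):
    # stand-in for an expensive, independent computation
    return x ** 2

tasks = [delayed(process)(i) for i in range(10)]   # lazy; nothing runs yet
results = dask.compute(*tasks)                     # executes the whole batch
```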