Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, Pandas, and scikit-learn.

Dask is composed of two components:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.
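A minimal sketch of the task-scheduling side, using dask.delayed to build a small task graph and then execute it (the function names are illustrative):

```python
import dask

@dask.delayed
def inc(x):
    # one task in the graph
    return x + 1

@dask.delayed
def add(a, b):
    # depends on two inc tasks, which can therefore run in parallel
    return a + b

total = add(inc(1), inc(2))   # builds a task graph; nothing has run yet
result = total.compute()      # the scheduler executes the graph
# result == 5
```

The collections (dask.array, dask.dataframe, dask.bag) generate graphs like this one automatically from NumPy- and pandas-style operations.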

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

GitHub: https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions
189 votes · 12 answers

Make Pandas DataFrame apply() use all cores?

As of August 2017, Pandas DataFrame.apply() is unfortunately still limited to working with a single core, meaning that a multi-core machine will waste the majority of its compute time when you run df.apply(myfunc, axis=1). How can you use all your…
Roko Mijic · 6,655
101 votes · 1 answer

In which situations can I use Dask instead of Apache Spark?

I am currently using Pandas and Spark for data analysis. I found that Dask provides a parallelized NumPy array and Pandas DataFrame. Pandas is easy and intuitive for doing data analysis in Python, but I have difficulty handling multiple bigger…
Hariprasad · 1,611
75 votes · 5 answers

A comparison between fastparquet and pyarrow?

After some searching I failed to find a thorough comparison of fastparquet and pyarrow. I found this blog post (a basic comparison of speeds) and a GitHub discussion that claims that files created with fastparquet do not support AWS Athena (btw…
moshevi · 4,999
58 votes · 3 answers

How to transform Dask.DataFrame to pd.DataFrame?

How can I transform my resulting dask.DataFrame into a pandas.DataFrame (let's say I am done with the heavy lifting and just want to apply sklearn to my aggregate result)?
Philipp_Kats · 3,872
53 votes · 2 answers

python dask DataFrame, support for (trivially parallelizable) row apply?

I recently found the dask module, which aims to be an easy-to-use Python parallel processing module. A big selling point for me is that it works with pandas. After reading a bit of its manual page, I can't find a way to do this trivially parallelizable…
jf328 · 6,841
48 votes · 1 answer

Convert Pandas dataframe to Dask dataframe

Suppose I have a pandas dataframe: df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]}). When I convert it into a dask dataframe, what should the name and divisions parameters consist of: from dask import dataframe as dd…
rey · 1,213
43 votes · 1 answer

Out-of-core processing of sparse CSR arrays

How can one apply some function in parallel on chunks of a sparse CSR array saved on disk using Python? Sequentially this could be done e.g. by saving the CSR array with joblib.dump, opening it with joblib.load(.., mmap_mode="r"), and processing the…
rth · 10,680
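An in-memory sketch of the chunked-parallel part using dask.delayed (the on-disk memory-mapping via joblib that the question describes is omitted here, and row_sums is a stand-in for the real per-chunk function):

```python
import numpy as np
import scipy.sparse as sp
import dask

X = sp.random(1000, 50, density=0.01, format="csr", random_state=0)

def row_sums(chunk):
    # stand-in for the real per-chunk computation
    return np.asarray(chunk.sum(axis=1)).ravel()

# CSR slices rows cheaply, so chunk along axis 0
tasks = [dask.delayed(row_sums)(X[i:i + 250]) for i in range(0, X.shape[0], 250)]
result = np.concatenate(dask.compute(*tasks))
```

In the on-disk variant, each delayed task would load its own slice via the memory map instead of receiving an in-memory chunk.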
34 votes · 2 answers

Writing Dask partitions into single file

New to dask: I have a 1 GB CSV file; when I read it into a dask dataframe it creates around 50 partitions, and after my changes, when I write the file it creates as many files as there are partitions. Is there a way to write all partitions to a single CSV file, and is…
rey · 1,213
33 votes · 2 answers

An attempt has been made to start a new process before the current process has finished its bootstrapping phase

I am new to dask, and I found it very nice to have a module that makes it easy to get parallelization. I am working on a project where I was able to parallelize a loop on a single machine, as you can see here. However, I would like to move over to…
muammar · 951
32 votes · 4 answers

Strategy for partitioning dask dataframes efficiently

The documentation for Dask talks about repartitioning to reduce overhead here. However, they seem to indicate you need some knowledge of what your dataframe will look like beforehand (i.e. that there will be 1/100th of the data expected). Is there a good…
Samantha Hughes · 593
32 votes · 2 answers

Can dask parallelize reading from a CSV file?

I'm converting a large text file to an HDF store in hopes of faster data access. The conversion works all right; however, reading from the CSV file is not done in parallel. It is really slow (takes about 30 min for a 1 GB text file on an SSD, so my…
Magellan88 · 2,543
32 votes · 2 answers

Read a large csv into a sparse pandas dataframe in a memory efficient way

The pandas read_csv function doesn't seem to have a sparse option. I have csv data with a ton of zeros in it (it compresses very well, and stripping out any 0 value reduces it to almost half the original size). I've tried loading it into a dense…
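One workaround (a sketch, not necessarily the accepted answer) is to read in chunks and convert each chunk to pandas' sparse dtype before concatenating, so the dense form never covers the whole file at once:

```python
import io
import pandas as pd

# stand-in for the large on-disk CSV: mostly zeros
csv_text = "a,b\n" + "\n".join(f"{i},0" if i % 2 == 0 else "0,0" for i in range(1000))

parts = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=200):
    # sparsify each small dense chunk immediately
    parts.append(chunk.astype(pd.SparseDtype("int64", fill_value=0)))

df = pd.concat(parts)   # all-sparse DataFrame; df.sparse.density shows the savings
```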
31 votes · 1 answer

Converting numpy solution into dask (numpy indexing doesn't work in dask)

I'm trying to convert my Monte Carlo simulation from numpy into dask, because sometimes the arrays are too large and can't fit into memory. Therefore I set up a cluster of computers in the cloud: my dask cluster consists of 24 cores and 94 GB…
patex1987 · 407
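Many numpy idioms carry over once the array is chunked with dask.array, though some fancy-indexing patterns do need rewriting; a sketch of what works directly:

```python
import numpy as np
import dask.array as da

x = np.random.RandomState(0).random((1000, 1000))
dx = da.from_array(x, chunks=(250, 250))   # 4 x 4 grid of chunks

# boolean-mask indexing is supported and stays lazy until compute()
masked_sum = dx[dx > 0.5].sum().compute()
expected = x[x > 0.5].sum()
```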
31 votes · 2 answers

How to see progress of Dask compute task?

I would like to see a progress bar in a Jupyter notebook while I'm running a compute task using Dask. I'm counting all values of the id column from a large (4+ GB) CSV file, so any ideas? import dask.dataframe as dd df =…
ambigus9 · 1,417
30 votes · 2 answers

Dask: How would I parallelize my code with dask delayed?

This is my first venture into parallel processing, and I have been looking into Dask, but I am having trouble actually coding it. I have had a look at their examples and documentation, and I think dask.delayed will work best. I attempted to wrap my…
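The usual shape with dask.delayed is to wrap each loop-body call and trigger everything with one compute (process here is a stand-in for the real work):

```python
import dask
from dask import delayed

def process(x):
    # stand-in for an expensive, independent computation
    return x ** 2

tasks = [delayed(process)(i) for i in range(10)]   # lazy; nothing runs yet
results = dask.compute(*tasks)                     # executes the whole batch
```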