I have a Dask dataframe of around 70GB and 3 columns that does not fit into memory. My machine is an 8-core Xeon with 64GB of RAM, running a local Dask cluster.
I have to join each of the 3 columns to another, even larger dataframe.
The documentation recommends partition sizes of around 100MB. However, given this amount of data, joining 700 partitions seems like a lot more work than, for example, joining 70 partitions of 1000MB each.
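For reference, the join currently looks roughly like this (a minimal sketch; the file names, column names, and the join key `id` are placeholders, not my real schema):

```python
import dask.dataframe as dd
from dask.distributed import Client

client = Client()  # local cluster with default settings

# Placeholder inputs: the ~70GB, 3-column frame and the even larger frame.
small = dd.read_parquet("small_70gb.parquet")
large = dd.read_parquet("large.parquet")

# Roughly 700 partitions at ~100MB each, per the documentation's guideline;
# npartitions=70 would give ~1000MB partitions instead.
small = small.repartition(npartitions=700)

# One of the three joins, keyed on the placeholder column "id".
joined = large.merge(small[["id", "col_a"]], on="id", how="left")
joined.to_parquet("joined_col_a.parquet")
```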
Is there a reason to keep it at 700 x 100MB partitions? If not, which partition size should be used here? Does this also depend on the number of workers I use? For example:
- 1 x 50GB worker
- 2 x 25GB worker
- 3 x 17GB worker
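This is how I would set up each of the three options above with a `LocalCluster` (a sketch only; the thread counts are my guess, and `memory_limit` is per worker):

```python
from dask.distributed import Client, LocalCluster

# 1 worker x 50GB:
# cluster = LocalCluster(n_workers=1, threads_per_worker=8, memory_limit="50GB")

# 2 workers x 25GB:
cluster = LocalCluster(n_workers=2, threads_per_worker=4, memory_limit="25GB")

# 3 workers x 17GB:
# cluster = LocalCluster(n_workers=3, threads_per_worker=2, memory_limit="17GB")

client = Client(cluster)
```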