I have a Dask dataframe of around 70GB and 3 columns that does not fit into memory. My machine is an 8-core Xeon with 64GB of RAM, running a local Dask cluster.
I have to join each of the 3 columns to another, even larger dataframe.
The documentation recommends partition sizes of around 100MB. However, given this amount of data, joining 700 partitions seems like a lot more work than, for example, joining 70 partitions of 1000MB each.
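For reference, the join currently looks roughly like this (a minimal sketch; the file names, column names, and the join key `id` are placeholders, not my real schema):

```python
import dask.dataframe as dd
from dask.distributed import Client

client = Client()  # local cluster with default settings

# Placeholder inputs: the ~70GB, 3-column frame and the even larger frame.
small = dd.read_parquet("small_70gb.parquet")
large = dd.read_parquet("large.parquet")

# Roughly 700 partitions at ~100MB each, per the documentation's guideline;
# npartitions=70 would give ~1000MB partitions instead.
small = small.repartition(npartitions=700)

# One of the three joins, keyed on the placeholder column "id".
joined = large.merge(small[["id", "col_a"]], on="id", how="left")
joined.to_parquet("joined_col_a.parquet")
```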
Is there a reason to keep it at 700 x 100MB partitions? If not, which partition size should be used here? Does this also depend on the number of workers I use? For example:
- 1 x 50GB worker
- 2 x 25GB worker
- 3 x 17GB worker
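This is how I would set up each of the three options above with a `LocalCluster` (a sketch only; the thread counts are my guess, and `memory_limit` is per worker):

```python
from dask.distributed import Client, LocalCluster

# 1 worker x 50GB:
# cluster = LocalCluster(n_workers=1, threads_per_worker=8, memory_limit="50GB")

# 2 workers x 25GB:
cluster = LocalCluster(n_workers=2, threads_per_worker=4, memory_limit="25GB")

# 3 workers x 17GB:
# cluster = LocalCluster(n_workers=3, threads_per_worker=2, memory_limit="17GB")

client = Client(cluster)
```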