Empirically it seems that whenever you set_index
on a Dask dataframe, Dask will always put rows with equal indexes into a single partition, even if it results in wildly imbalanced partitions.
Here is a demonstration:
import pandas as pd
import dask.dataframe as dd
users = [1]*1000 + [2]*1000 + [3]*1000
df = pd.DataFrame({'user': users})
ddf = dd.from_pandas(df, npartitions=1000)
ddf = ddf.set_index('user')
counts = ddf.map_partitions(lambda x: len(x)).compute()
counts.loc[counts > 0]
# 500 1000
# 999 2000
# dtype: int64
However, I found no guarantee of this behaviour anywhere.
I have tried to sift through the code myself but gave up. I believe one of these inter-related functions probably holds the answer:
When you set_index
, is it the case that a single index can never be in two different partitions? If not, then under what conditions does this property hold?
Bounty: I will award the bounty to an answer that draws from a reputable source. For example, referring to the implementation to show that this property has to hold.