I'm trying to filter a Dask dataframe with groupby.
df = df.set_index('ngram')
sizes = df.groupby('ngram').size()
df = df[sizes > 15]
However, df.head(15) throws the error "ValueError: Not all divisions are known, can't align partitions. Please use `set_index` to set the index." The divisions on sizes are not known:
>>> df.known_divisions
True
>>> sizes.known_divisions
False
A workaround is to call sizes.compute() or .to_csv(...) and then read the result back into Dask with dd.from_pandas or dd.read_csv. After that, sizes.known_divisions returns True. That's a notable inconvenience.
How else can this be solved? Am I using Dask wrong?
Note: there's an unanswered duplicate here.