
I'm trying to filter a Dask dataframe with groupby.

df = df.set_index('ngram')
sizes = df.groupby('ngram').size()
df = df[sizes > 15]

However, df.head(15) then raises ValueError: Not all divisions are known, can't align partitions. Please use `set_index` to set the index. The divisions on sizes are indeed not known:

>>> df.known_divisions
True
>>> sizes.known_divisions
False

A workaround is to call sizes.compute() (or write it out with .to_csv(...)) and then load the result back into Dask with dd.from_pandas or dd.read_csv; after that round-trip, sizes.known_divisions returns True. That's a notable inconvenience, though.

How else can this be solved? Am I using Dask wrong?

Note: there's an unanswered duplicate here.

Dominykas Mostauskis

1 Answer


In your case it appears that the indexing series is in fact much smaller than the source dataframe you want to apply it to. When that holds, it makes sense to materialise it and use simple indexing, like this:

import numpy as np
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({'ngram': np.random.choice([1, 2, 3], size=1000),
                   'other': np.random.randn(1000)})  # fake data
d = dd.from_pandas(df, npartitions=3)
sizes = d.groupby('ngram').size().compute()
d = d.set_index('ngram')  # also sorts the divisions
ngrams = sizes[sizes > 300].index.tolist()  # a list of good ngrams
d.loc[ngrams].compute()
mdurant