
I have been using this solution to compute the value_counts of a column in Pandas and store the result in a new column.

Now I'm trying to do the same for a Dask DataFrame, but the following line raises an error:

df['new_column'] = df.groupby(['A'])['B'].transform('count', meta='int').compute()

ValueError: cannot reindex from a duplicate axis

P.S. The df dataframe has four partitions.

How can I compute the value counts of column A and store them in new_column in Dask, the same way as in this answer?

rpanai
Saeed Esmaili

1 Answer

If you don't need to stick with transform (which was only introduced in the most recent Dask version, see issue), I suggest using a left merge, as in the following code.


import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({"A":[0,0,1,1,1,2,2],
                   "B":[1,2,3,4,5,6,7]})

df = dd.from_pandas(df, npartitions=2)

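# Count the rows per group; compute() returns a pandas Series with one
# entry per group, which reset_index turns into a DataFrame (A, new_column).
out = df.groupby("A")["B"]\
        .count()\
        .compute()\
        .reset_index(name="new_column")

# Left-join the counts back so that every row gets its group's count.
df = dd.merge(df, out, on=["A"], how="left")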
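
As a sanity check, here is a minimal sketch in plain pandas (no Dask) suggesting that the aggregate-then-merge approach gives the same per-row counts as transform('count'). The names pdf, counts, merged and same are only illustrative. The comparison relies on the pandas left merge keeping the row order of the left frame; I would not assume the same row order for the Dask result.

```python
import pandas as pd

# Same toy data as in the answer above.
pdf = pd.DataFrame({"A": [0, 0, 1, 1, 1, 2, 2],
                    "B": [1, 2, 3, 4, 5, 6, 7]})

# Reference result: per-row group count via transform (pandas only).
expected = pdf.groupby("A")["B"].transform("count")

# Merge-based equivalent: aggregate once, then left-join the counts back.
counts = pdf.groupby("A")["B"].count().reset_index(name="new_column")
merged = pdf.merge(counts, on="A", how="left")

# Compare the two results row by row.
same = merged["new_column"].tolist() == expected.tolist()
print(same)
```

Note that count ignores missing values in B, so if B can contain NaN and you want the number of rows per group, size may be closer to what value_counts gives.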

rpanai