I've been trying to get the n largest rows per group by following the method from this question. The solution to that question is correct up to a point.

In this example, I group by column A and want to return the rows of B and C based on the top two values in B.

For some reason the index of grp_df is multilevel and includes both A and the original index of ddf.
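As far as I can tell, plain pandas does the same thing when the applied function returns a DataFrame, so this may be pandas groupby-apply behavior rather than anything Dask-specific. A minimal pandas-only sketch with the same data as below:

# Pandas-only sketch: groupby().apply() returning a DataFrame keeps the
# group key plus each row's original index, i.e. a MultiIndex.
import numpy as np
import pandas as pd

np.random.seed(42)
pdf = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list('ABCD'))
out = pdf.groupby('A')[['B', 'C']].apply(lambda x: x.nlargest(2, columns=['B']))
print(out.index.names)  # ['A', None] -- the second level is the original index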

I was hoping to simply reset_index(), drop the unwanted index level, and just keep A, but I get the following error:

ValueError: The columns in the computed data do not match the columns in the provided metadata

Here is a simple example reproducing the error:

import numpy as np
import dask.dataframe as dd
import pandas as pd

np.random.seed(42)

df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))

ddf = dd.from_pandas(df, npartitions=3)

grp_df = ddf.groupby('A')[['B','C']].apply(
    lambda x: x.nlargest(2, columns=['B']), meta={'B': 'f8', 'C': 'f8'})

# Print is successful and results are correct
print(grp_df.head())

grp_df = grp_df.reset_index()

# Print is unsuccessful and raises the error above
print(grp_df.head())
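To see where the mismatch comes from, here is a diagnostic sketch (grp_df2 is just a fresh copy of the same computation): Dask derives the expected columns from meta, and a dict-style meta implies a plain RangeIndex, so Dask expects reset_index() to add a single 'index' column, while the computed result resets its ('A', original index) MultiIndex into two columns.

# Diagnostic sketch: expected columns come from meta, actual ones from computing
grp_df2 = ddf.groupby('A')[['B','C']].apply(
    lambda x: x.nlargest(2, columns=['B']), meta={'B': 'f8', 'C': 'f8'})
print(grp_df2.reset_index().columns.tolist())            # ['index', 'B', 'C'] (from meta)
print(grp_df2.compute().reset_index().columns.tolist())  # ['A', 'level_1', 'B', 'C'] (actual)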

1 Answer

Found an approach for a solution here.

The following code now allows reset_index() to work and gets rid of the original ddf index. I'm still not sure why the original ddf index came through the groupby in the first place, though.

# Build meta with an explicit empty MultiIndex matching the computed result:
# level 'A' (the group key) plus the unnamed level from the original index.
meta = pd.DataFrame(columns=['B', 'C'], dtype=int,
                    index=pd.MultiIndex(levels=[[], []], codes=[[], []], names=['A', None]))
grp_df = ddf.groupby('A')[['B','C']].apply(lambda x: x.nlargest(2, columns=['B']), meta=meta)

# reset_index() now works; drop the column created from the unnamed level
grp_df = grp_df.reset_index().drop('level_1', axis=1)
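An alternative that avoids hand-building the MultiIndex (just a sketch; it assumes a small pandas sample produces the same dtypes and index structure as the full computation) is to derive the meta from the pandas equivalent and keep only its empty structure:

# Sketch: build meta by running the same groupby-apply on a pandas sample;
# head(0) keeps the MultiIndex/column structure but drops the rows.
meta = df.head(10).groupby('A')[['B','C']].apply(
    lambda x: x.nlargest(2, columns=['B'])).head(0)
grp_df = ddf.groupby('A')[['B','C']].apply(lambda x: x.nlargest(2, columns=['B']), meta=meta)
grp_df = grp_df.reset_index().drop('level_1', axis=1)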