
I am hitting a strange bug with a pandas groupby in one of my Databricks notebooks.

The data is confidential, so this is a dummy illustration of the bug (the dataframe df is actually the result of a merge of two other dataframes).

To reproduce the dataframe:

import pandas as pd

data = {'group1': ['a', 'b', 'a', 'a', 'a'],
        'group2': ['f', 'f', 'f', 'f', 'f'],
        'aggregate': ['1', '2', '3', '4', '5']}
df = pd.DataFrame(data, columns=['group1', 'group2', 'aggregate'])

At this stage, the dataframe df is displayed correctly. Now I do a groupby:

agg = df.groupby(['group2', 'group1'],  as_index=False).agg({'aggregate':', '.join})

I should be getting this:

  group2 group1   aggregate
0      f      a  1, 3, 4, 5
1      f      b           2

But I am getting this:

ValueError: Length of values does not match length of index

The only ways I have found to make it work are:

Fix 1: agg = df.groupby(['group2', 'group1'], as_index=True).agg({'aggregate':', '.join}).reset_index()

With this I get an extra row with an empty group1 and a NaN aggregate:

group2 group1   aggregate
0      f      a  1, 3, 4, 5
1      f      b           2
2      f                NaN

Fix 2: After the initial merge, rebuild the dataframe from a dict to get a fresh one. This works perfectly but is not really nice.

df = pd.DataFrame.from_dict(df.to_dict())

Is my data somehow corrupted? If so, how?

Any feedback would be much appreciated, whether or not you know the reason for the bug (knowing it would be great!), so that I can better understand what is happening behind the scenes.

Looking forward to any suggestion or opinion. Thank you!

Amir Maleki
OrganicMustard

2 Answers

Pandas throws that error when one of the groupby columns has the category dtype.
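A minimal sketch of how one might check for this, assuming (the question does not confirm it) that the merge left group1 as a categorical column. The cast to 'category' below is only there to simulate that situation on the dummy data:

```python
import pandas as pd

df = pd.DataFrame({
    'group1': ['a', 'b', 'a', 'a', 'a'],
    'group2': ['f', 'f', 'f', 'f', 'f'],
    'aggregate': ['1', '2', '3', '4', '5'],
})

# Assumption: simulate a column that came out of the merge as categorical.
df['group1'] = df['group1'].astype('category')

# List the columns whose dtype is categorical.
categorical_cols = [
    col for col in df.columns
    if isinstance(df[col].dtype, pd.CategoricalDtype)
]

print(df.dtypes)
print(categorical_cols)
```

If the groupby keys show up in that list, the dtype is the likely culprit rather than corrupted data.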

A workaround could be to call astype and cast the groupby columns to a non-categorical dtype, e.g. string:

df = df.astype({'group1': 'string', 'group2': 'string'})
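Put together with the groupby from the question, the workaround might look like the sketch below. The initial cast to 'category' is an assumption used to mimic the problematic dataframe, and the name plain is purely illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    'group1': ['a', 'b', 'a', 'a', 'a'],
    'group2': ['f', 'f', 'f', 'f', 'f'],
    'aggregate': ['1', '2', '3', '4', '5'],
})

# Assumption: mimic groupby keys that ended up categorical after a merge.
df = df.astype({'group1': 'category', 'group2': 'category'})

# Cast the groupby keys to the string dtype before grouping.
plain = df.astype({'group1': 'string', 'group2': 'string'})

agg = plain.groupby(['group2', 'group1'], as_index=False).agg(
    {'aggregate': ', '.join}
)
print(agg)
```

The trade-off is that the key columns are no longer categorical after the cast, so they would need to be converted back if later code relies on that dtype.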

As of March 2023 it is still an unresolved bug.

BTW, this question probably has the same error.

Filippo Vitale
  • A better solution is provided in the github issue if you want to keep the categorical format after the groupby: use `as_index=True` and after the `agg` use `reset_index(names=['list', 'of', 'column', 'names'])` – Arthur Spoon Apr 27 '23 at 10:46

I tested the same code and got the expected result.

I tested in Google Colab with pandas 1.1.5, so the issue may be your pandas version.

Amir Maleki