I am having a strange bug with a pandas groupby in one of my Databricks notebooks.
The data are confidential, so this is a dummy illustration of my bug (the dataframe df is actually the result of a merge of two other dataframes).
To reproduce the dataframe:
import pandas as pd

# Dummy data standing in for the confidential merged dataframe
data = {'group1': ['a', 'b', 'a', 'a', 'a'],
        'group2': ['f', 'f', 'f', 'f', 'f'],
        'aggregate': ['1', '2', '3', '4', '5']}
df = pd.DataFrame(data, columns=['group1', 'group2', 'aggregate'])
At this stage, the dataframe df is displayed correctly. Now I do a groupby:
agg = df.groupby(['group2', 'group1'], as_index=False).agg({'aggregate':', '.join})
I should be getting one row per (group2, group1) pair, with the aggregate values joined by ', '.
But I am getting this:
ValueError: Length of values does not match length of index
The only ways to "make it work" are:
Fix 1: agg = df.groupby(['group2', 'group1'], as_index=True).agg({'aggregate':', '.join}).reset_index()
With this, I get the expected result.
Fix 2: After the initial merge, "reset" the dataframe by rebuilding it from a dict, so that I get a fresh one. This works perfectly but is not very clean.
df = pd.DataFrame.from_dict(df.to_dict())
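For reference, here is a minimal, self-contained sketch of Fix 1 applied to the dummy data only. It does not reproduce the error, since the dummy dataframe is built from scratch rather than from the real merge; it just shows the shape of the result I expect, with one row for ('f', 'a') holding '1, 3, 4, 5' and one row for ('f', 'b') holding '2'.

```python
import pandas as pd

# Dummy data from the question (not the real, confidential data).
data = {
    'group1': ['a', 'b', 'a', 'a', 'a'],
    'group2': ['f', 'f', 'f', 'f', 'f'],
    'aggregate': ['1', '2', '3', '4', '5'],
}
df = pd.DataFrame(data, columns=['group1', 'group2', 'aggregate'])

# Fix 1: keep the group keys as the index during aggregation,
# then move them back into ordinary columns with reset_index().
agg = (
    df.groupby(['group2', 'group1'], as_index=True)
      .agg({'aggregate': ', '.join})
      .reset_index()
)

print(agg)
```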
Is my data somehow corrupted? If so, how?
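In case it helps narrow things down, here is a minimal sketch of structural checks I could run on the merged dataframe before and after Fix 2. The helper name frame_diagnostics is hypothetical, and I am only assuming that a non-unique index or duplicated column labels left over from the merge are worth ruling out; I have not confirmed that either is the cause. It is shown on the dummy data, where everything looks normal.

```python
import pandas as pd

def frame_diagnostics(frame: pd.DataFrame) -> dict:
    """Collect a few structural facts about a dataframe (hypothetical helper)."""
    return {
        # Kind of index the frame carries (e.g. RangeIndex, MultiIndex).
        'index_type': type(frame.index).__name__,
        # False if the same index label appears on more than one row.
        'index_is_unique': frame.index.is_unique,
        # False if the same column label appears more than once.
        'columns_are_unique': frame.columns.is_unique,
        # Column labels that occur more than once, if any.
        'duplicated_columns': frame.columns[frame.columns.duplicated()].tolist(),
        # Column dtypes rendered as strings for easy comparison.
        'dtypes': frame.dtypes.astype(str).to_dict(),
    }

# Dummy frame from the question; the real one would be the merge result.
df = pd.DataFrame({
    'group1': ['a', 'b', 'a', 'a', 'a'],
    'group2': ['f', 'f', 'f', 'f', 'f'],
    'aggregate': ['1', '2', '3', '4', '5'],
})

report = frame_diagnostics(df)
for name, value in report.items():
    print(f'{name}: {value}')
```

Comparing this report for the merged dataframe against the one for the dataframe rebuilt in Fix 2 would at least show whether the two differ structurally.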
Any feedback would be much appreciated, whether or not you know the reason for the bug (knowing it would be great!), just so that I can better understand what could be happening behind the scenes.
Very much looking forward to any suggestions or opinions. Thank you!