Groupby.transform doesn't work in dask dataframe

Question

I'm using the following dask.dataframe, AID:

   AID FID  ANumOfF
0    1   X        1
1    1   Y        5
2    2   Z        6
3    2   A        1
4    2   X       11
5    2   B       18

I know in a pandas dataframe I could use:

AID.groupby('AID')['ANumOfF'].transform('sum')

to get the group sum repeated on every row (6 for each row with AID 1, 36 for each row with AID 2).

I want to do the same with a dask dataframe, which usually offers the same functions as a pandas dataframe, but in this instance it gives me the following error:

AttributeError: 'SeriesGroupBy' object has no attribute 'transform'

I can think of two possible causes: either dask doesn't support it, or it's because I'm using Python 3.

I tried the following code:

AID.groupby('AID')['ANumOfF'].sum()

but that just gives me the sum of each group like this:

AID
1     6
2    36

I need the result in the form described above, where the sum is repeated on each row. If transform isn't supported, is there another way I could achieve the same result?

related: https://stackoverflow.com/questions/19267029/why-pandas-transform-fails-if-you-only-have-a-single-column — EdChum, Apr 04 '17 at 12:57
Hi Ed, the link says that the above should work if you have two columns. I do have two columns, and it does work with a pandas dataframe. My issue is that I have a dask dataframe, which doesn't seem to support transform. Is there a way to achieve what transform does without using transform? — BKS, Apr 04 '17 at 13:01
I've no experience with dask dfs, does this work: `AID.groupby('AID')[['ANumOfF']].transform('sum')`? this in pandas land would force a single column df to be called — EdChum, Apr 04 '17 at 13:02
As of April 2017, Dask.dataframe groupby objects do not support the transform method. You may want to [raise an issue](https://github.com/dask/dask/issues/new) to request it. — MRocklin, Apr 04 '17 at 13:10
EdChum, this works in pandas dataframe yes. But my data is so large I can't use pandas, and therefore have switched to dask. — BKS, Apr 04 '17 at 13:21

jezrael · Accepted Answer · 2017-04-04T13:08:36.703

9

I think you can use join:

s = AID.groupby('AID')['ANumOfF'].sum()
AID = AID.set_index('AID').drop('ANumOfF', axis=1).join(s).reset_index()
print (AID)
   AID FID  ANumOfF
0    1   X        6
1    1   Y        6
2    2   Z       36
3    2   A       36
4    2   X       36
5    2   B       36

Or a faster solution: map the aggregated Series (or a dict built from it) onto the AID column:

s = AID.groupby('AID')['ANumOfF'].sum()
#a bit faster
#s = AID.groupby('AID')['ANumOfF'].sum().to_dict()
AID['ANumOfF'] = AID['AID'].map(s)
print (AID)
   AID FID  ANumOfF
0    1   X        6
1    1   Y        6
2    2   Z       36
3    2   A       36
4    2   X       36
5    2   B       36

edited Apr 04 '17 at 13:08

answered Apr 04 '17 at 13:02

jezrael


Do you know how to map the results back to the dataframe for a multi-column groupby? I'd be happy to open this as another question if you think that's appropriate. – SummerEla Feb 14 '19 at 02:32
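
One possible way to handle the multi-column case raised in this comment is to aggregate, turn the group keys back into columns, and merge on those keys. The sketch below uses plain pandas on made-up data so the result is easy to check; the `FID` values and the `ANumOfF_sum` column name are hypothetical. dask dataframes offer the same `groupby`, `reset_index`, `rename` and `merge` calls, though the row order after a dask merge should not be relied on.

```python
import pandas as pd

# Made-up data with two grouping columns (values are illustrative only).
df = pd.DataFrame({
    "AID": [1, 1, 2, 2, 2, 2],
    "FID": ["X", "Y", "X", "X", "Y", "Y"],
    "ANumOfF": [1, 5, 6, 1, 11, 18],
})

keys = ["AID", "FID"]

# One row per (AID, FID) pair, with the keys restored as ordinary columns.
sums = (
    df.groupby(keys)["ANumOfF"]
    .sum()
    .reset_index()
    .rename(columns={"ANumOfF": "ANumOfF_sum"})
)

# A left merge on the keys repeats each group's sum on every matching row.
out = df.merge(sums, on=keys, how="left")
print(out)
```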

skibee · Answer 2 · 2021-01-27T09:06:55.640

0

Dask currently supports transform; however, there may be issues with indexes (depending on the original dataframe). See PR #5327.

So your code should work:

AID.groupby('AID')['ANumOfF'].transform('sum')

edited Jan 27 '21 at 09:06

answered Jan 26 '21 at 14:36

skibee

