
I'm using the following dask.dataframe, AID:

   AID FID  ANumOfF
0    1   X        1
1    1   Y        5
2    2   Z        6
3    2   A        1
4    2   X       11
5    2   B       18

I know in a pandas dataframe I could use:

AID.groupby('AID')['ANumOfF'].transform('sum')

to get:

0     6
1     6
2    36
3    36
4    36
5    36

I want to do the same with a dask.dataframe, which usually offers the same functions as a pandas dataframe, but in this case I get the following error:

AttributeError: 'SeriesGroupBy' object has no attribute 'transform'

It could be one of two things: either dask doesn't support it, or it's because I'm using Python 3?

I tried the following code:

AID.groupby('AID')['ANumOfF'].sum()

but that just gives me the sum of each group like this:

AID
1     6
2    36

I need the result in the form shown above, with the group sum repeated on each row. My question is: if transform isn't supported, is there another way I could achieve the same result?

BKS
  • related: https://stackoverflow.com/questions/19267029/why-pandas-transform-fails-if-you-only-have-a-single-column – EdChum Apr 04 '17 at 12:57
  • Hi Ed, the link says that the above should work if you have two columns. I do have two columns, and it does work with a pandas dataframe. My issue is that I have a dask dataframe, which doesn't seem to support transform. Is there a way to achieve what transform does without using transform? – BKS Apr 04 '17 at 13:01
  • I've no experience with dask dfs, does this work: `AID.groupby('AID')[['ANumOfF']].transform('sum')`? this in pandas land would force a single column df to be called – EdChum Apr 04 '17 at 13:02
  • As of April 2017, Dask.dataframe groupby objects do not support the transform method. You may want to [raise an issue](https://github.com/dask/dask/issues/new) to request it. – MRocklin Apr 04 '17 at 13:10
  • EdChum, this works in pandas dataframe yes. But my data is so large I can't use pandas, and therefore have switched to dask. – BKS Apr 04 '17 at 13:21

2 Answers


I think you can use join:

s = AID.groupby('AID')['ANumOfF'].sum()
AID = AID.set_index('AID').drop('ANumOfF', axis=1).join(s).reset_index()
print (AID)
   AID FID  ANumOfF
0    1   X        6
1    1   Y        6
2    2   Z       36
3    2   A       36
4    2   X       36
5    2   B       36

Or a faster solution using map with the aggregated Series or a dict:

s = AID.groupby('AID')['ANumOfF'].sum()
#a bit faster
#s = AID.groupby('AID')['ANumOfF'].sum().to_dict()
AID['ANumOfF'] = AID['AID'].map(s)
print (AID)
   AID FID  ANumOfF
0    1   X        6
1    1   Y        6
2    2   Z       36
3    2   A       36
4    2   X       36
5    2   B       36
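
To sanity-check both approaches, here is a minimal sketch in plain pandas (not dask, so it only illustrates the logic on the small sample from the question). It compares the join and map results against `transform('sum')`. The variable names `df`, `sums`, `via_map`, `via_dict` and `via_join` are made up for this illustration.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "AID": [1, 1, 2, 2, 2, 2],
    "FID": ["X", "Y", "Z", "A", "X", "B"],
    "ANumOfF": [1, 5, 6, 1, 11, 18],
})

# Reference result: per-row group sums via transform (pandas only).
expected = df.groupby("AID")["ANumOfF"].transform("sum")

# Aggregate once per group, then broadcast the sums back onto the rows.
sums = df.groupby("AID")["ANumOfF"].sum()
via_map = df["AID"].map(sums)             # map with a Series
via_dict = df["AID"].map(sums.to_dict())  # map with a plain dict

# Join variant: drop the original column, then join the sums on the AID index.
via_join = (
    df.set_index("AID")
      .drop("ANumOfF", axis=1)
      .join(sums)
      .reset_index()["ANumOfF"]
)

print(via_map.tolist())  # [6, 6, 36, 36, 36, 36]
```

All three variants give the same values as `transform('sum')` on this sample. Whether the same calls behave identically on a dask dataframe depends on the dask version, so treat this as a check of the idea rather than of dask itself.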
jezrael
  • Do you know how to map the results back to the dataframe for a multi-column groupby? I'd be happy to open this as another question if you think that's appropriate. – SummerEla Feb 14 '19 at 02:32

Dask now supports transform; however, there may be issues with indexes (depending on the original dataframe). See PR #5327.

So your code should work:

AID.groupby('AID')['ANumOfF'].transform('sum')

skibee