2

If I have these columns in a dataframe:

a     b  
1     5   
1     7
2     3
1,2   3
2     5

How do I create column c where column b is summed using groupings of column a (string), keeping the existing dataframe. Some rows can belong to more than one group.

a     b     c
1     5     15
1     7     15
2     3     11
1,2   3     26
2     5     11

Is there an easy and efficient solution as the dataframe I have is very large.

jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
mohitos
  • 131
  • 1
  • 1
  • 6

1 Answers1

2

You can first need split column a and join it to original DataFrame:

print (df.a.str.split(',', expand=True)
               .stack()
               .reset_index(level=1, drop=True)
               .rename('a'))
0    1
1    1
2    2
3    1
3    2
4    2
Name: a, dtype: object

df1 = df.drop('a', axis=1)
        .join(df.a.str.split(',', expand=True)
                      .stack()
                      .reset_index(level=1, drop=True)
                      .rename('a'))
print (df1)
   b  a
0  5  1
1  7  1
2  3  2
3  3  1
3  3  2
4  5  2

Then use transform for sum without aggragation.

df1['c'] = df1.groupby(['a'])['b'].transform(sum)
#cast for aggreagation join working with strings
df1['a'] = df1.a.astype(str)
print (df1)
   b  a   c
0  5  1  15
1  7  1  15
2  3  2  11
3  3  1  15
3  3  2  11
4  5  2  11

Last groupby by index and aggregate columns by agg:

print (df1.groupby(level=0)
          .agg({'a':','.join,'b':'first' ,'c':sum})
          [['a','b','c']] )

     a  b   c
0    1  5  15
1    1  7  15
2    2  3  11
3  1,2  3  26
4    2  5  11          
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • For first step can be used already [piRSquared solution](http://stackoverflow.com/a/38652414/2901002). – jezrael Aug 29 '16 at 15:15