pandas groupby aggregate with grand total in the bottom

Question

Here is my code:

import StringIO
from pandas import *
import numpy as np

df = read_csv(StringIO.StringIO('''Col1 Col2    A   B
A   D   1   6
A   E   2   7
B   D   3   8
B   E   4   9
C   D   5   19'''), delimiter='\t')


df['buc1'] = cut(df['A'], bins = [0, 2, 6, 8])

aggFunc = {'A': sum,
           'B': np.average
           }

After running:

df.groupby(['buc1']).agg(aggFunc)

I get:

         A     B
buc1            
(0, 2]   3   6.5
(2, 6]  12  12.0
(6, 8] NaN   NaN

My questions are:

How do I get rid of the bottom (6, 8] bucket, given that there are no values there?
How do I add a grand total row at the bottom, as in a pivot table in Excel?

For instance:

buc1    A   B
(0, 2]  3   6.5
(2, 6]  12  12
Total   15  9.8

Note that the total row for the second column would be the average, not the sum.

Are you asking how to get rid of NA rows generally, or if there is a method for the `agg` method to leave them out? — ako, Oct 27 '15 at 17:35
Only the second question is interesting ;) It's curious how `groupby` keeps one of the bins generated by `cut` even though the resulting dataframe does not have this particular bin. — IanS, Oct 27 '15 at 17:50

score 8 · Answer 1 · edited May 23 '17 at 12:08

8

To just drop the na records, you can use the .dropna() dataframe method.

df['buc1'] = df['buc1'].astype(object) 
result = df.groupby(['buc1']).agg(aggFunc).dropna()
result


         A     B
buc1            
(0, 2]   3   6.5
(2, 6]  12  12.0

As for the marginal totals, in theory this should work:

result_alternative = pd.pivot_table(df,index='buc1',
                                    aggfunc=aggFunc,
                                    values=['A','B'],
                                    margins=True)
result_alternative

         A     B
buc1            
(0, 2]   3   6.5
(2, 6]  12  12.0
All     15   9.8

But it throws an error related to , which I think is a bug that should be fixed so it may work on your machine (although I am running version 0.17)

That same link has a workaround related to the categorical variable--just cast it as an object first, and the above call should be good.

EDIT:

If you want to stay with the groupby semantics, you can add a row total like so:

rowtotal = result.sum()
rowtotal.name = 'rowtotal'
result.append(rowtotal)

edited May 23 '17 at 12:08

Community

1
1

answered Oct 27 '15 at 17:51

ako

3,569
4
27
38

But your solution for the second question would return the sum for both columns, whereas I think OP wants the sum in the first column and the average in the second (like his `aggfunc` does). – IanS Oct 27 '15 at 17:54
Also, any idea why the "phantom" bin appears in the `groupby` dataframe? – IanS Oct 27 '15 at 17:56
Thanks for the dropna solution. df.groupby(['buc1']).agg(aggFunc).dropna(how='all') works better. – Jason Oct 27 '15 at 17:58
1

@IanS, fixed the aggfunction parameter. And the `phantom bin` is because that was a part of the pd.cut bin call, and is a part of the `categorical` datatype – ako Oct 27 '15 at 18:00

pandas groupby aggregate with grand total in the bottom

1 Answers1