9

Here is my code:

import StringIO
from pandas import *
import numpy as np

df = read_csv(StringIO.StringIO('''Col1 Col2    A   B
A   D   1   6
A   E   2   7
B   D   3   8
B   E   4   9
C   D   5   19'''), delimiter='\t')


df['buc1'] = cut(df['A'], bins = [0, 2, 6, 8])

aggFunc = {'A': sum,
           'B': np.average
           }

After running:

df.groupby(['buc1']).agg(aggFunc)

I get:

         A     B
buc1            
(0, 2]   3   6.5
(2, 6]  12  12.0
(6, 8] NaN   NaN

My questions are:

  1. How do I get rid of the bottom (6, 8] bucket, given that there are no values there?
  2. How do I add a grand total row at the bottom, as in a pivot table in Excel?

For instance:

buc1    A   B
(0, 2]  3   6.5
(2, 6]  12  12
Total   15  9.8

Note that the total row for the second column would be the average, not the sum.

IanS
  • 15,771
  • 9
  • 60
  • 84
Jason
  • 131
  • 1
  • 1
  • 4
  • Are you asking how to get rid of NA rows generally, or if there is a method for the `agg` method to leave them out? – ako Oct 27 '15 at 17:35
  • I guess both if there are available. – Jason Oct 27 '15 at 17:40
  • Only the second question is interesting ;) It's curious how `groupby` keeps one of the bins generated by `cut` even though the resulting dataframe does not have this particular bin. – IanS Oct 27 '15 at 17:50

1 Answers1

8

To just drop the na records, you can use the .dropna() dataframe method.

df['buc1'] = df['buc1'].astype(object) 
result = df.groupby(['buc1']).agg(aggFunc).dropna()
result


         A     B
buc1            
(0, 2]   3   6.5
(2, 6]  12  12.0

As for the marginal totals, in theory this should work:

result_alternative = pd.pivot_table(df,index='buc1',
                                    aggfunc=aggFunc,
                                    values=['A','B'],
                                    margins=True)
result_alternative

         A     B
buc1            
(0, 2]   3   6.5
(2, 6]  12  12.0
All     15   9.8

But it throws an error related to , which I think is a bug that should be fixed so it may work on your machine (although I am running version 0.17)

That same link has a workaround related to the categorical variable--just cast it as an object first, and the above call should be good.

EDIT:

If you want to stay with the groupby semantics, you can add a row total like so:

rowtotal = result.sum()
rowtotal.name = 'rowtotal'
result.append(rowtotal)
Community
  • 1
  • 1
ako
  • 3,569
  • 4
  • 27
  • 38
  • But your solution for the second question would return the sum for both columns, whereas I think OP wants the sum in the first column and the average in the second (like his `aggfunc` does). – IanS Oct 27 '15 at 17:54
  • Also, any idea why the "phantom" bin appears in the `groupby` dataframe? – IanS Oct 27 '15 at 17:56
  • Thanks for the dropna solution. df.groupby(['buc1']).agg(aggFunc).dropna(how='all') works better. – Jason Oct 27 '15 at 17:58
  • 1
    @IanS, fixed the aggfunction parameter. And the `phantom bin` is because that was a part of the pd.cut bin call, and is a part of the `categorical` datatype – ako Oct 27 '15 at 18:00