6

I'm trying to identify the best way to make a simple pivot on my data:

import pandas    
dfn = pandas.DataFrame({
    "A" : [ 'aaa', 'bbb', 'aaa', 'bbb' ],
    "B" : [     1,    10,     2,   30  ],
    "C" : [     2,     0,     3,   20  ] })

The output I would like to have is a dataframe, grouped by A, that sum and count values of B and C, and names have to be exactly (Sum_B, Sum_C, Count), as following:

A   Sum_B  Sum_C  Count
aaa    3      5       2
bbb   50     20       2

What is the fastest way to do this?

jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
DPColombotto
  • 159
  • 1
  • 3
  • 11

3 Answers3

5

you can use .agg() function:

In [227]: dfn.groupby('A').agg({'B':sum, 'C':sum, 'A':'count'}).rename(columns={'A':'count'})
Out[227]:
      B  count   C
A
aaa   3      2   5
bbb  40      2  20

or with reset_index():

In [239]: dfn.groupby('A').agg({'B':sum, 'C':sum, 'A':'count'}).rename(columns={'A':'count'}).reset_index()
Out[239]:
     A   B  count   C
0  aaa   3      2   5
1  bbb  40      2  20

PS Here is a link to examples provided by @evan54

Community
  • 1
  • 1
MaxU - stand with Ukraine
  • 205,989
  • 36
  • 386
  • 419
  • 1
    didn't know about `agg` (aggregate), this is a nice description http://www.shanelynn.ie/summarising-aggregation-and-grouping-data-in-python-pandas/ – evan54 Jun 22 '16 at 10:42
  • I add also .reset_index() in order to "move" 'A' to a column, since i have to write a csv – DPColombotto Jun 22 '16 at 10:48
3

I think you can use aggregate with double sum and len, rename columns and last change order of columns by reindex_axis:

print (dfn.groupby('A')
          .agg({'B': sum, 'C':sum, 'A':len})
          .rename(columns={'A': 'Count', 'B': 'Sum_B', 'C': 'Sum_C'})
          .reindex_axis(['Sum_B','Sum_C','Count'], axis=1))  

     Sum_B  Sum_C  Count
A                       
aaa      3      5      2
bbb     40     20      2

If need reset_index:

print (dfn.groupby('A')
          .agg({'B': sum, 'C':sum, 'A':len})
          .rename(columns={'A': 'Count', 'B': 'Sum_B', 'C': 'Sum_C'})
          .reindex_axis(['Sum_B','Sum_C','Count'], axis=1)
          .reset_index())  

     A  Sum_B  Sum_C  Count
0  aaa      3      5      2
1  bbb     40     20      2

Instead len you can use size, what is same:

print (dfn.groupby('A')
          .agg({'B': sum, 'C':sum, 'A':'size'})
          .rename(columns={'A': 'Count', 'B': 'Sum_B', 'C': 'Sum_C'})
          .reindex_axis(['Sum_B','Sum_C','Count'], axis=1)
          .reset_index())  

     A  Sum_B  Sum_C  Count
0  aaa      3      5      2
1  bbb     40     20      2      

Differences between size and count (see differences with numeric values):

Community
  • 1
  • 1
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
1

One way to do this is to group by the column you want and then sum as shown below.

r = dfn.groupby('A').sum()
r['counts'] = dfn.A.value_counts() # this could also be written as dfn['A'].value_counts()

The first line computes the sums and the second line gets the number of times each of the values appears in A.

evan54
  • 3,585
  • 5
  • 34
  • 61