
I stumbled across pandas and it looks ideal for simple calculations that I'd like to do. I have a SAS background and was thinking it'd replace proc freq -- it looks like it'll scale to what I may want to do in the future. However, I just can't seem to get my head around a simple task (I'm not sure if I'm supposed to look at pivot/crosstab/indexing - whether I should have a Panel or DataFrames etc...). Could someone give me some pointers on how to do the following:

I have two CSV files (one for year 2010, one for year 2011 - simple transactional data). The columns are category and amount:

2010:

AB,100.00
AB,200.00
AC,150.00
AD,500.00

2011:

AB,500.00
AC,250.00
AX,900.00

These are loaded into separate DataFrame objects.

What I'd like to do is get the category, the sum of the amounts for that category, and the frequency of that category, e.g.:

2010:

AB,300.00,2
AC,150.00,1
AD,500.00,1

2011:

AB,500.00,1
AC,250.00,1
AX,900.00,1

I can't work out whether I should be using pivot/crosstab/groupby/an index etc... I can get either the sum or the frequency - I can't seem to get both... It gets a bit more complex because I would like to do it on a month by month basis, but I think if someone would be so kind to point me to the right technique/direction I'll be able to go from there.

Jon Clements
  • So are you saying that each `.csv` file is just a single row, and then in that single row the first value is the year followed by the data as you present it above? – benjaminmgross Mar 07 '12 at 15:08
  • Hi Factor3, that's just the way S/O decided to format it (first time I've used it, so will have to look out for that in future)... Let me clarify... there's two files - 2010.csv and 2011.csv; these contain 'n' many rows each of which contain two columns. I was trying to simplify the question - but do agree the formatting is somewhat misleading now that I've read it back! – Jon Clements Mar 08 '12 at 00:10
  • I've provided several detailed examples and alternative approaches in this [**Q&A**](https://stackoverflow.com/q/47152691/2336654) that you or others might find helpful. – piRSquared Nov 11 '17 at 22:21

2 Answers


v0.21 answer

Use pivot_table with the index parameter:

df.pivot_table(index='category', aggfunc=[len, sum])

           len   sum
         value value
category            
AB           2   300
AC           1   150
AD           1   500

<= v0.12

It is possible to do this using pivot_table for those interested:

In [8]: df
Out[8]: 
  category  value
0       AB    100
1       AB    200
2       AC    150
3       AD    500

In [9]: df.pivot_table(rows='category', aggfunc=[len, np.sum])
Out[9]: 
            len    sum
          value  value
category              
AB            2    300
AC            1    150
AD            1    500

Note that the result's columns are hierarchically indexed. If you had multiple data columns, you would get a result like this:

In [12]: df
Out[12]: 
  category  value  value2
0       AB    100       5
1       AB    200       5
2       AC    150       5
3       AD    500       5

In [13]: df.pivot_table(rows='category', aggfunc=[len, np.sum])
Out[13]: 
            len            sum        
          value  value2  value  value2
category                              
AB            2       2    300      10
AC            1       1    150       5
AD            1       1    500       5

The main reason to use np.sum rather than __builtin__.sum is that you get NA-handling from np.sum. Probably could intercept the Python built-in; will make a note about that now.
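For illustration, a minimal sketch of that difference on a Series containing a missing value (just the two sum functions, nothing specific to pivot_table):

import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 2.0])

np.sum(s)  # 3.0 -- defers to the Series' own sum, which skips NaN
sum(s)     # nan -- the Python built-in adds element by element, so NaN propagates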

Wes McKinney

Assuming that you have a file called 2010.csv with contents

category,value
AB,100.00
AB,200.00
AC,150.00
AD,500.00

Then, using the ability to apply multiple aggregation functions following a groupby, you can say:

import pandas
data_2010 = pandas.read_csv("/path/to/2010.csv")
data_2010.groupby("category").agg([len, sum])

You should get a result that looks something like

          value     
            len  sum
category            
AB            2  300
AC            1  150
AD            1  500

Note that Wes will likely come by to point out that np.sum is optimized and that you should probably use it rather than the built-in sum.
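The month-by-month breakdown mentioned in the question works the same way once there is a date to group on. A minimal sketch, assuming (hypothetically - the sample files above do not have one) that each row also carries a date column, and a reasonably recent pandas:

import pandas

# Hypothetical 'date' column, e.g. "2010-03-15"; not present in the sample data
data_2010 = pandas.read_csv("/path/to/2010.csv", parse_dates=["date"])
data_2010["month"] = data_2010["date"].dt.month

# Same sum/frequency aggregation, grouped by month as well as category
data_2010.groupby(["month", "category"])["value"].agg(["sum", "count"])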

Jeff Hammerbacher
  • That's the push I needed - TY. I was trying all sorts with pivot_table(data_2010, rows='???', aggfunc={'???': '???'}) etc... I had a feeling I was over-complicating the problem. Thanks again. – Jon Clements Mar 08 '12 at 19:07