
Question

What is an efficient way to carry out numerical operations on hierarchically indexed rows?

Problem

I have a large dataframe, over 1 GB, that is indexed by year and then by country code. A small subset is shown below. Each country has multiple observations per year. I'd like to take the average of each country's observations within a year and return an overall average. The desired end result is a dataframe indexed by year and country code, containing each country's yearly average.

Conceptually, I'd like to do something like:

df.ix[:,['x3yv_E', 'x3yv_D', 'x1yv_E', 'x1yv_D']].groupby(df.year).groupby(level=1).apply(lambda x: np.mean(x))
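
For reference, a minimal sketch of what that intent might look like once the frame carries the (year, country code) MultiIndex shown below; the column list is taken from the sample data, and this is only an illustration, not a tested solution:

cols = ['x3yv_E', 'x3yv_D', 'x1yv_E', 'x1yv_D']

# group on both index levels (level 0 = year, level 1 = country code)
# and average each group's observations
yearly_country_means = df[cols].groupby(level=[0, 1]).mean()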

Here's the dataset:

                  x3yv_E      x3yv_D      x1yv_E      x1yv_D
year                    
2003        12   0.000000    0.000000    0.000000    0.000000
            34   0.009953    0.001400    0.007823    0.000950
            12   0.010210    0.001136    0.008333    0.000722
            34   0.011143    0.006319    0.007520    0.006732
            72   0.018791    0.016717    0.018808    0.015206
2004        0    0.009115    0.000000    0.010243    0.000000
            38   0.009059    0.000932    0.010042    0.000573
            53   0.009618    0.001152    0.010724    0.000729
            70   0.000000    0.000000    0.000000    0.000000
            70   0.020655    0.018411    0.012965    0.011640

What I've tried

Benefits of panda's multiindex?

How to apply condition on level of pandas.multiindex?

Because of the dataframe's large size, I want to avoid loops and avoid copying the dataframe multiple times, which is what the solutions to the two questions above suggest.

Any ideas on an efficient solution? Thanks for taking a look!


1 Answer


Create the data

In [10]: from pandas import DataFrame

In [11]: from numpy.random import randn

In [12]: df = DataFrame(randn(10,4),columns=list('ABCD'))

In [13]: df['year'] = 2003

In [14]: df['id'] = [12,34,12,34,72,0,38,53,70,70]

In [16]: df.loc[:5,'year'] = 2004

In [17]: df
Out[17]: 
          A         B         C         D  year  id
0 -1.917262  0.228599 -0.463695  0.776567  2004  12
1  2.064658 -0.716104 -1.399685  0.402077  2004  34
2 -1.282627  0.338368  0.757658 -0.114086  2004  12
3  1.190319 -1.592282  0.942431 -0.778128  2004  34
4  1.928094  0.532387 -0.352155 -0.039304  2004  72
5  0.535093 -1.655569 -0.309651  0.438992  2004   0
6  0.332428 -0.427696 -1.324072  2.158907  2003  38
7 -1.343306 -0.288373  0.544344 -1.361189  2003  53
8  0.959273 -0.420134  0.691108 -0.469833  2003  70
9  0.692352  0.101226 -0.161140 -0.100968  2003  70

Groupby year and id, then mean

In [21]: df.groupby(['year','id']).mean()
Out[21]: 
                A         B         C         D
year id                                        
2003 38  0.332428 -0.427696 -1.324072  2.158907
     53 -1.343306 -0.288373  0.544344 -1.361189
     70  0.825812 -0.159454  0.264984 -0.285401
2004 0   0.535093 -1.655569 -0.309651  0.438992
     12 -1.599944  0.283483  0.146981  0.331241
     34  1.627488 -1.154193 -0.228627 -0.188025
     72  1.928094  0.532387 -0.352155 -0.039304
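
Since the frame in the question is already indexed by year and country code, the same aggregation can also be expressed directly on the index levels, avoiding a reset_index or an extra copy; a minimal sketch, assuming a two-level MultiIndex with year as the outer level:

# on a MultiIndexed frame, group on the index levels themselves
result = df.groupby(level=[0, 1]).mean()   # per-country average within each year
per_year = df.groupby(level=0).mean()      # overall average per year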

By year mean

In [24]: df.groupby(['year']).mean()
Out[24]: 
             A         B         C         D         id
year                                                   
2003  0.160187 -0.258744 -0.062440  0.056729  57.750000
2004  0.419713 -0.477434 -0.137516  0.114353  27.333333

By id

In [25]: df.groupby(['id']).mean()
Out[25]: 
           A         B         C         D  year
id                                              
0   0.535093 -1.655569 -0.309651  0.438992  2004
12 -1.599944  0.283483  0.146981  0.331241  2004
34  1.627488 -1.154193 -0.228627 -0.188025  2004
38  0.332428 -0.427696 -1.324072  2.158907  2003
53 -1.343306 -0.288373  0.544344 -1.361189  2003
70  0.825812 -0.159454  0.264984 -0.285401  2003
72  1.928094  0.532387 -0.352155 -0.039304  2004
  • FYI, you also have the option of storing to ``HDFStore``, then pulling it back in, say, grouping by years (or countries), so the in-memory footprint is smaller (it's a bit more manual this way), but in theory you could handle unlimited data (though 1 GB is not much these days :). Let me know if you want to explore this solution – Jeff Jul 11 '13 at 16:56
  • Thanks Jeff, I was overcomplicating the operation in my head. You're right, 1 GB isn't much compared to actual large data sets, but holding many copies in memory can definitely slow my personal machine down. I'm familiar with `HDFStore` from some basic exposure to `Hadoop`, but I'm not familiar with using it with `Pandas`. Out of interest, can you point me to a place to take a look? Thanks! – agconti Jul 11 '13 at 17:47
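
For what it's worth, a rough sketch of the HDFStore workflow Jeff alludes to; the file name, store key, and year-by-year chunking are assumptions for illustration (PyTables must be installed), not part of the original answer:

import pandas as pd

# write the large frame to disk once, making 'year' a queryable data column
with pd.HDFStore('observations.h5') as store:
    store.append('df', df, data_columns=['year'])

# later, pull back one year at a time, aggregate, and combine the pieces,
# so the full frame never has to sit in memory at once
pieces = []
with pd.HDFStore('observations.h5') as store:
    for year in (2003, 2004):
        chunk = store.select('df', where=f'year == {year}')
        pieces.append(chunk.groupby(['year', 'id']).mean())

result = pd.concat(pieces)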