I have a DataFrame structured similarly to the following (the real one has many more rows and columns).

In [2]: Ex   # The example DataFrame
Out[2]: 
       NameBef  v1B  v2B   v3B   v4B    NameAft  v1A  v2A   v3A   v4A
Id                                                                   
422   firstBef  133  145   534   745   FirstAft  212  543  2342  4563
862  secondBef  234  434   345  3453  SecondAft  643  493  3433   234
935   thirdBef  232  343  6454   463   thirdAft  423  753   754   743

For each row I want to calculate the quotient vXA / vXB for every pair of "after" and "before" columns (the X is a placeholder for the column number), to end up with a DataFrame like this one:

          v1Q       v2Q       v3Q       v4Q
Id                                         
422  1.593985  3.744828  4.385768  6.124832
862  2.747863  1.135945  9.950725  0.067767
935  1.823276  2.195335  0.116827  1.604752

Where each element is the quotient of the corresponding elements of the original data frame.

I haven't been able to figure out how to do this conveniently.

To be convenient, it should be enough to provide only the names of the first and last columns of the "before" and "after" values, i.e. 'v1B', 'v4B' and 'v1A', 'v4A' (rather than the name of every column).

The following is what I have come up with.

In [3]: C=Ex.columns
In [4]: C1B=C.get_loc('v1B')
In [5]: C2B=C.get_loc('v4B')
In [6]: C1A=C.get_loc('v1A')
In [7]: C2A=C.get_loc('v4A')
In [8]: FB=Ex.iloc[:,C1B:C2B+1]   # .iloc for positional slicing; .ix was removed in pandas 1.0
In [9]: FA=Ex.iloc[:,C1A:C2A+1]

In [10]: FB           # The FB and FA frames have this structure
Out[10]: 
     v1B  v2B   v3B   v4B
Id                       
422  133  145   534   745
862  234  434   345  3453
935  232  343  6454   463

[3 rows x 4 columns]

Finally, I produce the required DataFrame by doing the calculation on the NumPy arrays returned by DataFrame.values.

I got this method from this question/answer:

In [12]: DataFrame((FA.values*1.0) / FB.values,columns=['v1Q','v2Q','v3Q','v4Q'],index=Ex.index)
Out[12]: 
          v1Q       v2Q       v3Q       v4Q
Id                                         
422  1.593985  3.744828  4.385768  6.124832
862  2.747863  1.135945  9.950725  0.067767
935  1.823276  2.195335  0.116827  1.604752

[3 rows x 4 columns]

Am I missing something? I was hoping that I could achieve this in some much more direct way by doing some operation on the original DataFrame.

Is there no operation to do elementwise calculation directly on DataFrames instead of going via numpy arrays?

Wurdius

1 Answer

You could always use df.filter to select the relevant column names. It can accept a regular expression so you could specify the after/before columns with something like this:

>>> df.filter(regex=r'^v.A$').values / df.filter(regex=r'^v.B$').values
array([[ 1.59398496,  3.74482759,  4.38576779,  6.12483221],
       [ 2.74786325,  1.1359447 ,  9.95072464,  0.06776716],
       [ 1.82327586,  2.19533528,  0.11682677,  1.60475162]])
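To get back from the bare array to the labelled frame asked for in the question, the result can be wrapped in a DataFrame again. The following is only a minimal sketch: it rebuilds the example data by hand, and the variable names (`after`, `before`, `quotients`) and the way the `vXQ` column names are derived from the "after" names are assumptions made for illustration.

```python
import pandas as pd

# Rebuild the example frame from the question.
df = pd.DataFrame(
    {
        "NameBef": ["firstBef", "secondBef", "thirdBef"],
        "v1B": [133, 234, 232],
        "v2B": [145, 434, 343],
        "v3B": [534, 345, 6454],
        "v4B": [745, 3453, 463],
        "NameAft": ["FirstAft", "SecondAft", "thirdAft"],
        "v1A": [212, 643, 423],
        "v2A": [543, 493, 753],
        "v3A": [2342, 3433, 754],
        "v4A": [4563, 234, 743],
    },
    index=pd.Index([422, 862, 935], name="Id"),
)

# Select the "after" and "before" columns by regular expression.
after = df.filter(regex=r"^v.A$")
before = df.filter(regex=r"^v.B$")

# Divide the underlying arrays position by position, then re-attach labels.
# Assumption: each output name is the "after" name with the trailing "A"
# replaced by "Q" (v1A -> v1Q, and so on).
quotients = pd.DataFrame(
    after.to_numpy() / before.to_numpy(),
    index=df.index,
    columns=[name[:-1] + "Q" for name in after.columns],
)
print(quotients)
```

This relies on the "after" and "before" columns appearing in the same order in the original frame, since the division is purely positional.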

Regarding the arithmetic operation, you're not missing anything. Using NumPy arrays (.values) is needed here because otherwise pandas aligns the two DataFrames on their row and column labels and only computes values where the labels match. Where a label is present in only one of the frames, the result is NaN, and here none of the "after" column names match the "before" ones. NumPy arrays don't have labelled indexes, so the element-wise operation works purely by position and succeeds.
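The difference between the two behaviours can be seen on a cut-down version of the example. This is a small sketch with two hand-built frames (the names `before`, `after`, `aligned` and `raw` are only for illustration), comparing direct DataFrame division with division of the underlying arrays.

```python
import pandas as pd

# Two small frames with the same row labels but different column labels.
before = pd.DataFrame({"v1B": [133, 234], "v2B": [145, 434]}, index=[422, 862])
after = pd.DataFrame({"v1A": [212, 643], "v2A": [543, 493]}, index=[422, 862])

# DataFrame / DataFrame matches on labels. No column name occurs in both
# frames, so the result has all four columns and every entry is NaN.
aligned = after / before
print(aligned)

# Dividing the underlying arrays ignores labels and works by position.
raw = after.to_numpy() / before.to_numpy()
print(raw)
```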

Alex Riley
  • Wow that is certainly much more compact. Get rid of all this manual column extraction. I just need to pick column names that are easy to filter out in the way you describe. Also thanks for confirming that doing the numpy thing is correct. – Wurdius Jan 10 '15 at 19:56
  • No problem, glad to hear `filter` might help. – Alex Riley Jan 10 '15 at 20:03