
I have a MultiIndex pandas DataFrame of this schematic form (although the real dataframe I'm working with has millions of rows):

import pandas as pd

df = pd.DataFrame([['Alpha', 'a', 1,10], ['Alpha', 'a', 2,20],['Alpha', 'a', 3,30],
                   ['Alpha', 'b', 1,50],['Alpha', 'b', 2,60],['Alpha', 'b', 3,10],
                   ['Alpha', 'c', 1,10],['Alpha', 'c', 2,70],['Alpha', 'c', 3,80], 
                   ['Beta', 'a', 1,30], ['Beta', 'a', 2,40],['Beta', 'a', 3,50],
                   ['Beta', 'b', 1,80],['Beta', 'b', 2,70], ['Beta', 'b', 3,60],
                   ['Beta', 'c', 1,10],['Beta', 'c', 2,30], ['Beta', 'c', 3,40]], 
                    columns=['Product', 'Scenario', 'Time','Price'])\
                  .set_index(['Product', 'Scenario'])

Grouping by product and time, I would like to compute the rank autocorrelation of 'Price' across scenarios, defined schematically as follows (product by product):

(rank autocorrelation at time t) = rank correlation('Price' at time t , 'Price' at time t+1)

where ('Price' at time t) is a vector with as many entries as scenarios (and we have one such vector for each product). For example, for product 'Alpha' at time t=2, the desired quantity is

import scipy
from scipy import stats as ss

ss.spearmanr([20,60,70] , [30,10,80])[0] 

which gives 0.5. For the last time step (t=3 in this case) it should simply give NaN back (because there's no data for t+1). All in all, the desired resulting dataframe should be:

import numpy as np
df_result =  pd.DataFrame([['Alpha',1,0], ['Alpha',2,0.5],['Alpha',3,np.nan],                        
                           ['Beta',1,1], ['Beta',2,1],['Beta',3,np.nan]], 
                           columns=['Product', 'Time','Autocorrelation'])\
                          .set_index(['Product'])

        Time    Autocorrelation
Product     
Alpha      1                0.0
Alpha      2                0.5
Alpha      3                NaN
Beta       1                1.0
Beta       2                1.0
Beta       3                NaN
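
As a spot check, the 'Alpha', t=2 entry above can be reproduced directly from df with the same spearmanr call, pulling the cross-scenario 'Price' vectors at t=2 and t=3:

alpha = df.loc['Alpha']
ss.spearmanr(alpha.loc[alpha['Time'] == 2, 'Price'],
             alpha.loc[alpha['Time'] == 3, 'Price'])[0]   # 0.5, as in the table above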

I thought of starting with something like

df.groupby(['Product','Time']).agg(autocorrelation function)

but then the aggregation operation requires data from two different groups (i.e. groups for different times), and I haven't found a way to do that so far. Any ideas?
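
For concreteness, the kind of computation I'm after could in principle be written as a per-product apply like the sketch below (untested, and presumably too slow for millions of rows): within each product, pivot the scenarios onto the rows and the times onto the columns, then Spearman-correlate each time column with the next one, leaving the last time step as NaN. I'm hoping there is a cleaner or faster way.

import numpy as np
import pandas as pd
from scipy import stats as ss

def rank_autocorr(group):
    # scenarios on the rows, times on the columns, for a single product
    wide = group.reset_index().pivot(index='Scenario', columns='Time', values='Price')
    out = pd.Series(np.nan, index=wide.columns, name='Autocorrelation')
    for t, t_next in zip(wide.columns[:-1], wide.columns[1:]):
        # rank correlation of the cross-scenario 'Price' vectors at t and t+1
        out[t] = ss.spearmanr(wide[t], wide[t_next])[0]
    return out  # the last time step stays NaN

# Product rows, Time columns; stacking with dropna=False would give the
# (Product, Time, Autocorrelation) layout of df_result above
autocorr = df.groupby(level='Product').apply(rank_autocorr)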

Jon
  • Like so? https://stackoverflow.com/questions/26083293/calculating-autocorrelation-of-pandas-dataframe-along-each-column, specifically `df.Price.autocorr(lag = 1)` – Evan Jan 23 '18 at 01:35
  • Unfortunately, as far as I can see df.autocorr() does not apply to rank (Spearman) correlation. Moreover, I need to calculate one autocorrelation for each time step aggregating over scenarios (i.e. the vectors that enter the autocorrelation span across scenarios, for times t and t+1). – Jon Jan 23 '18 at 04:32
  • You can shift the `Price`: `df['price_shifted'] = df.Price.shift(-1)`, then use a Spearman correlation: `df.groupby(['Product', 'Time']).corr(method = 'spearman')`. However, this does not return `NaN`s correctly, and I'm not sure why. – Evan Jan 23 '18 at 17:36
  • https://stackoverflow.com/questions/28988627/pandas-correlation-groupby h/t @JohnE – Evan Jan 23 '18 at 17:37

0 Answers