I have a MultiIndex pandas DataFrame of this schematic form (although the real dataframe I'm working with has millions of rows):
import pandas as pd

df = pd.DataFrame(
    [['Alpha', 'a', 1, 10], ['Alpha', 'a', 2, 20], ['Alpha', 'a', 3, 30],
     ['Alpha', 'b', 1, 50], ['Alpha', 'b', 2, 60], ['Alpha', 'b', 3, 10],
     ['Alpha', 'c', 1, 10], ['Alpha', 'c', 2, 70], ['Alpha', 'c', 3, 80],
     ['Beta', 'a', 1, 30], ['Beta', 'a', 2, 40], ['Beta', 'a', 3, 50],
     ['Beta', 'b', 1, 80], ['Beta', 'b', 2, 70], ['Beta', 'b', 3, 60],
     ['Beta', 'c', 1, 10], ['Beta', 'c', 2, 30], ['Beta', 'c', 3, 40]],
    columns=['Product', 'Scenario', 'Time', 'Price'],
).set_index(['Product', 'Scenario'])
Grouping by product and time, I would like to compute the rank autocorrelation of 'Price' across scenarios, defined schematically as follows (product by product):
(rank autocorrelation at time t) = rank correlation('Price' at time t , 'Price' at time t+1)
where ('Price' at time t) is a vector with as many entries as scenarios (and we have one such vector for each product). For example, for product 'Alpha' at time t=2, the desired quantity is
from scipy import stats as ss

# Spearman rank correlation between the prices at t=2 and at t=3, across scenarios a, b, c
ss.spearmanr([20, 60, 70], [30, 10, 80])[0]
which gives 0.5. For the last time step (t=3 in this case) it should simply give NaN back (because there's no data for t+1). All in all, the desired resulting dataframe should be:
import numpy as np

df_result = pd.DataFrame(
    [['Alpha', 1, 0], ['Alpha', 2, 0.5], ['Alpha', 3, np.nan],
     ['Beta', 1, 1], ['Beta', 2, 1], ['Beta', 3, np.nan]],
    columns=['Product', 'Time', 'Autocorrelation'],
).set_index(['Product'])
         Time  Autocorrelation
Product
Alpha       1              0.0
Alpha       2              0.5
Alpha       3              NaN
Beta        1              1.0
Beta        2              1.0
Beta        3              NaN
I thought to start with something like
df.groupby(['Product', 'Time'])['Price'].agg(autocorrelation_function)  # autocorrelation_function is a placeholder, not a real function
but then the aggregation operation requires data from two different groups (i.e. groups for different times), and I haven't found a way to do that so far. Any ideas?
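One way to sidestep the cross-group problem might be to avoid `agg` altogether: reshape so that each (Product, Time) pair is a row and each scenario is a column, rank across scenarios, and correlate each row with the next row of the same product. The sketch below is untested on the real data and rests on a few assumptions: every scenario has exactly one price at every time step, times are consecutive once sorted, and ties are ranked by their average (the pandas default for `rank`, which matches how `spearmanr` handles ties). The names `wide`, `ranks`, `next_ranks`, and `result` are just illustrative.

```python
import pandas as pd

# Same toy data as above, rebuilt so the sketch runs on its own.
rows = [
    ['Alpha', 'a', 1, 10], ['Alpha', 'a', 2, 20], ['Alpha', 'a', 3, 30],
    ['Alpha', 'b', 1, 50], ['Alpha', 'b', 2, 60], ['Alpha', 'b', 3, 10],
    ['Alpha', 'c', 1, 10], ['Alpha', 'c', 2, 70], ['Alpha', 'c', 3, 80],
    ['Beta', 'a', 1, 30], ['Beta', 'a', 2, 40], ['Beta', 'a', 3, 50],
    ['Beta', 'b', 1, 80], ['Beta', 'b', 2, 70], ['Beta', 'b', 3, 60],
    ['Beta', 'c', 1, 10], ['Beta', 'c', 2, 30], ['Beta', 'c', 3, 40],
]
df = pd.DataFrame(
    rows, columns=['Product', 'Scenario', 'Time', 'Price']
).set_index(['Product', 'Scenario'])

# One row per (Product, Time), one column per Scenario.
# Assumes each (Product, Scenario, Time) combination occurs exactly once.
wide = df.set_index('Time', append=True)['Price'].unstack('Scenario')

# Rank the prices across scenarios within each row (ties get the average rank).
ranks = wide.rank(axis=1)

# Ranks at t+1 aligned with row t; the last time step of each product becomes NaN.
next_ranks = ranks.groupby(level='Product').shift(-1)

# Row-wise Pearson correlation of ranks, which is the Spearman rank correlation.
autocorr = ranks.corrwith(next_ranks, axis=1).rename('Autocorrelation')

# Move 'Time' back into a column to match the desired layout.
result = autocorr.reset_index(level='Time')
print(result)
```

Since everything here is a whole-frame operation rather than a Python-level function applied per group, it should scale reasonably to millions of rows, though that has not been measured. If some scenarios are missing at some time steps, the ranks at t and t+1 would be computed over different sets of scenarios, so this sketch would need adjusting.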