3

I have two dataframes: one has multi levels of columns, and another has only single level column (which is the first level of the first dataframe, or say the second dataframe is calculated by grouping the first dataframe).

These two dataframes look like the following:

first dataframe-df1 second dataframe-df2 The relationship between df1 and df2 is:

df2 = df1.groupby(axis=1, level='sector').mean()

Then, I get the index of rolling_max of df1 by:

result1=pd.rolling_apply(df1,window=5,func=lambda x: pd.Series(x).idxmax(),min_periods=4)

Let me explain result1 a little bit. For example, during the five days (window length) 2016/2/23 - 2016/2/29, the max price of the stock sh600870 happened in 2016/2/24, the index of 2016/2/24 in the five-day range is 1. So, in result1, the value of stock sh600870 in 2016/2/29 is 1.

Now, I want to get the sector price for each stock by the index in result1.

Let's take the same stock as example, the stock sh600870 is in sector ’家用电器视听器材白色家电‘. So in 2016/2/29, I wanna get the sector price in 2016/2/24, which is 8.770.

How can I do that?

feetwet
  • 3,248
  • 7
  • 46
  • 84
April
  • 31
  • 1
  • 4
  • 1
    Welcome to SO. It would be helpful if you inserted the dataframes as text in your question (you can edit it). Please follow this link for helpful information on how to ask `pandas` questions: http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples – IanS May 17 '16 at 09:32

1 Answers1

2

idxmax (or np.argmax) returns an index which is relative to the rolling window. To make the index relative to df1, add the index of the left edge of the rolling window:

index = pd.rolling_apply(df1, window=5, min_periods=4, func=np.argmax)
shift = pd.rolling_min(np.arange(len(df1)), window=5, min_periods=4)
index = index.add(shift, axis=0)

Once you have ordinal indices relative to df1, you can use them to index into df1 or df2 using .iloc.


For example,

import numpy as np
import pandas as pd
np.random.seed(2016)
N = 15
columns = pd.MultiIndex.from_product([['foo','bar'], ['A','B']])
columns.names = ['sector', 'stock']
dates = pd.date_range('2016-02-01', periods=N, freq='D')
df1 = pd.DataFrame(np.random.randint(10, size=(N, 4)), columns=columns, index=dates)
df2 = df1.groupby(axis=1, level='sector').mean()

window_size, min_periods = 5, 4
index = pd.rolling_apply(df1, window=window_size, min_periods=min_periods, func=np.argmax)
shift = pd.rolling_min(np.arange(len(df1)), window=window_size, min_periods=min_periods)
# alternative, you could use
# shift = np.pad(np.arange(len(df1)-window_size+1), (window_size-1, 0), mode='constant')
# but this is harder to read/understand, and therefore it maybe more prone to bugs.
index = index.add(shift, axis=0)

result = pd.DataFrame(index=df1.index, columns=df1.columns)
for col in index:
    sector, stock = col
    mask = pd.notnull(index[col])
    idx = index.loc[mask, col].astype(int)
    result.loc[mask, col] = df2[sector].iloc[idx].values

print(result)

yields

sector      foo       bar     
stock         A    B    A    B
2016-02-01  NaN  NaN  NaN  NaN
2016-02-02  NaN  NaN  NaN  NaN
2016-02-03  NaN  NaN  NaN  NaN
2016-02-04  5.5    5    5  7.5
2016-02-05  5.5    5    5  8.5
2016-02-06  5.5  6.5    5  8.5
2016-02-07  5.5  6.5    5  8.5
2016-02-08  6.5  6.5    5  8.5
2016-02-09  6.5  6.5  6.5  8.5
2016-02-10  6.5  6.5  6.5    6
2016-02-11    6  6.5  4.5    6
2016-02-12    6  6.5  4.5    4
2016-02-13    2  6.5  4.5    5
2016-02-14    4  6.5  4.5    5
2016-02-15    4  6.5    4  3.5

Note in pandas 0.18 the rolling_apply syntax was changed. DataFrames and Series now have a rolling method, so that now you would use:

index = df1.rolling(window=window_size, min_periods=min_periods).apply(np.argmax)
shift = (pd.Series(np.arange(len(df1)))
         .rolling(window=window_size, min_periods=min_periods).min())
index = index.add(shift.values, axis=0)
unutbu
  • 842,883
  • 184
  • 1,785
  • 1,677