I am looking to migrate a statistical analysis project to pandas. I would like to rank 3 columns over a rolling window of N days. I have found methods to do this as answered in this question [rank-data-over-a-rolling-window][1], but the performance isn't adequate for my data set (45K rows). The fastest way I have found is to use the bottleneck library or numpy argsort as below. This has dramatically improved performance, but is still some way off when compared to the rolling_mean function which should have similar performance.
EDIT: I have updated the below code to provide a reproducible example with timings. The series rank function is the most flexible allowing me to choose how to rank ties, but is very slow. The best two I can find are the bottleneck method or argsort. Both are comparable in performance but are restrictive on their handling of ties. However both are still considerably slower when compared to rolling mean?
rollWindow = 240
df = pd.DataFrame(np.random.randn(100000,4), columns=list('ABCD'), index=pd.date_range('1/1/2000', periods=100000, freq='1H'))
df.iloc[-3:-1]['A'] = 7.5
df.iloc[-1]['A'] = 5.5
df["SER_RK"] = pd.rolling_apply(df["A"], rollWindow, rollingRankOnSeries)
# 28.9secs (allows competition/min ranking for ties)
df["SCIPY_RK"] = pd.rolling_apply(df["A"], rollWindow, rollingRankSciPy)
# 70.89secs (allows competition/min ranking for ties)
df["BNECK_RK"] = pd.rolling_apply(df["A"], rollWindow, rollingRankBottleneck)
# 3.64secs (only provides average ranking for ties)
df["ASRT_RK"] = pd.rolling_apply(df["A"], rollWindow, rollingRankArgSort)
# 3.56secs (only provides competition/min ranking for ties)
df["MEAN"] = pd.rolling_mean(df['A'], window=rollWindow)
# 0.008secs
def rollingRankOnSeries (array):
s = pd.Series(array)
return s.rank(method='min', ascending=False)[len(s)-1]
def rollingRankSciPy (array):
return array.size + 1 - sc.rankdata(array)[-1]
def rollingRankBottleneck (array):
return array.size + 1 - bd.rankdata(array)[-1]
def rollingRankArgSort (array):
return array.size - array.argsort().argsort()[-1]
A SER_RK SCIPY_RK BNECK_RK ASRT_RK MEAN
2011-05-29 11:00:00 1.37 23 23.0 23.0 23 0.013526
2011-05-29 12:00:00 0.45 85 85.0 85.0 85 0.016833
2011-05-29 13:00:00 7.50 1 1.0 1.0 1 0.049606
2011-05-29 14:00:00 7.50 1 1.5 1.5 1 0.083655
2011-05-29 15:00:00 5.50 3 3.0 3.0 3 0.112001
I have previously implemented moving window statistics by maintaining the difference between each window (online) to easily calculate the change in rank where as it appears I currently have to completely re-rank every window, which is unnecessary. I have seen that a similar question has been asked previously [Pandas performance on rolling stats][2].
- Do you know if there is a way in pandas to perform this calculation more efficiently?
- Is there an easy way to implement a function on a moving window in pandas where I can find the element(s) added and removed for each step and return a value accordingly, possibly maintaining my own running rank calculation?
Thanks
[1]: http://stackoverflow.com/questions/14440187/rank-data-over-a-rolling-window-in-pandas-dataframe
[2]: http://stackoverflow.com/questions/24613850/pandas-performance-on-multiple-rolling-statistics-on-different-time-intervals