
I am looking to migrate a statistical analysis project to pandas. I would like to rank 3 columns over a rolling window of N days. I have found methods to do this, as answered in this question: [rank-data-over-a-rolling-window][1], but the performance isn't adequate for my data set (45K rows). The fastest approaches I have found use the bottleneck library or numpy's argsort, as below. These dramatically improve performance, but are still some way off the rolling_mean function, which should have similar performance.

EDIT: I have updated the code below to provide a reproducible example with timings. The Series rank function is the most flexible, letting me choose how ties are ranked, but it is very slow. The best two I can find are the bottleneck method and argsort. Both are comparable in performance but are restrictive in their handling of ties. However, both are still considerably slower than rolling mean.

import numpy as np
import pandas as pd
import scipy.stats as sc
import bottleneck as bd

def rollingRankOnSeries(array):
    # Series.rank lets me choose the tie method, but is by far the slowest
    s = pd.Series(array)
    return s.rank(method='min', ascending=False)[len(s) - 1]

def rollingRankSciPy(array):
    # rankdata returns ascending ranks; convert to a descending rank
    return array.size + 1 - sc.rankdata(array)[-1]

def rollingRankBottleneck(array):
    # same conversion, but bottleneck's rankdata is much faster
    return array.size + 1 - bd.rankdata(array)[-1]

def rollingRankArgSort(array):
    # double argsort gives ascending ranks of the last element; flip to descending
    return array.size - array.argsort().argsort()[-1]

rollWindow = 240
df = pd.DataFrame(np.random.randn(100000, 4), columns=list('ABCD'),
                  index=pd.date_range('1/1/2000', periods=100000, freq='1H'))
df.iloc[-3:-1, 0] = 7.5  # inject ties into column 'A' to show tie handling
df.iloc[-1, 0] = 5.5

df["SER_RK"] = pd.rolling_apply(df["A"], rollWindow, rollingRankOnSeries)
# 28.9 secs (allows competition/min ranking for ties)

df["SCIPY_RK"] = pd.rolling_apply(df["A"], rollWindow, rollingRankSciPy)
# 70.89 secs (allows competition/min ranking for ties)

df["BNECK_RK"] = pd.rolling_apply(df["A"], rollWindow, rollingRankBottleneck)
# 3.64 secs (only provides average ranking for ties)

df["ASRT_RK"] = pd.rolling_apply(df["A"], rollWindow, rollingRankArgSort)
# 3.56 secs (only provides competition/min ranking for ties)

df["MEAN"] = pd.rolling_mean(df['A'], window=rollWindow)
# 0.008 secs


                        A   SER_RK  SCIPY_RK  BNECK_RK  ASRT_RK     MEAN  
2011-05-29 11:00:00  1.37       23      23.0      23.0     23   0.013526  
2011-05-29 12:00:00  0.45       85      85.0      85.0     85   0.016833   
2011-05-29 13:00:00  7.50        1       1.0       1.0      1   0.049606   
2011-05-29 14:00:00  7.50        1       1.5       1.5      1   0.083655   
2011-05-29 15:00:00  5.50        3       3.0       3.0      3   0.112001 
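For what it's worth, the per-window re-rank can also be expressed without a Python-level callback, by ranking the last column of a strided window view. This is a sketch assuming NumPy >= 1.20 for `sliding_window_view` (the view makes no copy, but it is still O(n*w) comparisons):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def rolling_min_rank_vec(a, window):
    """Descending competition (min) rank of the last element of each
    rolling window, computed with one broadcasted comparison."""
    wins = sliding_window_view(a, window)   # (n - window + 1, window) view
    last = wins[:, -1:]                     # newest element of each window
    ranks = 1 + (wins > last).sum(axis=1)   # 1 + count of strictly greater values
    out = np.full(a.shape, np.nan)
    out[window - 1:] = ranks                # NaN until the window is full
    return out
```

This matches `s.rank(method='min', ascending=False)` for the last element of each window, though it only handles this one tie rule.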

I have previously implemented moving window statistics by maintaining the difference between consecutive windows (online) to cheaply compute the change in rank, whereas here it appears I have to completely re-rank every window, which is unnecessary. I have seen that a similar question has been asked previously: [Pandas performance on rolling stats][2].

  1. Do you know if there is a way in pandas to perform this calculation more efficiently?
  2. Is there an easy way to implement a function on a moving window in pandas where I can find the element(s) added and removed for each step and return a value accordingly, possibly maintaining my own running rank calculation?
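A minimal sketch of the kind of online update I have in mind, using only the stdlib `bisect` module to keep the current window sorted (illustrative, not tuned; inserting into and deleting from a Python list is still O(w) per step, but it avoids the full re-rank):

```python
import bisect

def rolling_min_rank_desc(values, window):
    """Descending competition (min) rank of the newest element in each
    window, maintained incrementally instead of re-ranking every window."""
    sorted_win = []  # sorted copy of the current window
    out = []
    for i, v in enumerate(values):
        bisect.insort(sorted_win, v)
        if i >= window:
            # drop the element that just left the window
            old = values[i - window]
            del sorted_win[bisect.bisect_left(sorted_win, old)]
        if i >= window - 1:
            # rank = 1 + count of elements strictly greater than v
            out.append(len(sorted_win) - bisect.bisect_right(sorted_win, v) + 1)
        else:
            out.append(None)  # window not yet full
    return out
```

This agrees with `s.rank(method='min', ascending=False)` for the last element of each window; a different tie rule would just change how the equal values are counted.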

Thanks

[1]: http://stackoverflow.com/questions/14440187/rank-data-over-a-rolling-window-in-pandas-dataframe
[2]: http://stackoverflow.com/questions/24613850/pandas-performance-on-multiple-rolling-statistics-on-different-time-intervals

1 Answer


The documentation here does what you are describing, I believe.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000',
    periods=1000))
plot1 = pd.rolling_max(ts, 240)
plot2 = pd.rolling_min(ts, 240)
plot3 = pd.rolling_mean(ts, 240)

plt.plot(plot1.values.tolist())
plt.plot(plot2.values.tolist())
plt.plot(plot3.values.tolist())
plt.show()

This is how pandas is optimized to perform the task. If this is not fast enough, I'm not sure that a workaround will be faster than the built-in functions. If this is redundant, feel free to downvote :)

EDIT: Is this more what you were talking about?

ts = pd.Series(np.random.randn(1000000), index=pd.date_range('1/1/2000', periods=1000000))

listofmax = []
for number in range(0, len(ts), 240):
    listofmax.append(ts[number:number+240].max())

With 1 million rows it took 0.4 seconds according to timeit. Granted, this is just a datetime index and a value. Are you looking for something quicker than this, and am I better understanding what you have tried?
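Note that the loop above computes the max of consecutive non-overlapping 240-row chunks, while a rolling statistic produces one value per row once the window is full. A small sketch of the difference, using a toy array:

```python
import numpy as np

a = np.array([3.0, 1.0, 4.0, 1.0, 5.0, 9.0])
w = 3

# non-overlapping chunks, as in the loop above: one value per chunk
chunked = [a[i:i + w].max() for i in range(0, len(a), w)]            # [4.0, 9.0]

# true rolling window: one value per row once the window is full
rolling = [a[i - w + 1:i + 1].max() for i in range(w - 1, len(a))]   # [4.0, 4.0, 5.0, 9.0]
```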

  • Hi, thanks, but I don't think this answers my question. Those methods don't create the rank over a window. I have read through the documentation, which led me to `rolling_apply`. However, this appears to simply apply the function afresh to each window, and over a large dataset it can take a long time to iteratively apply that function. Using the pandas roll function was far too slow. argsort was faster, closer to the bottleneck method above, but I still believe it shouldn't be too difficult to implement a much more efficient way, either by using online windows, or a method I am missing? – PH82 Feb 11 '15 at 06:45
  • Thanks again for your response, but your example returns the max value. I would like the rank of the item within the rolling window, with (if possible) the min method for ties. I have updated my question to create a reproducible example. – PH82 Feb 12 '15 at 10:10