Okay, here are my first results (I managed to get a ~7x improvement). However, I'm pretty sure that if you assume there are no NaNs, you can get a ~100x to 1000x speed improvement, but that's for another time. -- update: see the edit below.
Profiling the get_slope function reveals three bottlenecks:
Line # Hits Time Per Hit % Time Line Contents
==============================================================
14 def get_slope(df):
15 998 2374316.0 2379.1 18.6 df = df.dropna()
16 998 494087.0 495.1 3.9 min_date = df.index.min()
17 998 6298961.0 6311.6 49.4 x = (df.index - min_date).total_seconds()/3600
18 998 131353.0 131.6 1.0 y = np.array(df)
19 998 3447066.0 3454.0 27.0 slope, intercept, r_value, p_value, std_err = linregress(x, y)
20 998 8157.0 8.2 0.1 return slope
As we can see, dropna, the creation of x, and the slope calculation are what take the time. There is no easy solution to the dropna problem, but the other two slow parts can be removed. The slope computation actually does a full least-squares fit which, as noted by @zap, can be slightly improved with polyfit, but it can be accelerated even more if we hard-code it:
def get_slope2(df):
    df = df.dropna()  # takes 24.5% of the time
    min_date = df.index.min()  # takes 4.5% of the time
    x = (df.index - min_date).total_seconds()/3600  # takes 70% of the time
    x = x.to_numpy()
    y = df.to_numpy()
    n = len(x)
    xsum = x.sum()/n
    ysum = y.sum()/n
    xx = x.dot(x)/n
    xy = x.dot(y)/n
    den = xx - xsum*xsum
    slope = (xy - xsum * ysum)/den
    return slope
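For reference, the hard-coded part is just the closed-form least-squares slope: with xbar = sum(x)/n and ybar = sum(y)/n, slope = (mean(x*y) - xbar*ybar) / (mean(x*x) - xbar**2). This is exactly the slope linregress computes, minus the work spent on the intercept, r-value, p-value and standard error that we discard anyway.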
This version is already ~1.5x faster. To solve the problem of the computation of x, the solution is to do the conversion to seconds only once, for the whole array, and use the seconds as the index. The slope function would then look like
def get_slope3(df2):
    df2 = df2.dropna()
    x = df2.index.to_numpy()
    x = x - x.min()  # not -=: to_numpy() may return a view of the index buffer
    y = df2.to_numpy()
    n = len(x)
    xsum = x.sum()/n
    ysum = y.sum()/n
    xx = x.dot(x)/n
    xy = x.dot(y)/n
    den = xx - xsum*xsum
    slope = (xy - xsum * ysum)/den
    return slope
with the new dataframe being
min_date = df.index.min()
df2 = df.set_index((df.index - min_date).total_seconds()/3600)
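As a quick sanity check that the variants agree, here is a minimal example. The synthetic hourly data is made up, and the functions are fed a single Series, matching what rolling(...).apply(..., raw=False) passes them:

import numpy as np
import pandas as pd
from scipy.stats import linregress

# synthetic hourly data with a few NaNs -- an assumption, not the real data
idx = pd.date_range("2021-01-01", periods=500, freq="h")
s = pd.Series(np.random.default_rng(0).standard_normal(500).cumsum(), index=idx)
s.iloc[::17] = np.nan

min_date = s.index.min()
s2 = pd.Series(s.to_numpy(), index=(s.index - min_date).total_seconds()/3600)

print(get_slope(s), get_slope2(s), get_slope3(s2))  # should all agree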
With 10000 elements, I get the following timings:
original : time = 6919.07 ms
get_slope2 : time = 4542.78 ms
get_slope3 : time = 942.982 ms
And commenting out the dropna adds an additional ~2x speedup.
Some further optimizations would be to compute everything at once. If there are no NaNs, we can compute the window sums as differences of a global cumsum, which would be insanely fast, allowing O(n) time (with a small constant) regardless of the window size; see the sketch below. If there are NaNs, this approach could still be used by interpolating the values at the NaNs and then recomputing the gradients around the interpolated values by more traditional means, but this gets a bit complicated (though since you said "radically", it might be worth it).
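To make the cumsum idea concrete, here is a rough sketch for the no-NaN case with a window of k consecutive samples (the function name and the fixed sample-count window are my simplification; a time-based window needs the bookkeeping from the edit below):

def get_slopes_cumsum(x, y, k):
    # every window sum is a difference of two entries of a global cumsum,
    # so all windows together cost O(n), regardless of k
    def winsums(v):
        cs = np.concatenate(([0.0], np.cumsum(v)))
        return cs[k:] - cs[:-k]  # sums over the windows [i-k+1, i]
    sx, sy = winsums(x), winsums(y)
    sxx, sxy = winsums(x*x), winsums(x*y)
    den = sxx/k - (sx/k)**2
    return (sxy/k - (sx/k)*(sy/k))/den  # one slope per full window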
Edit: getting a 1000x speedup (+ solving the window problem)
The idea here will be to compute everything in a handful of (numpy) function calls. To do so, we need to know the local x and y for every point, and thus to compute which data points to use based on the window size (given in hours here). The number of data points per window is computed by the following function:
from numba import njit
import numpy as np
import warnings

@njit
def getwinsize(x, win, min_periods):
    # two-pointer sweep: j is the first data point inside the window ending at i
    m = 0
    n = x.size
    out = np.empty(n, dtype=np.int32)
    i = 0
    j = 0
    while i < n:
        if x[j] + win > x[i]:
            # window size at i, or -1 if it holds fewer than min_periods points
            out[i] = i-j+1 if i-j+1 >= min_periods else -1
            m = m if m > out[i] else out[i]
            i += 1
        else:
            j += 1
    return out, m  # per-point window sizes and the largest one
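To see what it computes, here is a tiny hand-checkable example (the numbers are mine):

x = np.array([0.0, 1.0, 2.0, 5.0, 6.0])  # timestamps in hours
out, m = getwinsize(x, 3.0, 1)           # win = 3 hours, min_periods = 1
# out = [1, 2, 3, 1, 2] : points within the 3-hour window ending at each x[i]
# m = 3 : the widest window, used below to size the X/Y matrices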
Using the njit decorator from numba is not necessary, but it surely helps, especially when the inputs are large. The slope-computing function is
def get_slope4(df, winsizeInHours=7*24, min_periods=3):
    min_date = df.index.min()  # reference time for the hour offsets
    hours = (df.index - min_date).total_seconds().to_numpy()/3600
    y = df.to_numpy().ravel()
    N = len(hours)
    locwinsize, maxwinsize = getwinsize(hours, winsizeInHours, min_periods)
    # column i of X/Y holds the data shifted by i, so row r gathers the
    # (up to maxwinsize) points of the window ending at r
    X = np.empty((N, maxwinsize))
    Y = np.empty((N, maxwinsize))
    for i in range(maxwinsize):
        X[i:, i] = hours[:N-i]
        Y[i:, i] = y[:N-i]
    # mask NaNs and the columns that fall outside each local window
    mask = np.isnan(Y)
    for i in range(maxwinsize):
        mask[:, i] = np.logical_or(mask[:, i], locwinsize <= i)
    X[mask] = np.nan
    Y[mask] = np.nan
    XY = X*Y
    XX = X*X
    with warnings.catch_warnings():  # ignore the "mean of empty slice" warning
        warnings.simplefilter("ignore", category=RuntimeWarning)
        Xbar = np.nanmean(X, axis=1)
        Ybar = np.nanmean(Y, axis=1)
        XXbar = np.nanmean(XX, axis=1)
        XYbar = np.nanmean(XY, axis=1)
    den = XXbar - Xbar*Xbar
    slopes = (XYbar - Xbar * Ybar)/den
    return slopes
This code gives the same result as the original one (with window="7d"), but is much faster. Note that the returned value is a numpy array, not a DataFrame.
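If you want to verify the equivalence claim on your own data, something along these lines should work, reusing the synthetic Series s from the sanity check above (the rolling call is my reconstruction of the original setup, so treat it as an assumption):

slopes = get_slope4(s, winsizeInHours=7*24, min_periods=3)
ref = s.rolling("7d", min_periods=3).apply(get_slope, raw=False)  # assumed original setup
refv = ref.to_numpy()
# compare only where both are defined (too-short windows yield NaN on both sides)
both = ~np.isnan(slopes) & ~np.isnan(refv)
print(np.allclose(slopes[both], refv[both]))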
Here are some timings with 10000 samples:
Initial code : time = 6860.03 ms
get_slope4 without numba : time = 28.27 ms
get_slope4 with numba : time = 5.06 ms
So the non-numba version gives a ~240x speed improvement and the numba version a >1000x one, so hopefully that's good enough.