Okay, here are my first results (I managed to get a ~7x improvement). However, I'm pretty sure that if you assume there are no NaNs, you can get a ~100x to 1000x speed improvement, but that's for another time. -- update: see the edit below.
Profiling the get_slope function reveals three bottlenecks:
Line # Hits Time Per Hit % Time Line Contents
==============================================================
14 def get_slope(df):
15 998 2374316.0 2379.1 18.6 df = df.dropna()
16 998 494087.0 495.1 3.9 min_date = df.index.min()
17 998 6298961.0 6311.6 49.4 x = (df.index - min_date).total_seconds()/3600
18 998 131353.0 131.6 1.0 y = np.array(df)
19 998 3447066.0 3454.0 27.0 slope, intercept, r_value, p_value, std_err = linregress(x, y)
20 998 8157.0 8.2 0.1 return slope
As we can see, dropna, the creation of x, and the slope calculation are what take the time. There is no easy solution to the dropna problem, but the other two slow parts can be removed. The slope computation actually does a full least-squares fit which, as noted by @zap, can be slightly improved with polyfit, but it can be accelerated even more if we hard-code it:
def get_slope2(df):
    df = df.dropna()  # takes 24.5% of the time
    min_date = df.index.min()  # takes 4.5% of the time
    x = (df.index - min_date).total_seconds()/3600  # takes 70% of the time
    x = x.to_numpy()
    y = df.to_numpy()
    n = len(x)
    xsum = x.sum()/n
    ysum = y.sum()/n
    xx = x.dot(x)/n
    xy = x.dot(y)/n
    den = xx - xsum*xsum
    slope = (xy - xsum * ysum)/den
    return slope
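For reference, the hard-coded part is just the closed-form least-squares slope: with xbar = sum(x)/n and ybar = sum(y)/n, slope = (mean(x*y) - xbar*ybar) / (mean(x*x) - xbar**2). This is exactly the slope linregress computes, minus the work spent on the intercept, r-value, p-value and standard error that we discard anyway.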
This version is already ~1.5x faster. To solve the problem of the computation of x, the solution is to do the conversion to seconds only once, for the whole array, and use the seconds as the index. The slope function would then look like
def get_slope3(df2):
    df2 = df2.dropna()
    x = df2.index.to_numpy()
    x = x - x.min()  # not -=: to_numpy() may return a view of the index buffer
    y = df2.to_numpy()
    n = len(x)
    xsum = x.sum()/n
    ysum = y.sum()/n
    xx = x.dot(x)/n
    xy = x.dot(y)/n
    den = xx - xsum*xsum
    slope = (xy - xsum * ysum)/den
    return slope
with the new dataframe being
min_date = df.index.min()
df2 = df.set_index((df.index - min_date).total_seconds()/3600)
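As a quick sanity check that the variants agree, here is a minimal example. The synthetic hourly data is made up, and the functions are fed a single Series, matching what rolling(...).apply(..., raw=False) passes them:

import numpy as np
import pandas as pd
from scipy.stats import linregress

# synthetic hourly data with a few NaNs -- an assumption, not the real data
idx = pd.date_range("2021-01-01", periods=500, freq="h")
s = pd.Series(np.random.default_rng(0).standard_normal(500).cumsum(), index=idx)
s.iloc[::17] = np.nan

min_date = s.index.min()
s2 = pd.Series(s.to_numpy(), index=(s.index - min_date).total_seconds()/3600)

print(get_slope(s), get_slope2(s), get_slope3(s2))  # should all agree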
With 10000 elements, I get the following timings:
original : time = 6919.07 ms
get_slope2 : time = 4542.78 ms
get_slope3 : time = 942.982 ms
And commenting out the dropna adds an additional ~2x speedup.
Some further optimizations would be to compute everything at once. If there are no NaNs, we can compute the window sums as differences of a global cumsum, which would be insanely fast, allowing O(n) time (with a small constant) regardless of the window size; see the sketch below. If there are NaNs, this approach could still be used by interpolating the values at the NaNs and then recomputing the gradients around the interpolated values by more traditional means, but this gets a bit complicated (though since you said "radically", it might be worth it).
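To make the cumsum idea concrete, here is a rough sketch for the no-NaN case with a window of k consecutive samples (the function name and the fixed sample-count window are my simplification; a time-based window needs the bookkeeping from the edit below):

def get_slopes_cumsum(x, y, k):
    # every window sum is a difference of two entries of a global cumsum,
    # so all windows together cost O(n), regardless of k
    def winsums(v):
        cs = np.concatenate(([0.0], np.cumsum(v)))
        return cs[k:] - cs[:-k]  # sums over the windows [i-k+1, i]
    sx, sy = winsums(x), winsums(y)
    sxx, sxy = winsums(x*x), winsums(x*y)
    den = sxx/k - (sx/k)**2
    return (sxy/k - (sx/k)*(sy/k))/den  # one slope per full window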
Edit: getting a 1000x speedup (+ solving the window problem)
The idea here will be to compute everything in a handful of (numpy) function calls. To do so, we need to know the local x and y for every point, and thus to compute which data points to use based on the window size (given in hours here). The number of data points per window is computed by the following function:
from numba import njit
import numpy as np
import warnings

@njit
def getwinsize(x, win, min_periods):
    # two-pointer sweep: j is the first data point inside the window ending at i
    m = 0
    n = x.size
    out = np.empty(n, dtype=np.int32)
    i = 0
    j = 0
    while i < n:
        if x[j] + win > x[i]:
            # window size at i, or -1 if it holds fewer than min_periods points
            out[i] = i-j+1 if i-j+1 >= min_periods else -1
            m = m if m > out[i] else out[i]
            i += 1
        else:
            j += 1
    return out, m  # per-point window sizes and the largest one
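To see what it computes, here is a tiny hand-checkable example (the numbers are mine):

x = np.array([0.0, 1.0, 2.0, 5.0, 6.0])  # timestamps in hours
out, m = getwinsize(x, 3.0, 1)           # win = 3 hours, min_periods = 1
# out = [1, 2, 3, 1, 2] : points within the 3-hour window ending at each x[i]
# m = 3 : the widest window, used below to size the X/Y matrices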
Using the njit decorator from numba is not necessary, but it surely helps, especially when the inputs are large. The slope-computing function is
def get_slope4(df, winsizeInHours=7*24, min_periods=3):
    min_date = df.index.min()  # reference time for the hour offsets
    hours = (df.index - min_date).total_seconds().to_numpy()/3600
    y = df.to_numpy().ravel()
    N = len(hours)
    locwinsize, maxwinsize = getwinsize(hours, winsizeInHours, min_periods)
    # column i of X/Y holds the data shifted by i, so row r gathers the
    # (up to maxwinsize) points of the window ending at r
    X = np.empty((N, maxwinsize))
    Y = np.empty((N, maxwinsize))
    for i in range(maxwinsize):
        X[i:, i] = hours[:N-i]
        Y[i:, i] = y[:N-i]
    # mask NaNs and the columns that fall outside each local window
    mask = np.isnan(Y)
    for i in range(maxwinsize):
        mask[:, i] = np.logical_or(mask[:, i], locwinsize <= i)
    X[mask] = np.nan
    Y[mask] = np.nan
    XY = X*Y
    XX = X*X
    with warnings.catch_warnings():  # ignore the "mean of empty slice" warning
        warnings.simplefilter("ignore", category=RuntimeWarning)
        Xbar = np.nanmean(X, axis=1)
        Ybar = np.nanmean(Y, axis=1)
        XXbar = np.nanmean(XX, axis=1)
        XYbar = np.nanmean(XY, axis=1)
    den = XXbar - Xbar*Xbar
    slopes = (XYbar - Xbar * Ybar)/den
    return slopes
This code gives the same result as the original one (with window="7d"), but is much faster. Note that the returned value is a numpy array, not a DataFrame.
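If you want to verify the equivalence claim on your own data, something along these lines should work, reusing the synthetic Series s from the sanity check above (the rolling call is my reconstruction of the original setup, so treat it as an assumption):

slopes = get_slope4(s, winsizeInHours=7*24, min_periods=3)
ref = s.rolling("7d", min_periods=3).apply(get_slope, raw=False)  # assumed original setup
refv = ref.to_numpy()
# compare only where both are defined (too-short windows yield NaN on both sides)
both = ~np.isnan(slopes) & ~np.isnan(refv)
print(np.allclose(slopes[both], refv[both]))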
Here are some timings with 10000 samples:
Initial code : time = 6860.03 ms
get_slope4 without numba : time = 28.27 ms
get_slope4 with numba : time = 5.06 ms
So the non-numba version gives a ~240x speed improvement and the numba version a >1000x one, so hopefully that's good enough.