I have a pandas DataFrame of measurements and corresponding weights:
df = pd.DataFrame({'x': np.random.randn(1000), 'w': np.random.rand(1000)})
I want to smooth the measurement values (x) while taking the element-wise weights (w) into account. This is independent of the sliding window's weights, which I'd also like to apply (e.g. a triangle window, or something fancier). So, to calculate the smoothed value within each window, the function should weight the sliced elements of x not only by the window function (e.g. triangle), but also by the corresponding elements in w.
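Written out, the smoothed value I'm after looks roughly like the following. The symbols t_k (window weights) and h (window half-width) are notation introduced here for illustration only, and edge handling is ignored:

```latex
\[
y_i \;=\; \frac{\sum_{k=-h}^{h} t_k \, w_{i+k} \, x_{i+k}}
               {\sum_{k=-h}^{h} t_k \, w_{i+k}}
\]
```

With a uniform window (all t_k equal) this reduces to an ordinary rolling average weighted by w alone.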
As far as I can tell, pd.rolling_apply won't do it, because it applies the given function over x and w separately. Similarly, pd.rolling_window also doesn't take the source DataFrame's element-wise weights into account; the weighted window (e.g. 'triangle') can be user-defined, but is fixed up front.
Here's my slow-ish implementation:
import numpy as np

def rolling_weighted_triangle(x, w, window_size):
    """Smooth with triangle window, also using per-element weights."""
    # Simplify slicing
    wing = window_size // 2
    # Pad both arrays with mirror-image values at edges
    xp = np.r_[x[wing-1::-1], x, x[:-wing-1:-1]]
    wp = np.r_[w[wing-1::-1], w, w[:-wing-1:-1]]
    # Generate a (triangular) window of weights to slide.
    # Integer steps scaled by incr always give exactly `wing` ramp values;
    # a float-step arange can return one element too many.
    incr = 1. / (wing + 1)
    ramp = np.arange(1, wing + 1) * incr
    triangle = np.r_[ramp, 1.0, ramp[::-1]]
    # Apply both sets of weights over each window
    slices = (slice(i - wing, i + wing + 1) for i in range(wing, len(x) + wing))
    out = (np.average(xp[slc], weights=triangle * wp[slc]) for slc in slices)
    return np.fromiter(out, x.dtype)
How can I speed this up with numpy/scipy/pandas?
The dataframe can take up a nontrivial portion of RAM already (10k to 200M rows), so e.g. allocating a 2D array of window-weights-per-element up front is too much. I'm trying to minimize the use of temporary arrays, maybe using np.lib.stride_tricks.as_strided and np.apply_along_axis or np.convolve, but haven't found anything to fully replicate the above.
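For reference, one direction that np.convolve seems to allow: because the window weights are fixed, the doubly weighted average can be written as a ratio of two 1-D convolutions, one over x * w and one over w. The following is a minimal sketch under stated assumptions, not a benchmarked implementation: it assumes 1-D inputs, window_size >= 3, and a symmetric window, and the name rolling_weighted_triangle_conv is made up for illustration.

```python
import numpy as np

def rolling_weighted_triangle_conv(x, w, window_size):
    """Triangle-window smoothing with per-element weights, via two convolutions.

    Hypothetical helper; assumes 1-D inputs and window_size >= 3.
    """
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    wing = window_size // 2
    # Mirror-pad both arrays at the edges, as in the loop-based version
    xp = np.r_[x[wing-1::-1], x, x[:-wing-1:-1]]
    wp = np.r_[w[wing-1::-1], w, w[:-wing-1:-1]]
    # Triangle window of length 2 * wing + 1
    ramp = np.arange(1, wing + 1) / (wing + 1.0)
    triangle = np.r_[ramp, 1.0, ramp[::-1]]
    # Numerator: per-window sum of triangle * w * x
    # Denominator: per-window sum of triangle * w
    # mode='valid' yields exactly len(x) values. Convolution flips the
    # kernel, which has no effect here because the triangle is symmetric.
    num = np.convolve(xp * wp, triangle, mode='valid')
    den = np.convolve(wp, triangle, mode='valid')
    return num / den
```

This only creates a handful of 1-D temporaries of roughly len(x) elements each. For an asymmetric window, the kernel would need to be reversed first (or np.correlate used instead) to keep its orientation.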
Here's the equivalent of rolling_weighted_triangle with a uniform window rather than a triangle, using the get_sliding_window trick from the Stack Overflow answer cited in its docstring. It's close, but not quite there:
from numpy.lib.stride_tricks import as_strided

def get_sliding_window(a, width):
    """Sliding window over a 2D array.

    Source: https://stackoverflow.com/questions/37447347/dataframe-representation-of-a-rolling-window/41406783#41406783
    """
    # NB: a = df.values or np.vstack([x, y]).T
    s0, s1 = a.strides
    m, n = a.shape
    return as_strided(a,
                      shape=(m-width+1, width, n),
                      strides=(s0, s0, s1))

def rolling_weighted_average(x, w, window_size):
    """Rolling weighted average with a uniform 'boxcar' window."""
    wing = window_size // 2
    window_size = 2 * wing + 1
    xp = np.r_[x[wing-1::-1], x, x[:-wing-1:-1]]
    wp = np.r_[w[wing-1::-1], w, w[:-wing-1:-1]]
    x_w = np.vstack([xp, wp]).T
    wins = get_sliding_window(x_w, window_size)
    # TODO - apply triangle window weights - multiply over wins[:,:,1]?
    result = np.average(wins[:,:,0], axis=1, weights=wins[:,:,1])
    return result
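One possible way to fill in the TODO without materializing a 2D weights array is to keep the strided views and let np.einsum do the multiply-and-sum in a single pass. Multiplying wins[:,:,1] by the triangle directly would allocate a full (rows, window) temporary, which is what I'm trying to avoid. The sketch below is illustrative only: the name rolling_weighted_triangle_strided is hypothetical, it assumes 1-D inputs and window_size >= 3, and whether einsum avoids large internal buffers on very long arrays would need to be checked by profiling.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

def rolling_weighted_triangle_strided(x, w, window_size):
    """Triangle-window smoothing with per-element weights, via strided views.

    Hypothetical helper; assumes 1-D inputs and window_size >= 3.
    """
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    wing = window_size // 2
    width = 2 * wing + 1
    # Mirror-pad both arrays at the edges
    xp = np.r_[x[wing-1::-1], x, x[:-wing-1:-1]]
    wp = np.r_[w[wing-1::-1], w, w[:-wing-1:-1]]
    # Triangle window of length `width`
    ramp = np.arange(1, wing + 1) / (wing + 1.0)
    triangle = np.r_[ramp, 1.0, ramp[::-1]]
    n = len(x)
    # Read-only overlapping views of shape (n, width); no data is copied
    sx, = xp.strides
    sw, = wp.strides
    x_wins = as_strided(xp, shape=(n, width), strides=(sx, sx), writeable=False)
    w_wins = as_strided(wp, shape=(n, width), strides=(sw, sw), writeable=False)
    # Row-wise sums of x * w * triangle and of w * triangle
    num = np.einsum('ij,ij,j->i', x_wins, w_wins, triangle)
    den = np.einsum('ij,j->i', w_wins, triangle)
    return num / den
```

Passing writeable=False guards against accidental writes through the overlapping views, which would otherwise silently modify several windows at once.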