
This question has a lot of useful answers on how to get a moving average. I have tried the two methods of numpy convolution and numpy cumsum and both worked fine on an example dataset, but produced a shorter array on my real data.

The data are spaced by 0.01. The example dataset has a length of 50, the real data tens of thousands. So it must be something about the window size that is causing the problem, but I don't quite understand what is going on inside the functions.

This is how I define the functions:

import numpy as np

def smoothMAcum(depth, temp, scale):  # moving average by cumsum; scale = window size in m
    dz = np.diff(depth)
    N = int(scale / dz[0])  # window size in number of samples
    cumsum = np.cumsum(np.insert(temp, 0, 0))
    smoothed = (cumsum[N:] - cumsum[:-N]) / N
    return smoothed

def smoothMAconv(depth, temp, scale):  # moving average by numpy convolution
    dz = np.diff(depth)
    N = int(scale / dz[0])  # window size in number of samples
    smoothed = np.convolve(temp, np.ones(N)/N, mode='valid')
    return smoothed

Then I apply it:

scale = 5.
smooth = smoothMAconv(dep,data, scale)

but print len(dep), len(smooth) returns 81071 80572,

and the same happens with the other function. How can I get the smoothed array to have the same length as the data?

And why did it work on the small dataset? Even when I try different scales (using the same one for the example and for the real data), the result in the example has the same length as the original data, but not in the real application. I considered an effect of NaN values, but adding a NaN to the example makes no difference.

So where is the problem, if it is possible to tell without the full dataset?
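
For a reproducible example, the mismatch also shows up with simulated data of the same spacing (a minimal sketch; the values are random, only the lengths matter):

import numpy as np

dep = np.arange(0, 100, 0.01)             # spacing 0.01, like the real data
data = np.random.uniform(size=dep.shape)  # simulated measurements
smooth = smoothMAconv(dep, data, 5.)
print len(dep), len(smooth)               # e.g. 10000 9501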

durbachit
  • It is possible to include a reproducible example with a large data set by simulating the data (such as the random array in my answer). –  Nov 25 '17 at 20:08

1 Answer


The lengths differ because mode='valid' returns only the positions where the window fully overlaps the data, i.e. len(temp) - N + 1 values; with your spacing of 0.01 and scale = 5., N = 500, and 81071 - 500 + 1 = 80572, which matches your output (the cumsum version slices away the same N - 1 elements, and the small example presumably ended up with N = 1, for which nothing is lost). The second of your approaches is easy to modify to preserve the length, because numpy.convolve supports the parameter mode='same'.

np.convolve(temp, np.ones(N)/N, mode='same')

This is made possible by zero-padding the data set temp on both sides, which will inevitably have some effect at the boundaries unless your data happens to be close to 0 near them. Example:

import numpy as np
import matplotlib.pyplot as plt

N = 10
x = np.linspace(0, 2, 100)
y = x**2 + np.random.uniform(size=x.shape)
y_smooth = np.convolve(y, np.ones(N)/N, mode='same')
plt.plot(x, y, 'r.')
plt.plot(x, y_smooth)
plt.show()

[figure: noisy data (red dots) with the mode='same' moving average; the smoothed curve dips toward zero at the right edge]

The boundary effect of zero-padding is very visible at the right end, where the data values are around 4 to 5 but are padded with zeros.

To reduce this undesired effect, use numpy.pad for smarter padding and revert to mode='valid' for the convolution. The pad widths must be chosen so that in total N - 1 elements are added, where N is the size of the moving window.

y_padded = np.pad(y, (N//2, N-1-N//2), mode='edge')  # repeat the edge values
y_smooth = np.convolve(y_padded, np.ones(N)/N, mode='valid')
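
A quick sanity check (reusing y, N, and y_smooth from the snippet above) confirms that the length is preserved and the edges now track the data:

print len(y), len(y_smooth)   # 100 100 - same length as the input
print y[-3:]                  # right-edge data, roughly between 4 and 5
print y_smooth[-3:]           # smoothed values stay in that range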

[figure: the same data smoothed after edge padding; the curve follows the data at both boundaries]

Padding with the edge values of the array looks much better.
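
Putting it together, the smoothMAconv from the question can be adapted along the same lines (a sketch under the assumptions above; smoothMAconvPadded is just an illustrative name):

def smoothMAconvPadded(depth, temp, scale):
    # Moving average by convolution with edge padding;
    # scale = window size in m; output length equals len(temp)
    dz = np.diff(depth)
    N = int(scale / dz[0])
    temp_padded = np.pad(temp, (N//2, N-1-N//2), mode='edge')
    return np.convolve(temp_padded, np.ones(N)/N, mode='valid')

Then len(smoothMAconvPadded(dep, data, 5.)) equals len(dep).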