
I often use a time-averaged view of my data so that it is less noisy when I plot it. For example, if my data is taken every 1 minute, then I have two arrays, ts and ys. I then create fs, which is a local average of the 60 nearest points in ys. I do this convolution myself by simply calculating the average of the 60 nearest points, without using numpy or any other modules.
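Something like this is what I mean (a minimal sketch; local_average is just an illustrative name):

def local_average(ys, window=60):
    # centered average over the `window` nearest points,
    # assuming one sample per minute with no gaps
    half = window // 2
    fs = []
    for i in range(len(ys)):
        lo = max(0, i - half)
        hi = min(len(ys), i + half)
        fs.append(sum(ys[lo:hi]) / float(hi - lo))
    return fs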

I have new data where ts is a bit sparser. That is, sometimes I miss some datapoints, so I can't simply take an average of the 60 nearest points. If my independent variable, ts, is in minutes, how do I calculate an hourly average of my dependent variable, ys, to create an hourly average function, fs, in Python?

Curt F.
drjrm3
    This may be a solution: http://stackoverflow.com/questions/15771472/pandas-rolling-mean-by-time-interval – M4rtini Jan 27 '16 at 13:51
  • Anything wrong with just taking the points from within +/- 30 minutes, summing them and dividing by the count? – Mark Ransom Jan 27 '16 at 22:44
  • That's kind of what I ended up doing. I'll post my solution below. – drjrm3 Jan 27 '16 at 22:50
  • I don't know why this was marked as a duplicate. The supposedly antecedent question involves pandas and is far more specialized than this one. – Curt F. Jan 28 '16 at 05:00

3 Answers


If my independent variable, ts, is in minutes, how do I calculate an hourly average of my dependent variable, ys, to create an hourly average function, fs, in python?

This is a complex problem and possible answers vary widely depending on what you mean by "hourly average".

One approach to dealing with irregularly spaced data is to resample it. The resampling can be done with interpolation, and the resulting resampled data is then usable with whatever filter method you like.

import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import savgol_filter
# %matplotlib inline  # only needed in a Jupyter/IPython notebook

def y(t):
    # a function to simulate data
    return np.sin(t/20.) + 0.05*np.random.randn(len(t))

four_hours = np.arange(240)
random_time_points = np.sort(np.random.choice(four_hours, size=30, replace=False))

simulated_data = y(random_time_points)
resampled_data = np.interp(four_hours, random_time_points, simulated_data)

# here I smooth with a Savitzky-Golay filter,
#  but you could use a moving avg or anything else
#  window_length=61 smooths over a ~1-hour window (61 points; the length must be odd)
smoothed_data = savgol_filter(resampled_data, window_length=61, polyorder=0)

# plot some results
plt.plot(random_time_points, simulated_data, '.k', 
         four_hours, smoothed_data, '--b',
         four_hours, y(four_hours), '-g')

# save plot
plt.savefig('SO35038933.png')

[Plot of data]

The plot shows the original "sparse" data (black points), the original "true" data (green curve), and the smoothed data (dashed blue curve).

Curt F.
  • Unfortunately my data was too large and this solution became so memory intensive that it wasn't viable. – drjrm3 Jan 27 '16 at 22:48
  • How many time points are in your data? If memory is limiting there is no reason that the data couldn't be broken into chunks, and the process I describe applied to each chunk. – Curt F. Jan 27 '16 at 22:59
  • True. For that matter there is no reason that I can't write this in C using MPI and put it on a non-uniform memory architecture where the memory can be spread out among several hosts. Unfortunately dev time is a factor and I am using python for a quick and dirty solution with minimal hassle. – drjrm3 Jan 27 '16 at 23:02
  • And to answer your question ... lots – drjrm3 Jan 27 '16 at 23:03
  • If you are confused about how to chunk the data check SO for some ideas. It's a pretty common tactic. – Curt F. Jan 28 '16 at 00:14
  • Thanks - not a confusion issue, just a time constraint issue. – drjrm3 Jan 28 '16 at 00:59
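
A minimal sketch of the chunking idea from the comments above, assuming the same interpolate-then-smooth approach (chunk_minutes and margin are illustrative choices, not part of the original answer; ts must be sorted and each padded chunk must span at least 61 minutes for the filter):

import numpy as np
from scipy.signal import savgol_filter

def smooth_in_chunks(ts, ys, chunk_minutes=10000, margin=60):
    # build and smooth the 1-minute output grid one chunk at a time
    # instead of all at once, to limit memory use
    t_min, t_max = int(ts[0]), int(ts[-1])
    pieces = []
    for start in range(t_min, t_max + 1, chunk_minutes):
        stop = min(t_max + 1, start + chunk_minutes)
        # pad each chunk by `margin` minutes so the filter has edge context
        grid = np.arange(max(t_min, start - margin), min(t_max + 1, stop + margin))
        resampled = np.interp(grid, ts, ys)
        smoothed = savgol_filter(resampled, window_length=61, polyorder=0)
        keep = (grid >= start) & (grid < stop)  # discard the padding again
        pieces.append(smoothed[keep])
    return np.arange(t_min, t_max + 1), np.concatenate(pieces)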

If I am understanding you correctly, I think something like this may work.

import threading

hours_worth_of_data = []

def timer():
    threading.Timer(3600, timer).start()  # 3600 seconds in an hour
    if hours_worth_of_data:  # guard against an hour with no data
        smooth = sum(hours_worth_of_data) / len(hours_worth_of_data)
        # do something with smooth here
    del hours_worth_of_data[:]  # start over with fresh data for the next hour

timer()

Whenever you get new data, append it to hours_worth_of_data. Every hour the timer will average the list's contents and then clear it.
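For example, the collection side might just append each new reading as it arrives (read_sensor is a hypothetical stand-in for however you acquire a datapoint):

import time

def read_sensor():
    # hypothetical stand-in for your real data source
    return 42.0

while True:
    hours_worth_of_data.append(read_sensor())
    time.sleep(60)  # one reading per minute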

Mike C.

I ended up creating an array representing the data in terms of the unit of time I was interested in and then performing statistics on that array. For example, convert the 'minute' times into 'hour' times and work with the average value of ys within each hour:

tHs = [0] * len(ts0)
for i in range(len(ts0)):
    tM = ts0[i]       # time in minutes
    tH = tM / 60.0    # time in hours
    tHs[i] = int(tH)  # the hour this datapoint falls in

tHs0 = tHs[:]            # keep a record of the original hourly times, there are repeats here
tHs = sorted(set(tHs0))  # now we have a list of the hours with no repeats

ys = [0.0] * len(tHs)    # accumulated y values for each hour
Cs = [0] * len(tHs)      # how many datapoints fell in each hour
for i in range(len(ts0)):
    y0 = ys0[i]
    t0 = ts0[i]
    tH = int(t0 / 60.0)
    ys[tHs.index(tH)] += y0
    Cs[tHs.index(tH)] += 1  # keep a record of how many times this hour was seen

for i in range(len(ys)):
    ys[i] = ys[i] / Cs[i]   # average within each hour
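
As a variation (a sketch, not what I actually ran), a dictionary keyed by hour avoids the repeated tHs.index(tH) scans:

sums, counts = {}, {}
for t0, y0 in zip(ts0, ys0):
    tH = int(t0 / 60.0)                  # the hour this datapoint falls in
    sums[tH] = sums.get(tH, 0.0) + y0
    counts[tH] = counts.get(tH, 0) + 1

tHs = sorted(sums)                       # hours with no repeats
ys = [sums[h] / counts[h] for h in tHs]  # hourly averages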
drjrm3