
I'm new to pandas and trying to compute a rolling mean with a fixed window size. I have two lists: one of (start, end) timestamp tuples and one of values, and I'd like the interval lengths from the former to be used as weights for the latter. I would also like the gaps in the data to remain identifiable (the timestamps are not necessarily contiguous).

Example lists:

ts = [(1415969999, 1415970014), (1415970014, 1415970030), (1415970030, 1415970045), (1415970045, 1415970060), (1415970060, 1415970075), (1415970075, 1415970090), (1415970090, 1415970105), (1415970105, 1415970120), (1415970120, 1415970135), (1415970135, 1415970150), (1415970150, 1415970165), (1415970165, 1415970181), (1415970181, 1415970286), (1415970286, 1415970301), (1415970301, 1415970316)...]

values = [8.0, 13.0, 11.75, 7.0, 8.5, 16.0, 16.0, 6.5, 4.0, 8.25, 5.5, 1.0, 0.0, 0.5, 0.5, 0.0, 0.25, 0.0, 0.25, 0.0, 0.5, 0.0, 2.25, 0.0, 0.25, 0.0, 0.25, 0.0, 1.0, 0.25, 0.25, 0.0, 0.25, 0.0, 0.5, 0.25, 0.0, 1.0, 0.0, 0.5...]

What I'm using right now is:

pandas_series = pd.Series(values)
window_averages = pd.rolling_mean(pandas_series, window=90)  # 90 would be seconds here
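
(Side note: pd.rolling_mean has since been removed from pandas; the modern spelling is the .rolling() method. Also, an integer window counts observations rather than seconds, so a true 90-second window needs a DatetimeIndex. A minimal sketch:)

pandas_series = pd.Series(values)
window_averages = pandas_series.rolling(window=90, min_periods=1).mean()
# with a DatetimeIndex on the series, a time-based window is possible:
# window_averages = pandas_series.rolling('90s').mean()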

But that doesn't take into account the weights. I have had a look here and here but can't quite piece this together.
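
For what it's worth, the duration-weighted mean of the whole series (ignoring windows for a moment) would look something like this sketch, assuming both lists line up:

import numpy as np

# each interval's length in seconds serves as that value's weight
durations = [end - start for start, end in ts]
overall = np.average(values[:len(durations)], weights=durations)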

EDIT

I sort of managed to get what I wanted, but I don't think the solution is optimal. The output at the bottom is pretty much what I need, and it keeps the gaps in the data identifiable (which I've represented with -1 for now):

import pandas as pd

# (start_ts, end_ts, value) tuples; note the gap between 1415970030 and 1415970033
data = [(1415970014, 1415970030, 13.0), (1415970033, 1415970048, 11.75), (1415970048, 1415970053, 3.2)]
start_range = data[0][0]
end_range = data[-1][1] - 1
previous_end_time = start_range
values = []

for start_ts, end_ts, value in data:
    # fill any gap before this interval with the -1 marker, one entry per second
    while start_ts > previous_end_time:
        values.append(-1)
        previous_end_time += 1

    # repeat the value once per second of the interval
    window_length = end_ts - start_ts
    values += [value] * window_length
    previous_end_time = end_ts

s_range_datetime_start = pd.to_datetime(start_range, unit='s')
s_range_datetime_end = pd.to_datetime(end_range, unit='s')
period_range = pd.period_range(s_range_datetime_start, s_range_datetime_end, freq='s')

series = pd.Series(values, period_range)
print(series)

Which then yields the following, with the data basically expanded to one-second resolution:

2014-11-14 13:00:14    13.00
2014-11-14 13:00:15    13.00
2014-11-14 13:00:16    13.00
2014-11-14 13:00:17    13.00
2014-11-14 13:00:18    13.00
2014-11-14 13:00:19    13.00
2014-11-14 13:00:20    13.00
2014-11-14 13:00:21    13.00
2014-11-14 13:00:22    13.00
2014-11-14 13:00:23    13.00
2014-11-14 13:00:24    13.00
2014-11-14 13:00:25    13.00
2014-11-14 13:00:26    13.00
2014-11-14 13:00:27    13.00
2014-11-14 13:00:28    13.00
2014-11-14 13:00:29    13.00
2014-11-14 13:00:30    -1.00
2014-11-14 13:00:31    -1.00
2014-11-14 13:00:32    -1.00
2014-11-14 13:00:33    11.75
2014-11-14 13:00:34    11.75
2014-11-14 13:00:35    11.75
2014-11-14 13:00:36    11.75
2014-11-14 13:00:37    11.75
2014-11-14 13:00:38    11.75
2014-11-14 13:00:39    11.75
2014-11-14 13:00:40    11.75
2014-11-14 13:00:41    11.75
2014-11-14 13:00:42    11.75
2014-11-14 13:00:43    11.75
2014-11-14 13:00:44    11.75
2014-11-14 13:00:45    11.75
2014-11-14 13:00:46    11.75
2014-11-14 13:00:47    11.75
2014-11-14 13:00:48     3.20
2014-11-14 13:00:49     3.20
2014-11-14 13:00:50     3.20
2014-11-14 13:00:51     3.20
2014-11-14 13:00:52     3.20

My idea was then to apply the rolling mean to this series.
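
Something like this sketch might do it (assuming the series built above; the -1 gap markers are masked as NaN first so the gaps don't drag the mean down):

import numpy as np

# mask the -1 gap markers so they are excluded from the mean
masked = series.replace(-1, np.nan)
# convert the PeriodIndex to a DatetimeIndex for time-based rolling
rolled = masked.to_timestamp().rolling('90s', min_periods=1).mean()
print(rolled)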

1 Answer

First, package up the data:

In [26]: df = DataFrame(ts)

In [27]: df.columns=['start','end']

Your values list is longer than ts here (for what is shown), so truncate it to match:

In [28]: df['value'] = values[:len(df)]

In [29]: df
Out[29]: 
         start         end  value
0   1415969999  1415970014   8.00
1   1415970014  1415970030  13.00
2   1415970030  1415970045  11.75
3   1415970045  1415970060   7.00
4   1415970060  1415970075   8.50
5   1415970075  1415970090  16.00
6   1415970090  1415970105  16.00
7   1415970105  1415970120   6.50
8   1415970120  1415970135   4.00
9   1415970135  1415970150   8.25
10  1415970150  1415970165   5.50
11  1415970165  1415970181   1.00
12  1415970181  1415970286   0.00
13  1415970286  1415970301   0.50
14  1415970301  1415970316   0.50

Make the timestamps into actual datetimes

In [30]: df['start'] = pd.to_datetime(df['start'],unit='s')

In [31]: df['end'] = pd.to_datetime(df['end'],unit='s')

Sounds like you want to resample everything in a 90s window.

In [32]: df.groupby(pd.Grouper(key='start',freq='90s'))['value'].mean()
Out[32]: 
start
2014-11-14 12:58:30     8.000
2014-11-14 13:00:00    11.250
2014-11-14 13:01:30     6.875
2014-11-14 13:03:00     0.000
2014-11-14 13:04:30     0.500
Freq: 90S, Name: value, dtype: float64
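
Equivalently (a variant, not in the original answer), you could set the start times as the index and resample:

df.set_index('start')['value'].resample('90s').mean()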

Not sure what you mean by weighting the data. Please provide a further example of the expected output.
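
If the intent is to weight each value by its interval's duration within each bucket, a sketch might look like this (the duration column and the helper function are illustrative, not from the original answer):

import numpy as np

# each row's weight is how long its interval lasted, in seconds
df['duration'] = (df['end'] - df['start']).dt.total_seconds()

def duration_weighted_mean(g):
    if g['duration'].sum() == 0:
        return np.nan  # empty bucket (a gap in the data)
    return np.average(g['value'], weights=g['duration'])

df.groupby(pd.Grouper(key='start', freq='90s')).apply(duration_weighted_mean)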

Jeff
  • Great, I think this will work quite a bit better than my approach. However, when I shorten the window length (ex: 30 seconds), I get NaN for certain values, even though the timestamps are continuous (some are caused by gaps in timestamps, and those are fine). How can I fix this - interpolate the values? Thanks. – L-R Nov 15 '14 at 20:32
  • I have just found the `ffill()` method for that last question - Cheers. – L-R Nov 15 '14 at 20:53
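
For reference, .ffill() forward-fills the NaN buckets that land inside the gaps, e.g. (a sketch using the illustrative 30s window from the comment above):

means = df.groupby(pd.Grouper(key='start', freq='30s'))['value'].mean()
filled = means.ffill()  # carry the last observed mean across gap buckets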