
I see that pandas does not yet allow duplicate time-series indexes (https://github.com/pydata/pandas/issues/643), though support is planned. I am wondering whether there is a good way to apply rolling-window means to a dataset with duplicate times, grouped by a multi-index tag/column.

Basically I have a CSV of non-ordered events, each consisting of an epoch time, hierarchical tags (tag1, tag2), and a time taken. A small sample:

 epochTimeMS,event,tag,timeTakenMS
 1331782842801,event1,tag1,16
 1331782841535,event1,tag2,1278
 1331782842801,event1,tag1,17
 1331782842381,event2,tag1,436

What I want to do is build and graph rolling means over varying-millisecond windows, by event and by event+tag. This seems like something pandas should be able to do, but I'm not sure whether I need to wait for duplicate time-series index support first. Any thoughts on hacking this in place now?
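For reference, a minimal sketch of the intended computation (assuming a modern pandas, where time-offset rolling windows and duplicate timestamps in a DatetimeIndex are supported; the column names come from the sample above, and the 1-second window is just an illustrative choice):

```python
import pandas as pd

# Rebuild the sample data from the question; epochTimeMS is milliseconds since the epoch
df = pd.DataFrame({
    "epochTimeMS": [1331782842801, 1331782841535, 1331782842801, 1331782842381],
    "event": ["event1", "event1", "event1", "event2"],
    "tag": ["tag1", "tag2", "tag1", "tag1"],
    "timeTakenMS": [16, 1278, 17, 436],
})
df["ts"] = pd.to_datetime(df["epochTimeMS"], unit="ms")

# Per-event rolling mean over a 1-second time window; duplicate timestamps are fine
rolled = (
    df.sort_values("ts")
      .set_index("ts")
      .groupby("event")["timeTakenMS"]
      .rolling("1s")
      .mean()
)
print(rolled)
```

The same pattern with `groupby(["event", "tag"])` gives the event+tag variant.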

piRSquared
Aaron

1 Answer


There's nothing really to stop you right now:

In [17]: idf = df.set_index(['tag', 'epochTimeMS'], verify_integrity=False).sort_index()

In [18]: idf
Out[18]: 
                     event  timeTakenMS
tag  epochTimeMS                       
tag1 1331782842381  event2          436
     1331782842801  event1           16
     1331782842801  event1           17
tag2 1331782841535  event1         1278

In [20]: idf.ix['tag1']
Out[20]: 
                event  timeTakenMS
epochTimeMS                       
1331782842381  event2          436
1331782842801  event1           16
1331782842801  event1           17

Accessing specific values by timestamp will raise an exception (this is going to be improved, as you mention), but you can certainly work with the data. As for a fixed-length (in time) window, that isn't supported very well yet, but I created an issue here:

https://github.com/pydata/pandas/issues/936
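In the meantime, one way to hack a time-length window in place is to compute it by hand over the sorted epoch times with a cumulative sum and `numpy.searchsorted` (a sketch; the helper name `rolling_time_mean` is my own, not a pandas API):

```python
import numpy as np

def rolling_time_mean(times_ms, values, window_ms):
    """Mean of `values` over the window (t - window_ms, t] for each
    timestamp t, given `times_ms` sorted in ascending order."""
    times = np.asarray(times_ms)
    vals = np.asarray(values, dtype=float)
    csum = np.concatenate(([0.0], np.cumsum(vals)))
    # first index whose time is strictly greater than t - window_ms
    left = np.searchsorted(times, times - window_ms, side="right")
    right = np.arange(1, len(times) + 1)  # window ends at each row, inclusive
    return (csum[right] - csum[left]) / (right - left)

# Sample data from the question, sorted by time; 1000 ms window
out = rolling_time_mean(
    [1331782841535, 1331782842381, 1331782842801, 1331782842801],
    [1278, 436, 16, 17],
    1000,
)
```

Apply it per group (e.g. after a `groupby` on event or event+tag) to get the grouped variants.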

If you could speak up on the mailing list about the API requirements of your application, it would help me and the other developers, since we're actively working on the time-series capabilities right now.

Wes McKinney
  • Thanks much for the clarification on verify_integrity. Bear with me, please; my data-analysis nomenclature is limited. Do I understand correctly that the smallest current granularity for moving-window time-series data is one day? If I have second-level data, is averaging every 30 seconds' worth of data, grouped by tags, not possible? Should I simply post my use case to http://groups.google.com/group/pystatsmodels ? – Aaron Mar 18 '12 at 23:02
  • The moving window functions are array functions and have a fixed window size in number of values (e.g. 30 values). So the concept of a 30-second window doesn't make sense (yet). If you don't have any values falling between seconds, you could reconform the data to a per-second time rule using `asfreq` (see pandas.core.datetools) and then use `rolling_mean` with `window=30` – Wes McKinney Mar 19 '12 at 14:32
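The suggestion in that comment can be sketched with today's API (an assumption on my part: `asfreq`-style conforming and `rolling_mean` have since become `resample(...)` and `.rolling(...)` in modern pandas; the 30-second figure is taken from the comment):

```python
import pandas as pd

# Sample values from the question, indexed by their millisecond timestamps
s = pd.Series(
    [16, 1278, 17, 436],
    index=pd.to_datetime(
        [1331782842801, 1331782841535, 1331782842801, 1331782842381], unit="ms"
    ),
)

# Conform the data to one value per second, then take a 30-value
# (i.e. 30-second) count-based rolling mean over that regular grid
per_second = s.resample("1s").mean()
smoothed = per_second.rolling(window=30, min_periods=1).mean()
print(smoothed)
```

With seconds that contain several events, `resample("1s").mean()` averages them first; seconds with no events become NaN, which `rolling` skips subject to `min_periods`.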