
I am new to pandas and would like to know how best to use time-bounded sliding windows and rolling statistics. I process a continuous stream and compute several rolling statistics (weighted average, mean, sum, max, oldest value...) over different time windows (1 hr, 4 hrs, 1 day, 1 week...), also grouped by item ID.

An output stream is produced for each item, containing its own rolling statistics as well as statistics from similar items (rows are linked on the closest variable time spans, using previously identified similar items).

I currently use custom code (no pandas), whose large speed advantage comes from two things: rolling statistics are computed differentially (i.e. only the new data entering and the old data leaving the sliding window are processed), and the variable timespans of similar items are linked as the stream arrives. I would like to switch to pandas, but want to be sure about the expected performance first.
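For illustration, here is a minimal sketch of the differential approach my custom code takes for the rolling sum (plain Python; the names are made up, and the real code also maintains the other statistics such as weighted averages and max):

```python
from collections import deque
from datetime import datetime, timedelta

class DifferentialRollingSum:
    """Time-bounded rolling sum maintained differentially: each new
    sample only adds its own value and subtracts the values that just
    left the window, instead of re-summing the whole window."""

    def __init__(self, window: timedelta):
        self.window = window
        self.samples = deque()  # (timestamp, value) pairs, oldest first
        self.total = 0.0

    def update(self, ts: datetime, value: float) -> float:
        # Add the incoming sample to the running total.
        self.samples.append((ts, value))
        self.total += value
        # Evict samples older than the window, subtracting each one.
        cutoff = ts - self.window
        while self.samples and self.samples[0][0] < cutoff:
            _, old_value = self.samples.popleft()
            self.total -= old_value
        return self.total
```

With this, a new sample costs O(number of evicted values) instead of O(window size): e.g. `DifferentialRollingSum(timedelta(hours=1))` maintains the 1 hr sum, and the mean follows by dividing by `len(self.samples)`.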

Is there a way to achieve similar (or better) performance with pandas? Specifically:

  1. does pandas compute each rolling statistic over all the values in the sliding window, or does it do a differential calculation on the new/expired values only? And when writing "custom functions" for rolling statistics, could I also compute differentially (as in the sketch above) to avoid the huge cost of reprocessing every value?

  2. what is the most efficient way to declare multiple rolling statistics over several time windows? If I also want to group this by item, I assume I could just add something like "my_stream.groupby(item_key)"; would that still be efficient? (See the first sketch after this list.)

  3. output: for each item, I output its own rolling statistics plus the statistics of similar items, but the timespans are variable (from 10 min to 40 min). How can I link each row of one item to the row of another item with the closest "older" timestamp (I mean: if Item 1 has a row at 02:00 and Item 2 has rows at 01:50 and 02:01, I should link to the 01:50 row)? Will this heavily impact performance? (See the second sketch after this list.)
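To make question 2 concrete, here is a rough sketch of the kind of declaration I have in mind, based on the sample input below. It assumes rolling windows accept a time offset string, as `rolling("1h")` on a datetime-indexed frame does in newer pandas (0.19+); the column names and the day part of the dates are my own additions:

```python
import pandas as pd

# Stream mirroring the input table below; I assume all rows fall on
# 2014-01-01, since the original dates omit the day.
df = pd.DataFrame({
    "Item":  [1, 2, 1, 1],
    "Price": [10, 20, 20, 20],
    "Date":  pd.to_datetime(["2014-01-01 01:01:01",
                             "2014-01-01 01:01:02",
                             "2014-01-01 01:21:00",
                             "2014-01-01 01:31:01"]),
}).set_index("Date").sort_index()

# One time-offset rolling window per timespan, grouped by item.
for window in ["1h", "4h", "24h"]:
    roll = df.groupby("Item")["Price"].rolling(window)
    df[f"Mean{window}"] = roll.mean().reset_index(level=0, drop=True)
    df[f"Sum{window}"] = roll.sum().reset_index(level=0, drop=True)
```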
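And for question 3, a sketch of the "closest older timestamp" link. It assumes an as-of style merge is available (`pd.merge_asof` with `direction="backward"` does exactly this in newer pandas); the frame and column names are mine:

```python
import pandas as pd

# Item 1's output rows, and the statistics of Item 2 (a previously
# identified similar item).
item1 = pd.DataFrame({
    "Date":  pd.to_datetime(["2014-01-01 02:00:00"]),
    "Price": [15],
})
item2 = pd.DataFrame({
    "Date":           pd.to_datetime(["2014-01-01 01:50:00",
                                      "2014-01-01 02:01:00"]),
    "SimilarMean1hr": [16, 18],
})

# Attach, to each Item 1 row, the Item 2 row with the closest
# timestamp at or before it: 02:00 picks 01:50, not 02:01.
linked = pd.merge_asof(item1.sort_values("Date"),
                       item2.sort_values("Date"),
                       on="Date",
                       direction="backward")
```

Since the as-of merge works in a single pass over sorted keys, I would not expect it to be the bottleneck.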

I tried to create a quick illustration, though it's not easy:

Input:

 Item   | Price | Date          
------- | ----- | --------------
1       | 10    | 2014 01:01:01 
2       | 20    | 2014 01:01:02 
1       | 20    | 2014 01:21:00 
1       | 20    | 2014 01:31:01 

Output:

 Item | Date          | Price | Mean1hr | Mean4hr | Mean24hr | Sum1hr | Sum4hr | Sum24hr | SimilarMean1hr | SimilarMean4hr | SimilarMean24hr
------|---------------|-------|---------|---------|----------|--------|--------|---------|----------------|----------------|----------------
 1    | 2014 01:21:00 | 15    | 8       | 3       | 30       | 30     | 35     | 16      | 14             | 10             |

Thanks a lot, Xavier

    well this is a pretty broad question. you'd have to show some samples and what you expect. rolling stats are not on-line (e.g. incremental). – Jeff Jul 07 '14 at 15:19
  • I think you provided the answer: if rolling statistics can't be incremental, I doubt I can achieve similar performance, because each statistic is then computed over all the values (i.e. in my case, for a one-month window, nearly 3000 values each time). – doxav Jul 07 '14 at 15:56
  • ok, well pandas is pretty fast as all of this code is in cython. so you should give it a try. (my 2c: the syntax is so powerful and so general that even if it's slower in some cases, the benefit is well worth it). – Jeff Jul 07 '14 at 15:59
  • then do you have any hint or example for questions 2 and 3? I tried to add data samples – doxav Jul 07 '14 at 16:04
  • http://pandas.pydata.org/pandas-docs/stable/timeseries.html#up-and-downsampling, lots to read in the docs – Jeff Jul 07 '14 at 16:07
  • thanks @Jeff. Apparently there are no incremental statistics, and the rolling_* stats do not accept a time-interval parameter (there is a custom solution http://stackoverflow.com/questions/15771472/pandas-rolling-mean-by-time-interval but it is not incremental). My feed is ordered in time, so the solution is to extract, for each new row, a DataFrame for each latest time interval (i.e. the last 1 hr, 4 hrs, 1 day, 1 week) and compute the expected statistics on each of those DataFrames. – doxav Jul 09 '14 at 22:01
