I am new to pandas and would like to know how to best use time bounded sliding windows and rolling statistics calculation. I process a continuous stream with different rolling data calculations (weighted average, mean, sum, max, oldest...) within different time windows (1hr, 4hrs, 1day, 1week...) also grouped by different item ID.
An output stream is produced for each item with its own rolling statistics but also with statistics from similar items (data are linked on closest variable time spans and previously identified similar items).
I currently created a custom code without using pandas where the huge speed improvement is due to: calculation of rolling statistics using differential calculation only (ie. compute difference on new data and data discarded from sliding window), linking of the variable timespan of similar items as it happens in the stream. I would like to switch to pandas but would like to be sure of expected performance.
Is there a way to achieve similar (or better) performance with pandas ? Then:
does pandas calculate each rolling statistics on all the sliding window's values or does it do differential calculation on new/old values ? Then, when creating "custom functions" for rolling statistics could I also do differential calculation to avoid the huge cost of re-processing all the values ?
what is the most efficient way to declare multiple rolling statistics on several time windows ? If I also want to group this by each item I assume I should just add something like "my_stream.groupby(item_key)", would it be still efficient ?
output: for each item, I output its own rolling statistics and statistics from similar items but timespan are variable (from 10mn to 40mn). How could I link each item row from the other item only with the closest "older" timestamp (I mean: if time is 02:00 for Item 1, and Item 2 has data at 02:01 and 01:50, I should link with data from 01:50) ? will it highly impact performance ?
I tried to create a quick illustration but not very easy :
Input:
Item | Price | Date
------- | ----- | --------------
1 | 10 | 2014 01:01:01
2 | 20 | 2014 01:01:02
1 | 20 | 2014 01:21:00
1 | 20 | 2014 01:31:01
Output:
Item | Date | Price | Mean1hr | Mean4hr | Mean24hr | Sum1hr | Sum4hr | Sum24hr | SimilarMean1hr | SimilarMean4hr | Similar24hr |
-------|------|--------|-------|-------------|-----------|-------|--------|-------|----------|----------|--------|
1 | 2014 01:21:00 | 15 | 8 | 3 | 30 | 30 | 35 | 16 | 14 | 10 |
Thanks a lot, Xavier