I have a few million datapoints, each with a time and a value. I'm interested in finding all of the sliding windows (i.e., chunks of 4000 datapoints) where the window's high-to-low range exceeds a constant threshold.
For example, assume a window of length 3 and a threshold where high - low > 3. Then the series [10 12 14 13 10 11 16 14 17] would result in [0, 2, 4, 5], because those are the starting indexes of the 3-period windows whose high - low range exceeded the threshold.
I have a window size of 4000 and a dataset size of millions.
The naive approach is to calculate the range of every possible window, i.e., 1-4000, 2-4001, 3-4002, etc., and accumulate the indexes of the windows that breached the threshold. As you might imagine, this takes forever on large datasets.
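To make that concrete, here's a minimal Python sketch of the naive version (the function name is mine, just for illustration):

```python
# Naive version: rescan every window. O(n * window) time.
def windows_over_threshold_naive(values, window, threshold):
    hits = []
    for start in range(len(values) - window + 1):
        chunk = values[start:start + window]
        if max(chunk) - min(chunk) > threshold:
            hits.append(start)
    return hits

# The example from above: window of 3, threshold of 3.
print(windows_over_threshold_naive([10, 12, 14, 13, 10, 11, 16, 14, 17], 3, 3))
# -> [0, 2, 4, 5]
```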
So, the algorithm I think would be better is the following:
Calculate the range of the first window (1-4000) and store the indexes of that window's high and low. Then iterate to 2-4001, 3-4002, etc., only updating the cached high/low index when the NEW value entering on the far right of the window is higher/lower than the old cached value.
Now, let's say the high/low indexes of the 1-4000 window are 333 and 666 respectively. I iterate and keep updating new highs/lows as I see them on the right, but as soon as the window reaches 334-4333 (i.e., as soon as a cached extreme, here the high at index 333, falls outside the current window) I recalculate the high/low for the current window (334-4333), cache the new indexes, and continue iterating.
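Here's a rough Python sketch of that algorithm as I've described it (again, the names are just for illustration):

```python
def windows_over_threshold_cached(values, window, threshold):
    hits = []
    # Indexes of the high/low of the initial window (0 .. window-1).
    hi = max(range(window), key=values.__getitem__)
    lo = min(range(window), key=values.__getitem__)
    if values[hi] - values[lo] > threshold:
        hits.append(0)
    for start in range(1, len(values) - window + 1):
        end = start + window - 1            # index of the new rightmost value
        if hi < start:                      # cached high slid out: rescan window
            hi = max(range(start, end + 1), key=values.__getitem__)
        elif values[end] >= values[hi]:     # new right value is a fresh high
            hi = end
        if lo < start:                      # cached low slid out: rescan window
            lo = min(range(start, end + 1), key=values.__getitem__)
        elif values[end] <= values[lo]:     # new right value is a fresh low
            lo = end
        if values[hi] - values[lo] > threshold:
            hits.append(start)
    return hits
```

On the example series this also returns [0, 2, 4, 5].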
My questions are:
1.) Is there a mathematical formula for this that eliminates the need for an algorithm at all? I know there are formulas for weighted and exponential moving averages over a window period that don't require recalculating the whole window.
2.) Is my algorithm sensible? Accurate? Is there a way it could be greatly simplified or improved?
Thanks a lot.