Short Version
Welford's Online Algorithm lets you keep a running value for variance - meaning you don't have to keep all the values (e.g. in a memory constraned system).
Is there something similar for Interquartile Range (IQR)? An online algorithm that lets me know the middle 50% range without having to keep all historical values?
Long Version
Keeping a running average of data, where you are memory constrainted, is pretty easy:
Double sum
Int64 count
And from this you can compute the mean:
mean = sum / count
This allows hours, or years, of observations to quietly be collected, but only take up 16-bytes.
Welford's Algorithm for Variance
Normally when you want the variance (or standard deviation), you have to have all your readings, because you have to compute reading-mean
for all previous readings:
Double sumOfSquaredError = 0;
foreach (Double reading in Readings)
sumOfSquaredError += Math.Square(reading - mean);
Double variance = sumOfSquaredError / count
Which can add up to terrabytes of data over time, rather than taking up something like 16 bytes
.
Which is why it was nice when Welford came up with an online algorithm for computing variance of a stream of readings:
It is often useful to be able to compute the variance in a single pass, inspecting each value xi only once; for example, when the data is being collected without enough storage to keep all the values, or when costs of memory access dominate those of computation.
The algorithm for adding a new value to the running variance is:
void addValue(Double newValue) {
Double oldMean = sum / count;
sum += newValue;
count += 1;
Double newMean = sum / count;
if (count > 1)
variance = ((count-2)*variance + (newValue-oldMean)*(newValue-newMean)) / (count-1);
else
variance = 0;
}
How about an online algorithm for Interquartile Range (IQR)?
Interquartile Range (IRQ) is another method of getting the spread of data. It tells you how wide the middle 50% of the data is:
And from that people then generally draw a IQR BoxPlot:
Or at the very least, have the values Q1
and Q3
.
Is there a way to calculate the Interquartile Range without having to keep all the recorded values?
In other words:
Is there something like Welford's online variance algorithm, but for Interquartile Range?
Knuth, Seminumerical Algorithms
You can find Welford's algorithm explained in Knuth's 2nd volume Seminumerical Algorithms:
(just in case anyone thought this isn't computer science or programming related)
Research Effort
- Stackoverflow: Simple algorithm for online outlier detection of a generic time series
- Stats: Simple algorithm for online outlier detection of a generic time series
- Online outlier detection for data streams (IDEAS '11: Proceedings of the 15th Symposium on International Database Engineering & Applications, September 2011, Pages 88–96)
- Stats: Robust outlier detection in financial timeseries
- Stats: Online outlier detection
- Distance-based outlier detection in data streams (Proceedings of the VLDB Endowment, Volume 9, Issue 12, August 2016, pp 1089–1100) pdf
- Online Outlier Detection Over Data Streams (Hongyin Cui, Masters Thesis, 2005)