I am collecting page load times and storing them in a SQL table. Every minute, I want to "roll up" those stats, storing the number of hits, the average load time, and the standard deviation of the load time.
After a while, I want to "roll up" those ten-minute intervals by hour, preserving total hits, average page load time, and "pooled" standard deviation.
Here is the critical part of a test script I cooked up:
select
-- Pooled standard deviation
case when sum(NumHits) = 1
then null -- stddev is undefined when sample size = 1
else
sqrt(
(
sum(
(NumHits - 1) * square(StdDev) + NumHits * square(cast(AvgLoadTime as int))
) -
sum(NumHits) * square(sum(cast(NumHits as int) * AvgLoadTime) / sum(NumHits))
) / -- numerator
(
sum(NumHits) - 1
) -- denominator
)
end 'StdDev'
You can also see a SQL Fiddle of the entire test script.
The pooled standard deviation math is based on this equation from Wikipedia:
My questions are:
- What is the best way to calculate pooled standard deviation in SQL?
- Will it kill performance?
Update:
Added a case
statement to handle cases where the sum of all sample sizes = 1. In that case, standard deviation is undefined.