How to calculate a pooled standard deviation in SQL?

Question

I am collecting page load times and storing them in a SQL table. Every minute, I want to "roll up" those stats, storing the number of hits, the average load time, and the standard deviation of the load time.

After a while, I want to "roll up" those ten-minute intervals by hour, preserving total hits, average page load time, and "pooled" standard deviation.

Here is the critical part of a test script I cooked up:

select
-- Pooled standard deviation
case when sum(NumHits) = 1
then null -- stddev is undefined when sample size = 1
else
    sqrt(
        (
            sum(
                (NumHits - 1) * square(StdDev) + NumHits * square(cast(AvgLoadTime as int))
            ) - 
            sum(NumHits) * square(sum(cast(NumHits as int) * AvgLoadTime) / sum(NumHits))
        ) / -- numerator
        (
            sum(NumHits) - 1
        ) -- denominator
    ) 
end 'StdDev'

You can also see a SQL Fiddle of the entire test script.

The pooled standard deviation math is based on this equation from Wikipedia:

My questions are:

What is the best way to calculate pooled standard deviation in SQL?
Will it kill performance?

Update:

Added a case statement to handle cases where the sum of all sample sizes = 1. In that case, standard deviation is undefined.

Pending what software you are working with, I'd default to R for things like this. https://stackoverflow.com/questions/16974389/how-to-calculate-a-pooled-standard-deviation-in-r — Twelfth, Nov 07 '17 at 19:45
@Twelfth I'm in a purely-SQL system right now, but I'll look into SQL -> R interfaces. — crenshaw-dev, Nov 07 '17 at 19:54
@mac9416 - Pursue the R route here, it handles stats far better and you appear to be in a recent enough version of MSSQL to do so. SQL can handle this, it's much simpler and (I believe) better optimized in R — Twelfth, Nov 07 '17 at 21:29
@MichaelCrenshaw; regarding performance; your function smalldatetimefromparts, is not considered "best practice". You could try dateadd(hour, DATEDIFF(hour, 0, TimeCreated), 0) instead (or cast(DATEADD(hour, DATEDIFF(hour, 0, TimeCreated), 0) as smalldatetime) if you must have a smalldatetime) — Henrik Staun Poulsen, Jan 24 '20 at 12:49

How to calculate a pooled standard deviation in SQL?

0 Answers0