2

I am collecting page load times and storing them in a SQL table. Every minute, I want to "roll up" those stats, storing the number of hits, the average load time, and the standard deviation of the load time.

After a while, I want to "roll up" those ten-minute intervals by hour, preserving total hits, average page load time, and "pooled" standard deviation.

Here is the critical part of a test script I cooked up:

select
-- Pooled standard deviation
case when sum(NumHits) = 1
then null -- stddev is undefined when sample size = 1
else
    sqrt(
        (
            sum(
                (NumHits - 1) * square(StdDev) + NumHits * square(cast(AvgLoadTime as int))
            ) - 
            sum(NumHits) * square(sum(cast(NumHits as int) * AvgLoadTime) / sum(NumHits))
        ) / -- numerator
        (
            sum(NumHits) - 1
        ) -- denominator
    ) 
end 'StdDev'

You can also see a SQL Fiddle of the entire test script.

The pooled standard deviation math is based on this equation from Wikipedia:

Pooled standard deviation equation

My questions are:

  1. What is the best way to calculate pooled standard deviation in SQL?
  2. Will it kill performance?

Update:

Added a case statement to handle cases where the sum of all sample sizes = 1. In that case, standard deviation is undefined.

crenshaw-dev
  • 7,504
  • 3
  • 45
  • 81
  • Which version of SQL Server? – gotqn Nov 07 '17 at 19:40
  • Pending what software you are working with, I'd default to R for things like this. https://stackoverflow.com/questions/16974389/how-to-calculate-a-pooled-standard-deviation-in-r – Twelfth Nov 07 '17 at 19:45
  • @gotqn It's SQL Server 2016. – crenshaw-dev Nov 07 '17 at 19:53
  • @Twelfth I'm in a purely-SQL system right now, but I'll look into SQL -> R interfaces. – crenshaw-dev Nov 07 '17 at 19:54
  • 2
    @mac9416 You need SQL Server 2016 SP1 for using R. – gotqn Nov 07 '17 at 20:13
  • 1
    @mac9416 - Pursue the R route here, it handles stats far better and you appear to be in a recent enough version of MSSQL to do so. SQL can handle this, it's much simpler and (I believe) better optimized in R – Twelfth Nov 07 '17 at 21:29
  • 1
    @MichaelCrenshaw; regarding performance; your function smalldatetimefromparts, is not considered "best practice". You could try dateadd(hour, DATEDIFF(hour, 0, TimeCreated), 0) instead (or cast(DATEADD(hour, DATEDIFF(hour, 0, TimeCreated), 0) as smalldatetime) if you must have a smalldatetime) – Henrik Staun Poulsen Jan 24 '20 at 12:49

0 Answers0