
I'm looking for a fast way to create cumulative totals in a large SQL Server 2008 data set, partitioned by a particular column, potentially using a multiple assignment variable solution. As a very basic example, I'd like to create the "cumulative_total" column below:

user_id | month | total | cumulative_total
--------|-------|-------|-----------------
1       | 1     | 2.0   | 2.0
1       | 2     | 1.0   | 3.0
1       | 3     | 3.5   | 8.5
2       | 1     | 0.5   | 0.5
2       | 2     | 1.5   | 2.0
2       | 3     | 2.0   | 4.0

We have traditionally done this with correlated subqueries, but over large amounts of data (200,000+ rows and several different categories of running total) this isn't giving us ideal performance.
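
For reference, our correlated subquery version looks something like this (simplified to match the example above):

SELECT [user_id], [month], total,
  cumulative_total =
  (
    SELECT SUM(t2.total)
      FROM dbo.my_table AS t2
      WHERE t2.[user_id] = t.[user_id]
        AND t2.[month] <= t.[month]
  )
FROM dbo.my_table AS t
ORDER BY [user_id], [month];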

I recently read about using multiple assignment variables for cumulative summing here:

http://sqlblog.com/blogs/paul_nielsen/archive/2007/12/06/cumulative-totals-screencast.aspx

In that blog post, the multiple assignment variable solution looks like this:

DECLARE @CumulativeTotal DECIMAL(10,1) = 0; -- the variable needs declaring/initialising first

UPDATE my_table
SET @CumulativeTotal = cumulative_total = @CumulativeTotal + ISNULL(total, 0);

This approach seems brilliantly fast for summing a single user's rows in the example above (user 1 or user 2). However, I need it to effectively partition by user: give me the cumulative total by user, by month.

Does anyone know of a way to extend the multiple assignment variable concept to solve this, or of any other approach besides correlated subqueries or cursors?

Many thanks for any tips.

Andy W
2 Answers


If you don't need to STORE the data (which you shouldn't, because you need to update the running totals any time any row is changed, added or deleted), and if you don't trust the quirky update (which you shouldn't, because it isn't guaranteed to work and its behavior could change with a hotfix, service pack, upgrade, or even an underlying index or statistics change), you can try this type of query at runtime. This is a method fellow MVP Hugo Kornelis coined "set-based iteration" (he posted something similar in one of his chapters of SQL Server MVP Deep Dives).

Running totals typically require a cursor over the entire set, a quirky update over the entire set, or a single non-linear self-join that becomes more and more expensive as the row count increases. The trick here is to loop over some finite element of the set instead: in this case, the "rank" of each row in terms of month, for each user. Each rank is processed only once, for all user/month combinations at that rank, so instead of looping through 200,000 rows, you loop up to 24 times.

-- stage the data in a table variable, seeding each row's running
-- total with its own total, and ranking the months per user
DECLARE @t TABLE
(
  [user_id] INT, 
  [month] TINYINT,
  total DECIMAL(10,1), 
  RunningTotal DECIMAL(10,1), 
  Rnk INT
);

INSERT @t SELECT [user_id], [month], total, total, 
  RANK() OVER (PARTITION BY [user_id] ORDER BY [month]) 
  FROM dbo.my_table;

DECLARE @rnk INT = 1, @rc INT = 1;

-- one iteration per rank: add the previous rank's running total
-- to every row at the current rank, across all users at once
WHILE @rc > 0
BEGIN
  SET @rnk += 1;

  UPDATE c SET RunningTotal = p.RunningTotal + c.total
    FROM @t AS c INNER JOIN @t AS p
    ON c.[user_id] = p.[user_id]
    AND p.rnk = @rnk - 1
    AND c.rnk = @rnk;

  SET @rc = @@ROWCOUNT; -- stop once a rank updates no rows
END

SELECT [user_id], [month], total, RunningTotal
FROM @t
ORDER BY [user_id], rnk;

Results:

user_id  month   total   RunningTotal
-------  -----   -----   ------------
1        1       2.0     2.0
1        2       1.0     3.0
1        3       3.5     6.5 -- I think your calculation is off
2        1       0.5     0.5
2        2       1.5     2.0
2        3       2.0     4.0

Of course you can update the base table from this table variable, but why bother, since those stored values are only good until the next time the table is touched by any DML statement?

UPDATE mt
  SET cumulative_total = t.RunningTotal
  FROM dbo.my_table AS mt
  INNER JOIN @t AS t
  ON mt.[user_id] = t.[user_id]
  AND mt.[month] = t.[month];

Since we're not relying on implicit ordering of any kind, this is 100% supported and deserves a performance comparison relative to the unsupported quirky update. Even if it doesn't beat the quirky update but comes close, you should consider using it anyway, IMHO.
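
One tweak worth testing (an assumption on my part, not something I've measured here): give the table variable a primary key on ([user_id], Rnk), so the self-join inside the loop can seek rather than scan:

DECLARE @t TABLE
(
  [user_id] INT, 
  [month] TINYINT,
  total DECIMAL(10,1), 
  RunningTotal DECIMAL(10,1), 
  Rnk INT,
  PRIMARY KEY ([user_id], Rnk) -- supports the rank-to-rank self-join
);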

As for the SQL Server 2012 solution: Matt mentions RANGE, but since that frame uses an on-disk spool, you should also test with ROWS instead of just running with RANGE. Here is a quick example for your case:

SELECT
  [user_id],
  [month],
  total,
  RunningTotal = SUM(total) OVER 
  (
    PARTITION BY [user_id] 
    ORDER BY [month] ROWS UNBOUNDED PRECEDING
  )
FROM dbo.my_table
ORDER BY [user_id], [month];

Compare this with RANGE UNBOUNDED PRECEDING, or with no ROWS/RANGE clause at all (which also uses the RANGE on-disk spool). The above will have lower overall duration and far less I/O, even though the plan looks slightly more complex (an additional sequence project operator).
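
For reference, the RANGE version being compared (which is also the frame you get when you specify ORDER BY with no frame clause at all) looks like this:

SELECT
  [user_id],
  [month],
  total,
  RunningTotal = SUM(total) OVER 
  (
    PARTITION BY [user_id] 
    ORDER BY [month] RANGE UNBOUNDED PRECEDING
  )
FROM dbo.my_table
ORDER BY [user_id], [month];

With unique [user_id]/[month] combinations, as in your sample data, the two frames return the same results; with ties on [month], RANGE includes peer rows and the results can differ.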

I've recently published a blog post outlining some performance differences I observed for a specific running totals scenario:

http://www.sqlperformance.com/2012/07/t-sql-queries/running-totals

Aaron Bertrand

Your options in SQL Server 2008 are reasonably limited: you can either do something based on the method above (which is called a 'quirky update'), or you can do something in the CLR.

Personally, I would go with the CLR, because it's guaranteed to work, while the quirky update syntax isn't formally supported (so it might break in future versions).

The variation on quirky update syntax you're looking for would be something like:

DECLARE @CumulativeTotal DECIMAL(10,1) = 0, @user INT = NULL;

UPDATE my_table
SET @CumulativeTotal = cumulative_total = ISNULL(total, 0) + 
        CASE WHEN [user_id] = @user THEN @CumulativeTotal ELSE 0 END, 
    @user = [user_id]; -- reset the running total whenever the user changes
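
Note that, like the original, this relies on the update happening in clustered index order, which is exactly the part that isn't guaranteed. At minimum you'd want a clustered index like the following in place (my assumption; if your table already has a clustered key this doesn't apply as-is):

CREATE UNIQUE CLUSTERED INDEX cix_my_table
  ON dbo.my_table ([user_id], [month]);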

It's worth noting that SQL Server 2012 introduces RANGE support in window functions, so this becomes expressible in the most efficient way while being 100% supported.

Matt Whitfield
  • Excellent, that's perfect - thank you Matt. Concerns about support and the note about RANGE noted and appreciated. – Andy W May 31 '12 at 11:04
  • @Matt as I indicated at the end of my answer, in case you're already using `RANGE` for this, you should test that `ROWS` produces the same results (it doesn't in all cases), since it *should* be more efficient. Also, do you know of any published CLR solutions that are faster than the approaches on this page? I know CLR is fantastic for some things (e.g. splitting a string) but what in particular makes it better at this problem? Slightly faster math calculations? I'd kind of expect the CLR overhead to outweigh any benefits there (but would happily be proven wrong). – Aaron Bertrand May 31 '12 at 12:40
  • @AaronBertrand - Yeah, I should have said RANGE/ROWS explicitly, as I tend to view that whole syntax extension under one roof. The CLR bit I am talking about is doing calculations on a spool, which it's excellent at. I have had very good results doing that sort of thing with the CLR before. It would be an interesting exercise to try and get some good measurements on it. – Matt Whitfield Jun 03 '12 at 20:54
  • @Matt I actually [did some testing on this](http://www.sqlperformance.com/2012/07/t-sql-queries/running-totals) and the CLR approach rivaled the new SQL Server 2012 method (all other methods performed worse). The cursor, though, outperformed even the quirky update. – Aaron Bertrand Jul 24 '12 at 15:20
  • @AaronBertrand - Good work there. Don't get me wrong, I wouldn't recommend the quirky update - just trying to answer the question that was asked. – Matt Whitfield Jul 24 '12 at 15:43