
I have spent a good portion of today and yesterday trying to decide whether to use a loop or cursor in SQL, or to figure out how to use set-based logic to solve the problem. I am not new to set-based logic, but this problem seems to be particularly complex.

The Problem

The idea is that I have a list of all transactions (tens or hundreds of millions of rows) and the date each occurred, and I want to start combining that data into a daily totals table so that it is more rapidly viewable by reporting and analytics systems. The pseudocode for this is as follows:

foreach( row in transactions_table )
    if( row in totals_table already exists )
        update totals_table, add my totals to the totals row
    else
        insert into totals_table with my row as the base values
    delete ( or archive ) row

As you can tell, the body of the loop is relatively trivial to implement, as is the cursor/looping iteration itself. However, the execution time is quite slow and unwieldy, and my question is: is there a non-iterative, set-based way to perform such a task, or is this one of the rare exceptions where I just have to "suck it up" and use a cursor?

There have been a few discussions on the topic, some of which seem similar, but none are usable because of the if/else logic and the operations on another table. For instance:

How to merge rows of SQL data on column-based logic? This question doesn't seem to be applicable because it simply returns a view of all sums and doesn't actually make logical decisions about additions or updates to another table.

SQL Looping seems to have a few ideas about selection with a couple of CASE statements, which seems possible, but there are two operations I need done depending on the state of another table, so this solution does not seem to fit.

SQL Call Stored Procedure for each Row without using a cursor. This solution seems to be the closest to what I need to do, in that it can handle arbitrary numbers of operations on each row, but there doesn't seem to be a consensus in that discussion.

Any advice on how to tackle this frustrating problem?

Notes

I am using SQL Server 2008

The schema setup is as follows:

Totals: (id int pk, totals_date date, store_id int fk, machine_id int fk, total_in, total_out)

Transactions: (transaction_id int pk, transaction_date datetime, store_id int fk, machine_id int fk, transaction_type (IN or OUT), transaction_amount decimal)

The totals should be computed by store, by machine, and by date, and should total all of the IN transactions into total_in and the OUT transactions into total_out. The goal is to get a pseudo data cube going.
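
For reference, a rough CREATE TABLE sketch of the schema described above; the DECIMAL precision, NOT NULL choices, and IDENTITY keys are assumptions on my part, not stated in the post:

-- Sketch only: column types for the totals and the identity keys are assumed.
CREATE TABLE dbo.Totals
(
    id          INT IDENTITY(1,1) PRIMARY KEY,
    totals_date DATE          NOT NULL,
    store_id    INT           NOT NULL,  -- FK to a stores table (not shown)
    machine_id  INT           NOT NULL,  -- FK to a machines table (not shown)
    total_in    DECIMAL(18,2) NOT NULL DEFAULT 0,
    total_out   DECIMAL(18,2) NOT NULL DEFAULT 0
);

CREATE TABLE dbo.Transactions
(
    transaction_id     INT IDENTITY(1,1) PRIMARY KEY,
    transaction_date   DATETIME      NOT NULL,
    store_id           INT           NOT NULL,  -- FK to a stores table (not shown)
    machine_id         INT           NOT NULL,  -- FK to a machines table (not shown)
    transaction_type   VARCHAR(3)    NOT NULL,  -- 'IN' or 'OUT'
    transaction_amount DECIMAL(18,2) NOT NULL
);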

Chuck Russell
  • You are looking for the MERGE statement. It is self-explained in the documentation: http://technet.microsoft.com/en-us/library/bb510625(v=sql.100).aspx – dani herrera Mar 12 '13 at 15:18
  • You might want to check [indexed views](http://msdn.microsoft.com/en-us/library/dd171921%28v=sql.100%29.aspx) and [designing indexed views](http://msdn.microsoft.com/en-us/library/ms187864%28v=sql.105%29.aspx). Indexed view will take care of aggregation of data on your behalf; you provide a definition, and Sql Server updates aggregated image on its own. – Nikola Markovinović Mar 12 '13 at 15:20
  • `Merge` with `output` clause, and then deleting/archiving rows that were processed (could be taken from `output`) should do the job. – a1ex07 Mar 12 '13 at 15:21
  • @AaronBertrand Why not? From the document I had linked: `CREATE VIEW ExampleView WITH SCHEMABINDING AS SELECT GroupKey, SUM(Colx) AS SumColx, COUNT_BIG(Colx) AS CountColx FROM MyTable GROUP BY GroupKey`. – Nikola Markovinović Mar 12 '13 at 15:23
  • @danihp This is a good idea. I have never heard of this operation before. I take it that example C is something most like what I am attempting? – Chuck Russell Mar 12 '13 at 15:23
  • @Nikola Well there are some caveats - you need `COUNT_BIG(*)`, not `COUNT_BIG(Colx)`, and `SUM()` needs to be non-nullable. In general I agree that indexed views can be a good use of aggregation but don't think I've ever seen it used for `SUM()`, only `COUNT_BIG()`. – Aaron Bertrand Mar 12 '13 at 15:27
  • I'd be very hesitant about `MERGE` - please see http://www.sqlperformance.com/2013/02/t-sql-queries/another-merge-bug (scroll to the list of links near the bottom). – Aaron Bertrand Mar 12 '13 at 15:29
  • @Aaron I've used it in master-detail context as performance enhancement. It worked better than summing over covering index. – Nikola Markovinović Mar 12 '13 at 15:49
  • @Nikola not saying it's bad, just that this is quite honestly the first time I've seen it used with `SUM`. I added it to my answer and gave you credit. :-) – Aaron Bertrand Mar 12 '13 at 15:50
  • @AaronBertrand From the article you sent on indexed views: "On the contrary, online transaction processing (OLTP) systems with many writes, or database applications with frequent updates, may not be able to take advantage of indexed views because of the increased maintenance cost associated with updating both the view and underlying base tables." Unfortunately, the transactions table will be quite frequently inserted to. – Chuck Russell Mar 12 '13 at 15:52
  • @ChuckRussell I didn't send that article. But consider this: if you want to calculate SUMs you're going to have to pay the cost of that calculation at some point. You can choose indexed views, and pay a small cost at each insert, or you can do the calculations manually on your own, and pay a cost inversely proportional to how often you do it. Note that with indexed views the calculations are always up to date (well except in the MERGE bug case I link to in my answer), and with manual calculations, they can become quite stale depending on frequency. So the choice is yours, really. – Aaron Bertrand Mar 12 '13 at 15:54
  • @ChuckRussell if you want to know how well something works, you'll have to test it yourself, rather than make decisions based on some vague warning like "may not be able to"... every situation is different and is chock full of inherent "it depends"... – Aaron Bertrand Mar 12 '13 at 15:55

2 Answers


You would do this in two set-based statements:

BEGIN TRANSACTION;

DECLARE @keys TABLE(some_key INT);

UPDATE tot
  SET tot.amount += tx.amount
OUTPUT inserted.some_key -- key values updated
INTO @keys
FROM dbo.totals_table AS tot WITH (UPDLOCK, HOLDLOCK)
INNER JOIN 
(
  SELECT t.some_key, amount = SUM(amount)
  FROM dbo.transactions_table AS t WITH (HOLDLOCK)
  INNER JOIN dbo.totals_table AS tot
  ON t.some_key = tot.some_key
  GROUP BY t.some_key
) AS tx
ON tot.some_key = tx.some_key;

INSERT dbo.totals_table(some_key, amount)
  OUTPUT inserted.some_key INTO @keys
  SELECT some_key, SUM(amount)
  FROM dbo.transactions_table AS tx
  WHERE NOT EXISTS 
  (
    SELECT 1 FROM dbo.totals_table
    WHERE some_key = tx.some_key
  )
  GROUP BY some_key;

DELETE dbo.transactions_table
  WHERE some_key IN (SELECT some_key FROM @keys);

COMMIT TRANSACTION;

(Error handling, applicable isolation level, rollback conditions etc. omitted for brevity.)

You do the update first so you don't insert new rows and then update them, performing work twice and possibly double counting. You could use the OUTPUT clause in both cases to capture keys into a temp table, perhaps, and then archive/delete those rows from the transactions table, as sketched below.
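
If archiving is preferred over deleting outright, the DELETE step inside the transaction could be replaced with something along these lines (a sketch only; the dbo.transactions_archive table is an assumption, not something defined above):

-- Assumes an archive table with the same column layout as dbo.transactions_table.
INSERT dbo.transactions_archive
  SELECT *
  FROM dbo.transactions_table
  WHERE some_key IN (SELECT some_key FROM @keys);

DELETE dbo.transactions_table
  WHERE some_key IN (SELECT some_key FROM @keys);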

I'd caution you to not get too excited about MERGE until they've resolved some of these bugs and you have read enough about it to be sure you're not lulled into any false confidence about how much "better" it is for concurrency and atomicity without additional hints. The race conditions you can work around; the bugs you can't.
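
For comparison, the MERGE-with-OUTPUT pattern suggested in the question comments would look roughly like the sketch below; it uses the same simplified some_key/amount columns as above and is subject to the caveats just mentioned:

DECLARE @keys TABLE(some_key INT);

-- HOLDLOCK guards against the classic MERGE race condition on concurrent upserts.
MERGE dbo.totals_table WITH (HOLDLOCK) AS tot
USING
(
  SELECT some_key, amount = SUM(amount)
  FROM dbo.transactions_table
  GROUP BY some_key
) AS tx
ON tot.some_key = tx.some_key
WHEN MATCHED THEN
  UPDATE SET amount = tot.amount + tx.amount
WHEN NOT MATCHED THEN
  INSERT (some_key, amount) VALUES (tx.some_key, tx.amount)
OUTPUT inserted.some_key INTO @keys;

-- Remove (or archive) the transactions that were rolled into the totals.
DELETE dbo.transactions_table
  WHERE some_key IN (SELECT some_key FROM @keys);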

Another alternative, from Nikola's comment

CREATE VIEW dbo.TotalsView
WITH SCHEMABINDING
AS
   SELECT some_key_column(s), SUM(amount) AS total_amount, COUNT_BIG(*) AS row_count
    FROM dbo.Transaction_Table
    GROUP BY some_key_column(s);
GO
CREATE UNIQUE CLUSTERED INDEX some_key ON dbo.TotalsView(some_key_column(s));
GO

Now if you want to write queries that grab the totals, you can reference the view directly or - depending on query and edition - the view may automatically be matched even if you reference the base table.

Note: if you are not on Enterprise Edition, you may have to use the NOEXPAND hint to take advantage of the pre-aggregated values materialized by the view.
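
A usage sketch, assuming the view name and the column aliases defined above:

-- NOEXPAND forces the optimizer to use the indexed view's materialized data
-- on editions that do not perform automatic view matching.
SELECT some_key_column(s), total_amount
FROM dbo.TotalsView WITH (NOEXPAND);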

Aaron Bertrand
  • This seems like a valid solution at first glance, but the problem I see with it is that a row which doesn't match the criteria to be updated could match the criteria after another row has been inserted. The insert statement doesn't allow the double insert, but it also won't update the rows that have just been inserted, leaving quite a few rows neither updated nor inserted. – Chuck Russell Mar 12 '13 at 15:34
  • @Chuck well I think you need to be more specific about what criteria could be missed by both the update and the insert. – Aaron Bertrand Mar 12 '13 at 15:48
  • Well consider the totals table has the following tuples where the left side is the joining criteria and the right side is the data to collect: {(A, 40), (B,30)} Now consider data to be inserted/updated: {(A,2),(B,4),(C,4),(C,6),(A,10),(D,5)} After the update statement, the totals will be {(A,52),(B,34)} and the remaining tuples to process would be {(C,4),(C,6),(D,5)} After the insert statement, we would see that the totals would be {(A,52),(B,34),(C,4),(D,5)} with the remaining tuples {(C,6)} not processed. Does this clarify? – Chuck Russell Mar 12 '13 at 15:53
  • It seems like you think A is processed first, then B, then C. No. SQL Server processes this in a *SET*. Stop thinking about individual rows getting processed individually. – Aaron Bertrand Mar 12 '13 at 16:04
  • I see. Your solution seems to make sense and is well thought out, so I will mark it as correct. Could you perhaps point me in the right direction to adding multiple joining conditions. For instance, we have been dealing solely in key-value pairs, and in actuality, there will be multiple keys, for instance: date of transaction and location of transaction, etc. How could I modify your answer to utilize that sort of multi-valued key? – Chuck Russell Mar 12 '13 at 16:31
  • Instead of just `ON t.some_key = tot.some_key` you use `ON t.col1 = tot.col1 AND t.col2 = tot.col2` or whatever your grouping set is. If you show *actual* table structure, *actual* sample data and how you want *actual* sums calculated, you might get more specific answers. – Aaron Bertrand Mar 12 '13 at 16:32

I do not think you need the loop.

You can just:

  • Update all rows/sums that match your filters/groups, then archive/delete the rows you processed.
  • Insert all rows that do not match your filters/groups, then archive/delete the rows you processed.

SQL is meant to work on sets of data, not on rows one by one.
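
A minimal sketch of the two statements described above, using simplified some_key/amount columns (illustrative names only, mirroring the accepted answer):

-- 1) Update totals that already exist for the incoming keys.
UPDATE tot
  SET tot.amount += tx.amount
FROM dbo.totals_table AS tot
INNER JOIN
(
  SELECT some_key, amount = SUM(amount)
  FROM dbo.transactions_table
  GROUP BY some_key
) AS tx
ON tot.some_key = tx.some_key;

-- 2) Insert totals for keys that do not exist yet.
INSERT dbo.totals_table(some_key, amount)
  SELECT some_key, SUM(amount)
  FROM dbo.transactions_table AS tx
  WHERE NOT EXISTS (SELECT 1 FROM dbo.totals_table WHERE some_key = tx.some_key)
  GROUP BY some_key;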

Dumitrescu Bogdan
  • This says what to do, but only in the most general sense. It doesn't even describe what method to use in order to do it. More valuable still would be an example/demonstration of how to do it. – TylerH Sep 16 '20 at 22:54