2

I've created a complex process that uses several CTEs (mainly for recursive hierarchical work).
On small sample sets of data everything goes as expected but when I apply the code to large sets of data I received unexpected (and wrong) results.

I think I've narrowed it done to the CTEs. The recursive CTEs are "fed" data processed in several earlier CTEs, and that seems to be the problem.

I set up a sample data set as follows:

  • Four rows with unique data
  • Each row receives a random row number (this is added in a CTE

I then take the results of the first CTE and perform a self-join in a second CTE.
I expected all rows to join, each to itself. What actually happens is that unequal rows join up.

Can someone offer up an explanation for this behaviour?

Community
  • 1
  • 1
Gidil
  • 4,137
  • 2
  • 34
  • 50
  • Possible duplicate of http://stackoverflow.com/questions/3511353/how-many-times-are-the-results-of-this-common-table-expression-evaluated – RichardTheKiwi Oct 15 '12 at 08:27

1 Answers1

9

There's nothing unexpected about that result, except maybe if you don't understand it.

Each CTE is resolved each and every time it is referenced. Yes, this is why it is possible for a simple CTE on a highly transactional table to return 4 rows in one reference and 5 rows in the next 2 levels down.

On your sample however, it's because of the ORDER BY NEWID(), giving each resolution of the original CTE a different row_number()-ing. Did you think CTEs are stored in memory and cached? On the SQLFiddle site, under your results, there is a "view execution plan" link. It shows 2 distinct, separate scans of the table.

RichardTheKiwi
  • 105,798
  • 26
  • 196
  • 262
  • 2
    Good answer. OP the best way to achieve what you want is to store the results of TEST_CTE in a temporary table. Once you've added the data and stored the row_number ordered by newid(), you can reference it as many times as you want and and you will now that the rows will be joined in a deterministic manner – Tobsey Oct 15 '12 at 08:27
  • 1
    Please can you evidence this part of you answer? `Yes, this is why it is possible for a simple CTE on a highly transactional table to return 4 rows in one reference and 5 rows in the next 2 levels down.` As they're in the same over-all query, isn't there a single implicit transaction around all references? [*I agree that `ROW_NUMBER()` is resolved twice in the OPs example, but I do not understand how the contents of the tables could change mid-query? (Without messing with isolation levels, etc.) I'm not saying you're wrong, just that I'm scared by this news and would love to see evidence.]* – MatBailie Oct 15 '12 at 08:48
  • @Dems - All read committed does is ensure that rows aren't read if they are dirty by aquiring an `S` lock on it then releasing it as soon as it is read. You would need repeatable read to keep hold of the lock till the end of the statement or serializable to lock the range and prevent the possibility of a new row being inserted that would match the predicate. Under either snapshot isolation you would also see a consistent view of the data (but that wouldn't help the OP's newid() issue) – Martin Smith Oct 15 '12 at 16:07