1

If I have a CTE definition that uses row_number() ordered by a non-unique column, and I reference that CTE twice in my query, is the row_number() value for each row guaranteed to be the same for both references to the CTE?

Example 1:

with tab as (
    select 1 as id, 'john' as name
    union 
    select 2, 'john'
    union 
    select 3, 'brian'
),
ordered1 as (
    select ROW_NUMBER() over (order by name) as rown, id, name
    from tab
)
select o1.rown, o1.id, o1.name, o1.id - o2.id as id_diff
from ordered1 o1
join ordered1 o2 on o2.rown = o1.rown

Output:

+------+----+-------+---------+
| rown | id | name  | id_diff |
+------+----+-------+---------+
|    1 |  3 | brian |       0 |
|    2 |  1 | john  |       0 |
|    3 |  2 | john  |       0 |
+------+----+-------+---------+

Is it guaranteed that id_diff = 0 for all rows?

Example 2:

with tab as (
    select 1 as id, 'john' as name
    union 
    select 2, 'john'
    union 
    select 3, 'brian'
),
ordered1 as (
    select ROW_NUMBER() over (order by name) as rown, id, name
    from tab
),
ordered2 as (
    select ROW_NUMBER() over (order by name) as rown, id, name
    from tab
)
select o1.rown, o1.id, o1.name, o1.id - o2.id as id_diff
from ordered1 o1
join ordered2 o2 on o2.rown = o1.rown

Same output as above when I ran it, but that doesn't prove anything.

Now that I am joining two queries ordered1 and ordered2, can any guarantee be made about the value of id_diff = 0 in the result?

Example queries on http://rextester.com/AQDXP74920

I suspect that there is no guarantee in either case. If there is no such guarantee, then all CTEs using row_number() should always order by a unique combination of columns if the CTE may be referenced more than once in the query.

I have never heard this advice before, and would like some expert opinion.

John Rees
  • 1,553
  • 17
  • 24
  • I was tempted to flag this as a dup of http://stackoverflow.com/questions/18613055/is-order-by-and-row-number-deterministic but wasn't sure if that covered what you are asking. Also, this MSDN has a list of deterministic and non-deterministic functions that might be of help to you https://msdn.microsoft.com/en-us/library/ms178091.aspx – Liesel May 06 '16 at 10:05
  • as long as `order by name` is a _stable_ sort, yes the numbers will be the same. If `name` contains duplicate values, the DBMS is free to choose any order it likes (for rows with the same value). If you want the numbers to be the same in both CTEs you have to use an `order by` expression that is unique in order to get a stable sort. –  May 06 '16 at 10:08
  • @LesH Thanks Les. Similar question I agree, but I think it is worth clarifying this behavior within one query using CTEs. I have been guilty of falling into this trap using CTEs, even though I have known for a long time that two separate queries would have no order guarantees. – John Rees May 06 '16 at 10:17

2 Answers2

3

No, there is no guarantee that ROW_NUMBER on a non-unique sort list returns the same sequence when a CTE is referenced multiple times. It is very likely to happen, but not guranteed, as the CTE is merely a view.

So always make the sort list unique in such a case, e.g. order by name, id.

Thorsten Kettner
  • 89,309
  • 7
  • 49
  • 73
  • Yes, I think the CTE is a red herring here - it's always true that if you underspecify the `ORDER BY` clause, the server is free to perform processing in any order that complies with *only* your stated requirements. – Damien_The_Unbeliever May 06 '16 at 10:08
  • @Damien_The_Unbeliever The reason I put it in the context of CTEs is that stable order is particularly important within the execution of one query, while it is more obvious that two separate executions of an underspecified order will have no guarantee. – John Rees May 06 '16 at 10:12
  • The CTE is *not* "merely a view". And: since the CTE is guaranteed is guaranteed to be executed only (and exactly) once, there will only be only one ordering, even if it is random. – wildplasser May 06 '16 at 10:17
  • @John Rees: First of all (in case this wasn't clear from my answer): both your queries are the same, because the CTEs specify the same view. Secondly, this has nothing to do with *stable sort*, as you come from unsorted (the table) to sorted (but with an ambiguous order). A stable sort would apply to something sorted to something re-sorted. – Thorsten Kettner May 06 '16 at 10:17
  • 1
    @wildplasser: I doubt that. That would limit the optimizer from finding the optimal execution plan. Where have you read this? – Thorsten Kettner May 06 '16 at 10:20
  • Yes, this is exactly the reason why CTE's (sometimes) execute so slow: the optimiser is not allowed to mess with them (take them apart, reshuffle the terms) – wildplasser May 06 '16 at 10:22
  • The reason I included two examples was mostly so that answers could explain that they are in fact the same! I suspected the answer to my question was exactly what you have said, and I wanted to have some expert confirmation. So thank you. – John Rees May 06 '16 at 10:23
  • @wildplasser - are you [sure?](http://www.sqlservercentral.com/blogs/juggling_with_sql/2011/12/15/how-many-times-the-t-sql-inside-the-cte-is-executed/) "... many users of CTE still have a myth that the t-sql written inside the CTE will be executed only once irrespective of the number of times it is referred in the subsequent CTEs or the related query. " – Damien_The_Unbeliever May 06 '16 at 10:26
  • http://www.postgresql.org/docs/8.4/static/queries-with.html (final paragraph) I tried to get my hands on a SQL-99 standard text but could not find one. – wildplasser May 06 '16 at 10:28
  • @ThorstenKettner Yes, I think I was misusing the term stable sort. Deterministic didn't seem to be the right word, since deterministic only really has meaning when comparing two executions of the same thing, and my question is all about whether two references to a CTE are considered "one thing" or "two things". I am no longer in any doubt! – John Rees May 06 '16 at 10:37
  • @wildplasser: Ah, that's interesting. So in PostgreSQL CTEs *are* guaranteed to be run just once per query. I looked up the Oracle docs and they don't say whether it's the same for Oracle. However, I tried something in Oracle and it showed the optimizer got a worse execution plan for a CTE query than for it's non-CTE equivalent. I also tried to find something in the SQL2003 docs, but must admit I cannot read it. This is some text for lawyers and such, it seems :-) – Thorsten Kettner May 06 '16 at 11:54
  • Conclusion: I still don't know whether the SQL standard dictates a CTE to be execute only once per query. I still doubt it, but PostgreSQL does guarantee it and Oracle may at least have implemented as such. Well, I'd still make sure to have the order by clause unambiguous in order to be on the safe side. – Thorsten Kettner May 06 '16 at 11:58
  • @ThorstenKettner Correction: they are guaranteed to be executed exactly once *only if they are referenced* . (one obvious reason for this is functions with side effects, which are evil of course) I could not find any sql-99 standard text on the web. Semantics may differ, see Gordon Linoff's answer. – wildplasser May 06 '16 at 12:27
  • @wildplasser So given the PostgreSQL promise to run each CTE only once, that would imply that my example 1 would always have id_diff=0, but example 2 would have no such guarantee. Interesting. – John Rees May 06 '16 at 14:37
  • @JohnRees Yes, exactly. Example 2 yields no diffs. in postgres. Even a version with a random() expression (instead of the nondeterministic ORDER BY) in the CTE yields no diffs. – wildplasser May 06 '16 at 15:19
3

The answer that Thorsten gave is correct, I just want to add some more details.

Users of SQL Server often think of CTEs as "temporary tables" or "derived tables. However, they are nothing of the sort. Although some databases do materialize CTEs (at least some of the time), SQL Server never materializes CTEs.

In fact, what happens, is that the CTE logic is inserted into the query -- just as if "replace(, )" were used on the query. This affects non-unique sorting keys. It also affects some non-deterministic functions, such as NEWID().

The advice in your case is simple: Whenever you use order by, include a unique key as the last order by key. You should do this whether order by is used in a window function or for a query. It is just a safe habit to get used to.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • Thanks Gordon. It's good to have it reinforced that EVERY reference to a CTE should be considered to be something like a copy/paste into the main query. Even when my brain is trying to convince me that "surely" this is executed only once :-) – John Rees May 06 '16 at 11:04
  • BTW, do you think my question would be better if I deleted Example 2? I originally thought it was an interesting comparison, but I now think it may be distraction from the real question. Good or bad etiquette? – John Rees May 06 '16 at 11:08
  • @johnRees . . . Hmmm. I'm not sure that Example 2 adds much to the question. Your have a pretty simple question that doesn't need a plethora of examples. – Gordon Linoff May 07 '16 at 02:10