1

Tl;DR

I get different results when using a CTE as compared to using temp tables. Why?

Problem

Suppose I have the following data structures:

#numbers

A simple number table from 1 to 10.

DROP TABLE IF EXISTS #numbers;
WITH rcte(n) AS (
  SELECT 1 AS n

   UNION ALL

  SELECT n + 1 AS n
    FROM rcte
   WHERE n < 10
)
SELECT n
  INTO #numbers
  FROM rcte;

SELECT TOP 3 n
  FROM #numbers;
n
1
2
3

#max

I can now use #numbers to create a table with 3 rows and columns id_col and max_nr, a random number between 2 and 5:

DROP TABLE IF EXISTS #max;
DECLARE @min TINYINT = 2
        , @max TINYINT = 5;
SELECT CONCAT(N'id_', n) AS id_col
       , ABS(CHECKSUM(NEWID())) % (@max - @min + 1) + @min AS max_nr
  INTO #max
  FROM #numbers
 WHERE n <= 3;

SELECT *
  FROM #max; --- max_nr will of course look different each run
id_col max_nr
id_1 5
id_2 2
id_3 3

#result

Now I can easily generate the table I need, namely a table where for each id_col I have each number up to max_nr

DROP TABLE IF EXISTS #result;

SELECT id_col
       , n
  FROM #max
 INNER JOIN #numbers
    ON n <= max_nr;
id_col n
id_1 1
id_1 2
id_1 3
id_1 4
id_1 5
id_2 1
id_2 2
id_3 1
id_3 2
id_3 3

CTE

If I try to do the same with A CTE I get sometimes (depends on the random number) strange results. My guess is that the random part is recalculated for each row in the join leading to different results, if this is the case, is the only solution to use a temp table or any chance to solve that with a CTE?

N.B. Re run the following code a couple of times and you will eventually see that there are "holes" i.e. missing numbers:

WITH cte_max (id_col, max_nr) AS (
  SELECT CONCAT(N'id_', n) AS id_col
         , ABS(CHECKSUM(NEWID())) % (@max - @min + 1) + @min AS max_nr
    FROM #numbers
   WHERE n <= 3
)
SELECT id_col
       , n
  FROM cte_max
 INNER JOIN #numbers
    ON n <= max_nr;

For instance I got this result in one run:

id_col n comment
id_1 1
id_1 2
id_1 4 3 missing!!
id_1 5
id_2 1
id_2 2
id_2 3
id_3 1
id_3 2

You see that the three is missing. This won't happen for all the runs (sometimes all numbers are in order, but that they are missing from time to time, I clearly see that I am overlooking something)

Question

Can somebody explain to me, why this behaviour is to be expected? Has it something to do how SQL Server parallelizes its tasks and in the inner join the random number is yet not materialized and gets re-calculated for each join result? If so, is there a work around beside using temp tables?

Dale K
  • 25,246
  • 15
  • 42
  • 71
thothal
  • 16,690
  • 3
  • 36
  • 71
  • 4
    *"My guess is that the random part is recalculated for each row"* That is exactly how `NEWID` works. `WITH CTE AS( SELECT NEWID() AS ID) SELECT ID FROM CTE CROSS JOIN (VALUES(1),(2),(3),(4),(5))V(I);` will give you *five* different `ID` values, not 1. In this case, the value of `NEWID` isn't determined until the *final* `SELECT` statement. – Thom A Aug 23 '23 at 07:53
  • 2
    `RAND()` behaves differently however and is the same for all rows - so maybe you could use that instead though I haven't looked at detail in what you are doing – Martin Smith Aug 23 '23 at 08:09
  • @Thom ahh, thanks for the explanation., this makes sense and explains the *WHY*. Do you have an idea of how to generate a random integer then? `RAND(n)` would work, but it is deterministic? In any case this seems to be new question, if you don't mind to put your comment into an answer, I'd happily accept it. Thanks! – thothal Aug 23 '23 at 08:25
  • @Martin Yes I was toying with the same idea and tried `RAND(n)` this would give different results per `id_col` but is deterministic in a sense that it always returns the same "random" number then. – thothal Aug 23 '23 at 08:28

1 Answers1

0

RAND() does a single random and we can rely on its value inside a WITH:

create table numbers(nr int);
insert into numbers(nr)
values
(1),
(2),
(3),
(4),
(5),
(6),
(7),
(8),
(9),
(10);

create table other_numbers(nr int);
insert into other_numbers(nr)
values
(1),
(2),
(3),
(4),
(5),
(6),
(7),
(8),
(9),
(10);
WITH rand_num (nr, randomized) AS (
  SELECT nr,
         CEILING(10 * RAND()) AS random
    FROM numbers
)
SELECT rand_num.nr AS nr1, other_numbers.nr AS nr2, rand_num.randomized
FROM rand_num
JOIN other_numbers
ON rand_num.nr = other_numbers.nr;

enter image description here

Fiddle: http://sqlfiddle.com/#!18/8e80d6/6

EDIT

You can also apply NEWID for groups, like:

WITH rand_num (nr, randomized) AS (
  SELECT nr % 3,
         MAX(NEWID()) AS random
    FROM numbers
    GROUP BY nr % 3
)
SELECT rand_num.nr AS nr1, other_numbers.nr AS nr2, rand_num.randomized
FROM rand_num
JOIN other_numbers
ON rand_num.nr = other_numbers.nr % 3;

enter image description here

Fiddle: http://sqlfiddle.com/#!18/8e80d6/19

Lajos Arpad
  • 64,414
  • 37
  • 100
  • 175
  • 1
    Well, the problem is not to generate a single random number, but to generate a random number per group which is later stable. After reading a lot of articles, the safest and cleanest bet is to materialize first the random numbers and then to reuse them like in my workg example. – thothal Aug 23 '23 at 08:50
  • @thothal thanks for your comment. I have edited my answer with generating a random number per group. Here I have applied `MAX`, but you can apply the aggregation that fits your needs most. – Lajos Arpad Aug 23 '23 at 10:15