How can I get a random number generated in a CTE not to change in JOIN?

Question

The problem

I'm generating a random number for each row in a table #Table_1 in a CTE, using this technique. I'm then joining the results of the CTE on another table, #Table_2. Instead of getting a random number for each row in #Table_1, I'm getting a new random number for every resulting row in the join!

CREATE TABLE #Table_1 (Id INT)

CREATE TABLE #Table_2 (MyId INT, ParentId INT)

INSERT INTO #Table_1
VALUES (1), (2), (3)

INSERT INTO #Table_2
VALUES (1, 1), (2, 1), (3, 1), (4, 1), (1, 2), (2, 2), (3, 2), (1, 3)


;WITH RandomCTE AS
(
    SELECT Id, (ABS(CHECKSUM(NewId())) % 5)RandomNumber
    FROM #Table_1
)
SELECT r.Id, t.MyId, r.RandomNumber
FROM RandomCTE r
INNER JOIN #Table_2 t
    ON r.Id = t.ParentId

The results

Id          MyId        RandomNumber
----------- ----------- ------------
1           1           1
1           2           2
1           3           0
1           4           3
2           1           4
2           2           0
2           3           0
3           1           3

The desired results

Id          MyId        RandomNumber
----------- ----------- ------------
1           1           1
1           2           1
1           3           1
1           4           1
2           1           4
2           2           4
2           3           4
3           1           3

What I tried

I tried to obscure the logic of the random number generation from the optimizer by casting the random number to VARCHAR, but that did not work.

What I don't want to do

I'd like to avoid using a temporary table to store the results of the CTE.

How can I generate a random number for a table and preserve that random number in a join without using temporary storage?

Your query should return 4 columns, not 3. Two columns from `RandomCTE` and two columns from `Table_2`. — The Impaler, Mar 20 '19 at 20:05
Interesting problem. May I ask why you don't want to use a temp table? I mean, tempdb is used for all sorts of things... and are you trying to force a *different* random number for each ID in table 1? — S3S, Mar 20 '19 at 20:09
@scsimon I solved it with a temp table but I’m looking for a cleaner solution — elizabk, Mar 20 '19 at 20:15
the temp table *is* the cleanest solution. Rather than bending your SQL into strange shapes trying to workaround the optimiser and getting a result not guaranteed to always behave as you want — Martin Smith, Mar 20 '19 at 20:40

Thom A · Answer 1 · 2019-03-20T20:17:38.247

2

This seems to do the trick:

WITH CTE AS(
    SELECT Id, (ABS(CHECKSUM(NewId())) % 5)RandomNumber
    FROM #Table_1),
RandomCTE AS(
    SELECT Id,
           RandomNumber
    FROM CTE
    GROUP BY ID, RandomNumber)
SELECT *
FROM RandomCTE r
INNER JOIN #Table_2 t
    ON r.Id = t.ParentId;

It looks like SQL Server is aware that, at the point of being outside the CTE, that RandomNumber is effectively just NEWID() with some additional functions wrapped around it (DB<>Fiddle), and hence it still generates a unique ID for each row. The GROUP BY clause in the second CTE therefore forces the data engine to define RandomNumber a value so it can perform the GROUP BY.

edited Mar 20 '19 at 20:17

answered Mar 20 '19 at 20:09

Thom A

88,727
11
45
75

interesting though I'm betting they want each ID in table 1 to have it's own unique number, that doesn't overlap other IDs. – S3S Mar 20 '19 at 20:12
I hope not, @scsimon , as stopping the reuse of a random number would be **awfully** messy. And means that the number isn't truly "random" any more. *"pick a number between 1-5"* and *"pick a number between 1-5, but you can't have 1,2,3, or 5 because they've been chosen"* isn't a very random choice at all. I can say for certainty that you'll pick "4". :) – Thom A Mar 20 '19 at 20:14
1

Nope, I’m looking for a truly random pick. The fact that they didn’t overlap in my desired results example was, well, random. – elizabk Mar 20 '19 at 20:19
oh i don't disagree, i'm just basing off that *expected result* where the random number is for each ID. With a temp table, each ID could only have 1 number, and this would make sense for the OP. Looks like you nailed it Larnu. – S3S Mar 20 '19 at 20:20
It will continue to work until in SQL Server 2023 optimiser becomes smart enough to realise that `SELECT Col1, Col2 FROM T GROUP BY Col1, Col2` can be optimised away... – Vladimir Baranov Mar 20 '19 at 23:53
1

or a unique index is added to Col1 with current optimiser. https://dbfiddle.uk/?rdbms=sqlserver_2017&fiddle=c723616c650626f5cfbcfb9bc0dc7350 – Martin Smith Mar 21 '19 at 07:54

score 2 · Answer 2 · answered Mar 21 '19 at 10:27

Per the quote in this answer

The optimizer does not guarantee timing or number of executions of scalar functions. This is a long-estabilished tenet. It's the fundamental 'leeway' tha allows the optimizer enough freedom to gain significant improvements in query-plan execution.

If it is important for your application that the random number be evaluated once and only once you should calculate it up front and store it into a temp table.

Anything else is not guaranteed and so is irresponsible to add into your application's code base - as even if it works now it may break as a result of a schema change/execution plan change/version upgrade/CU install.

For example Lamu's answer breaks if a unique index is added to #Table_1 (Id)

Gordon Linoff · Answer 3 · 2019-03-20T20:34:00.493

0

How about not using a real random number at all? Use rand() with a seed:

WITH RandomCTE AS (
      SELECT Id,
             CONVERT(INT, RAND(ROW_NUMBER() OVER (ORDER BY NEWID()) * 999999) * 5) as RandomNumber
      FROM #Table_1
     )
SELECT r.Id, t.MyId, r.RandomNumber
FROM RandomCTE rINNER JOIN
     #Table_2 t
     ON r.Id = t.ParentId;

The seed argument to rand() is pretty awful. Values of the seed near each other produce similar initial values, which is the reason for the multiplication.

Here is the db<>fiddle.

edited Mar 20 '19 at 20:34

answered Mar 20 '19 at 20:21

Gordon Linoff

1,242,037
58
646
786

Does this have a practical effect on how random the number is? – elizabk Mar 20 '19 at 21:01
@elizabk . . . Unfortunately, it does. If you remove the multiplication, you get the same number for low ids. You can check that the random value is returning different values -- but they are very close together. – Gordon Linoff Mar 20 '19 at 23:35

How can I get a random number generated in a CTE not to change in JOIN?

3 Answers3