6

The problem

I'm generating a random number for each row in a table #Table_1 in a CTE, using this technique. I'm then joining the results of the CTE on another table, #Table_2. Instead of getting a random number for each row in #Table_1, I'm getting a new random number for every resulting row in the join!

CREATE TABLE #Table_1 (Id INT)

CREATE TABLE #Table_2 (MyId INT, ParentId INT)

INSERT INTO #Table_1
VALUES (1), (2), (3)

INSERT INTO #Table_2
VALUES (1, 1), (2, 1), (3, 1), (4, 1), (1, 2), (2, 2), (3, 2), (1, 3)


;WITH RandomCTE AS
(
    SELECT Id, (ABS(CHECKSUM(NewId())) % 5)RandomNumber
    FROM #Table_1
)
SELECT r.Id, t.MyId, r.RandomNumber
FROM RandomCTE r
INNER JOIN #Table_2 t
    ON r.Id = t.ParentId

The results

Id          MyId        RandomNumber
----------- ----------- ------------
1           1           1
1           2           2
1           3           0
1           4           3
2           1           4
2           2           0
2           3           0
3           1           3

The desired results

Id          MyId        RandomNumber
----------- ----------- ------------
1           1           1
1           2           1
1           3           1
1           4           1
2           1           4
2           2           4
2           3           4
3           1           3

What I tried

I tried to obscure the logic of the random number generation from the optimizer by casting the random number to VARCHAR, but that did not work.

What I don't want to do

I'd like to avoid using a temporary table to store the results of the CTE.

How can I generate a random number for a table and preserve that random number in a join without using temporary storage?

elizabk
  • 480
  • 2
  • 11
  • Your query should return 4 columns, not 3. Two columns from `RandomCTE` and two columns from `Table_2`. – The Impaler Mar 20 '19 at 20:05
  • Interesting problem. May I ask why you don't want to use a temp table? I mean, tempdb is used for all sorts of things... and are you trying to force a *different* random number for each ID in table 1? – S3S Mar 20 '19 at 20:09
  • @TheImpaler You’re right, fixed – elizabk Mar 20 '19 at 20:10
  • @scsimon I solved it with a temp table but I’m looking for a cleaner solution – elizabk Mar 20 '19 at 20:15
  • 1
    the temp table *is* the cleanest solution. Rather than bending your SQL into strange shapes trying to workaround the optimiser and getting a result not guaranteed to always behave as you want – Martin Smith Mar 20 '19 at 20:40

3 Answers3

2

This seems to do the trick:

WITH CTE AS(
    SELECT Id, (ABS(CHECKSUM(NewId())) % 5)RandomNumber
    FROM #Table_1),
RandomCTE AS(
    SELECT Id,
           RandomNumber
    FROM CTE
    GROUP BY ID, RandomNumber)
SELECT *
FROM RandomCTE r
INNER JOIN #Table_2 t
    ON r.Id = t.ParentId;

It looks like SQL Server is aware that, at the point of being outside the CTE, that RandomNumber is effectively just NEWID() with some additional functions wrapped around it (DB<>Fiddle), and hence it still generates a unique ID for each row. The GROUP BY clause in the second CTE therefore forces the data engine to define RandomNumber a value so it can perform the GROUP BY.

Thom A
  • 88,727
  • 11
  • 45
  • 75
  • interesting though I'm betting they want each ID in table 1 to have it's own unique number, that doesn't overlap other IDs. – S3S Mar 20 '19 at 20:12
  • I hope not, @scsimon , as stopping the reuse of a random number would be **awfully** messy. And means that the number isn't truly "random" any more. *"pick a number between 1-5"* and *"pick a number between 1-5, but you can't have 1,2,3, or 5 because they've been chosen"* isn't a very random choice at all. I can say for certainty that you'll pick "4". :) – Thom A Mar 20 '19 at 20:14
  • 1
    Nope, I’m looking for a truly random pick. The fact that they didn’t overlap in my desired results example was, well, random. – elizabk Mar 20 '19 at 20:19
  • oh i don't disagree, i'm just basing off that *expected result* where the random number is for each ID. With a temp table, each ID could only have 1 number, and this would make sense for the OP. Looks like you nailed it Larnu. – S3S Mar 20 '19 at 20:20
  • It will continue to work until in SQL Server 2023 optimiser becomes smart enough to realise that `SELECT Col1, Col2 FROM T GROUP BY Col1, Col2` can be optimised away... – Vladimir Baranov Mar 20 '19 at 23:53
  • 1
    or a unique index is added to Col1 with current optimiser. https://dbfiddle.uk/?rdbms=sqlserver_2017&fiddle=c723616c650626f5cfbcfb9bc0dc7350 – Martin Smith Mar 21 '19 at 07:54
2

Per the quote in this answer

The optimizer does not guarantee timing or number of executions of scalar functions. This is a long-estabilished tenet. It's the fundamental 'leeway' tha allows the optimizer enough freedom to gain significant improvements in query-plan execution.

If it is important for your application that the random number be evaluated once and only once you should calculate it up front and store it into a temp table.

Anything else is not guaranteed and so is irresponsible to add into your application's code base - as even if it works now it may break as a result of a schema change/execution plan change/version upgrade/CU install.

For example Lamu's answer breaks if a unique index is added to #Table_1 (Id)

Martin Smith
  • 438,706
  • 87
  • 741
  • 845
0

How about not using a real random number at all? Use rand() with a seed:

WITH RandomCTE AS (
      SELECT Id,
             CONVERT(INT, RAND(ROW_NUMBER() OVER (ORDER BY NEWID()) * 999999) * 5) as RandomNumber
      FROM #Table_1
     )
SELECT r.Id, t.MyId, r.RandomNumber
FROM RandomCTE rINNER JOIN
     #Table_2 t
     ON r.Id = t.ParentId;

The seed argument to rand() is pretty awful. Values of the seed near each other produce similar initial values, which is the reason for the multiplication.

Here is the db<>fiddle.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • Does this have a practical effect on how random the number is? – elizabk Mar 20 '19 at 21:01
  • @elizabk . . . Unfortunately, it does. If you remove the multiplication, you get the same number for low ids. You can check that the random value is returning different values -- but they are very close together. – Gordon Linoff Mar 20 '19 at 23:35