5

I have a fairly simple SQL (MySQL):

SELECT foo FROM bar ORDER BY rank, RAND()

I notice that when I refresh the results, the randomness is suspiciously weak.

In the sample data at the moment there are six results with equal rank (integer zero). There are lots of tests for randomness but here is a simple one to do by hand: when run twice, the first result should be the same in both runs about one sixth of the time. This is certainly not happening, the leading result is the same at least a third of the time.

I want a uniform distribution over the permutations. I'm not an expert statistician but I'm pretty sure ORDER BY RAND() should achieve this. What am I missing?

With MySQL, SELECT rand(), rand() shows two different numbers, so I don't buy the "once per query" explanation

Martin Smith
  • 438,706
  • 87
  • 741
  • 845
spraff
  • 32,570
  • 22
  • 121
  • 229

2 Answers2

3

RAND() is only executed once per query. You can verify this by looking at the result set.

If you're trying to get a randomized order, you should be using either NEWID() or CHECKSUM(NEWID()).

WITH T AS ( -- example using RAND()
  SELECT 'Me' Name UNION SELECT 'You' UNION SELECT 'Another'
)
SELECT Name, RAND()
FROM T;

WITH T AS ( -- example using just NEWID()
  SELECT 'Me' Name UNION SELECT 'You' UNION SELECT 'Another'
)
SELECT Name, NEWID()
FROM T;

WITH T AS ( -- example getting the CHECKSUM() of NEWID()
  SELECT 'Me' Name UNION SELECT 'You' UNION SELECT 'Another'
)
SELECT Name, CHECKSUM(NEWID())
FROM T;
Yuck
  • 49,664
  • 13
  • 105
  • 135
  • You're assuming SQL Server? I think they must be on MySQL or they would be getting the same ordering every time. – Martin Smith Apr 04 '12 at 18:17
  • @MartinSmith I added the link, not Yuck, so it's my assumption. – JNK Apr 04 '12 at 18:18
  • 1
    @MartinSmith To be fair, it's my assumption as well. I am, after all, referring to built-in functions unique to MSSQL. – Yuck Apr 04 '12 at 18:19
  • Do not use `NEWID` as a source of randomness. `NEWID` is for generating unique values. Unique is not the same as random! There's no guarantee that it's random (it doesn't have to be, it could just be sequentially allocated and still be unique). There's no guarantee that `CHECKSUM(NEWID)` is random either. No, no, no a thousand times no! – jason Apr 04 '12 at 18:21
  • @Jason That's why it's wrapped with `CHECKSUM()` which produces a **pseudo-random integer**. It may not be cryptographically strong, but it certainly results in a unique ordering each time the query is run. – Yuck Apr 04 '12 at 18:22
  • No. `CHECKSUM` is for balancing hash tables. There is no guarantee it's random or pseudo-random. No, no, no, a million times no! – jason Apr 04 '12 at 18:23
  • 2
    @Jason Yes yes yes. This is an accepted practice when working specifically with MS SQL Server. http://stackoverflow.com/questions/4979799/order-by-newid-how-does-it-work See Martin Smith's answer on the subject, and countless other articles advocating the same. – Yuck Apr 04 '12 at 18:27
  • @Yuck: I don't care. Accepted doesn't mean right. They are wrong. It's an incredibly common misunderstanding. Unique is not random. Period. Checksum is not random. Period. – jason Apr 04 '12 at 18:29
  • 3
    @Jason - So what would you use? (In 2008 `CRYPT_GEN_RANDOM` is available but assume previous versions) – Martin Smith Apr 04 '12 at 18:29
  • @Jason If you want guaranteed randomness you should use a custom function, but this solution is close enough for most scenarios – JNK Apr 04 '12 at 18:30
1

The RAND() can not be refresh for each row. A possible solution might be:

SELECT foo FROM bar ORDER BY rank, CHECKSUM(NEWID())
Arion
  • 31,011
  • 10
  • 70
  • 88
  • No! `NEWID` for unique values, unique is not the same as random. `CHECKSUM` is for balancing hash tables, not for generating random values. There is no guarantee that any of this is random. No, no, no! – jason Apr 04 '12 at 18:24