48

I know that If I run this query

select top 100 * from mytable order by newid()

it will get 100 random records from my table.

However, I'm a bit confused as to how it works, since I don't see newid() in the select list. Can someone explain? Is there something special about newid() here?

Michael
  • 8,362
  • 6
  • 61
  • 88
Tola Odejayi
  • 3,019
  • 9
  • 31
  • 46
  • Note that this is a slow way to get 100 random entries unless the db server recognizes this as a known pattern to optimize. – CodesInChaos Feb 12 '11 at 18:37
  • It's also only pseudo-random. If you need true randomness for security, never use this method. – Justin Morgan - On strike Feb 12 '11 at 18:40
  • 2
    The columns in your `ORDER BY` clause do not need to appear in your `SELECT` clause in SQL Server. – Gabe Feb 12 '11 at 18:51
  • 1
    You should also be wary when using this technique on large tables as it will use tempdb [as mentioned in this article](http://msdn.microsoft.com/en-us/library/cc441928.aspx), which does offer an alternative method. – domager Apr 25 '13 at 13:40

5 Answers5

39

I know what NewID() does, I'm just trying to understand how it would help in the random selection. Is it that (1) the select statement will select EVERYTHING from mytable, (2) for each row selected, tack on a uniqueidentifier generated by NewID(), (3) sort the rows by this uniqueidentifier and (4) pick off the top 100 from the sorted list?

Yes. this is pretty much exactly correct (except it doesn't necessarily need to sort all the rows). You can verify this by looking at the actual execution plan.

SELECT TOP 100 * 
FROM master..spt_values 
ORDER BY NEWID()

The compute scalar operator adds the NEWID() column on for each row (2506 in the table in my example query) then the rows in the table are sorted by this column with the top 100 selected.

SQL Server doesn't actually need to sort the entire set from positions 100 down so it uses a TOP N sort operator which attempts to perform the entire sort operation in memory (for small values of N)

Plan

Martin Smith
  • 438,706
  • 87
  • 741
  • 845
  • Got it! And yes, you're right - once I've determined the top 100 rows from the entire set, there's no need to sort the rest. – Tola Odejayi Feb 12 '11 at 22:08
  • So, is it safe to ensure that no data is written? Since this is a SELECT query, the NEWID() will calculate a randomized identifier just for the query, it won't be updating anything in the database with this new id, right? – K09P Mar 25 '19 at 11:40
  • 1
    Yes it won't affect the tables you are selecting from. At least some of the data will be temporarily written to a worktable in tempdb to hold at least the TOP N results but nothing written to the user database – Martin Smith Mar 25 '19 at 12:36
  • I don't see it mentioned anywhere else, but because of how this works, this is probably a terrible way of selecting random rows from a table. If you have a table with many rows (i.e. half a billion), do not run a query like that. If I ran this on a table with 500GB of data in it, I'd be in trouble. Better off using the built-in feature "TABLESAMPLE" that's meant for selecting random rows from data pages: `SELECT * FROM Person.Person TABLESAMPLE (10 PERCENT);` – Triynko Apr 28 '23 at 23:35
12

In general it works like this:

  • All rows from mytable is "looped"
  • NEWID() is executed for each row
  • The rows are sorted according to random number from NEWID()
  • 100 first row are selected
Uhlen
  • 1,778
  • 15
  • 29
4

as MSDN says:

NewID() Creates a unique value of type uniqueidentifier.

and your table will be sorted by this random values.

Saeed Amiri
  • 22,252
  • 5
  • 45
  • 83
  • 1
    Thanks - I know what NewID() does, I'm just trying to understand how it would help in the random selection. Is it that [1] the select statement will select EVERYTHING from mytable, [2] for each row selected, tack on a uniqueidentifier generated by NewID(), [3] sort the rows by this uniqueidentifier and [4] pick off the top 100 from the sorted list? – Tola Odejayi Feb 12 '11 at 19:50
1

I have an unimportant query which uses newId() and joins many tables. It returns about 10k rows in about 3 seconds. So, newId() might be ok in such cases where performance is not too bad & does not have a huge impact. But, newId() is bad for large tables.

Here is the explanation from Brent Ozar's blog - https://www.brentozar.com/archive/2018/03/get-random-row-large-table/.

From the above link, I have summarized the methods which you can use to generate a random id. You can read the blog for more details.

4 ways to get a random row from a large table:

  1. Method 1, Bad: ORDER BY NEWID() > Bad performance!
  2. Method 2, Better but Strange: TABLESAMPLE > Many gotchas & is not really random!
  3. Method 3, Best but Requires Code: Random Primary Key > Fastest, but won't work for negative numbers.
  4. Method 4, OFFSET-FETCH (2012+) > Only performs properly with a clustered index.

More on method 3: Get the top ID field in the table, generate a random number, and look for that ID. For top N rows, call the code below N times or generate N random numbers and use in an IN clause.

/* Get a random number smaller than the table's top ID */
DECLARE @rand BIGINT;
DECLARE @maxid INT = (SELECT MAX(Id) FROM dbo.Users);
SELECT @rand = ABS((CHECKSUM(NEWID()))) % @maxid;

/* Get the first row around that ID */
SELECT TOP 1 *
FROM dbo.Users AS u
WHERE u.Id >= @rand;
DigitalNomad
  • 131
  • 3
  • I find method 3 to be the fastest. But I have a problem when I get the top N. I ran it N times, but there will be several times where the data will overlap so it's not really random N values. Do you have any ideas?- – Lê Duy Thứ Apr 27 '23 at 09:37
-1

use select top 100 randid = newid(), * from mytable order by randid you will be clarified then..

Sukhi
  • 45
  • 1
  • 7