0

I need to select random rows from my SQL table. When I searched for this case on Google, the suggestion was to use ORDER BY NEWID(), but that reduces performance. Since my table has more than 2'000'000 rows of data, this solution does not suit me.

I tried this code to get random data:

SELECT TOP 10 * 
FROM Table1
WHERE (ABS(CAST((BINARY_CHECKSUM(*) * RAND()) AS INT)) % 100) < 10 

It also hurts performance sometimes.

Could you please suggest a good solution for getting random data from my table? I only need a few rows from that table, around 30 rows per request. I tried TABLESAMPLE to get the data, but it returns nothing once I add my WHERE condition, because it returns data on the basis of pages, not rows.
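A minimal sketch of that TABLESAMPLE attempt, assuming a placeholder table name, filter column, and value (not the actual query):

SELECT *
FROM dbo.Table1 TABLESAMPLE (30 ROWS)  -- sampling picks whole pages before the WHERE is applied
WHERE SomeColumn = 'SomeValue';        -- so a selective filter can easily leave zero rows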

Manikandan
  • [Check this, I think this helps you](https://stackoverflow.com/questions/19412/how-to-request-a-random-row-in-sql) – Dogan Oct 09 '17 at 11:55
  • Does your table have a unique index or primary key? – Dan Guzman Oct 09 '17 at 12:12
  • Yes, it has a primary key with a unique index – Manikandan Oct 09 '17 at 12:14
  • Is your primary key an identity? – MtwStark Oct 09 '17 at 12:26
  • @MtwStark No, it is a uniqueidentifier – Manikandan Oct 09 '17 at 12:28
  • I would suggest moving your primary key to an identity column and keeping your uniqueidentifier as an alternate key. It will be very useful and you will see performance improvements in all conditions. Just consider that if your primary key is `CLUSTERED`, your new records will be inserted in the middle of the table and the whole table reorganized. With an identity PK your records will be appended to the tail (a rough DDL sketch follows these comments). – MtwStark Oct 09 '17 at 13:09
  • @MtwStark I am okay with changing my primary key to an identity for the performance improvement; could you please suggest something along those lines – Manikandan Oct 10 '17 at 01:24
  • you can look at my answer, it is applicable to both the identity and the uniqueidentifier case – MtwStark Oct 10 '17 at 07:40
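A rough DDL sketch of the change suggested in the comments above (an identity clustered primary key, with the existing uniqueidentifier kept as a unique alternate key); all object and column names here are assumptions, not taken from the thread:

-- Add an identity column to become the new clustered primary key
ALTER TABLE dbo.Table1 ADD RowId INT IDENTITY(1,1) NOT NULL;

-- Drop the existing clustered PK on the uniqueidentifier (the constraint name is an assumption)
ALTER TABLE dbo.Table1 DROP CONSTRAINT PK_Table1;

-- Recreate the PK on the identity column and keep the GUID column as an alternate key
ALTER TABLE dbo.Table1 ADD CONSTRAINT PK_Table1 PRIMARY KEY CLUSTERED (RowId);
ALTER TABLE dbo.Table1 ADD CONSTRAINT AK_Table1_ID UNIQUE NONCLUSTERED (ID);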

2 Answers

1

Try to calculate the random ids before filtering your big table.
Since your key is not an identity, you need to number the records, and this will affect performance.

Note that I have used the DISTINCT clause to be sure to get different numbers.

EDIT: I have modified the query to use an arbitrary filter on your big table

declare @n int = 30

;with
t as (
    -- EXTRACT DATA AND NUMBER ROWS
    select *, ROW_NUMBER() over (order by YourPrimaryKey) n
    from YourBigTable t
    -- SOME FILTER
    WHERE 1=1  /* <-- PUT HERE YOUR COMPLEX FILTER LOGIC */
),
r as (
    -- RANDOM NUMBERS BETWEEN 1 AND COUNT(*) OF FILTERED TABLE
    select distinct top (@n) abs(CHECKSUM(NEWID()) % n)+1  rnd
    from sysobjects s
    cross join (SELECT MAX(n) n FROM t) t
)
select t.*
from t
join r on r.rnd =  t.n
MtwStark
  • Since my table doesn't have an integer id value, I used a uniqueidentifier – Manikandan Oct 09 '17 at 12:27
  • Nice, is it good for big tables? My table has more than 2'000'000 rows – Manikandan Oct 10 '17 at 08:35
  • in my tests it is very very fast (250ms), also with a uniqueidentifier as the primary key, and you can improve it much more by replacing the `COUNT(*)` for `@max` with a more efficient function and get ~30% less (see the catalog-view sketch after these comments) – MtwStark Oct 10 '17 at 09:34
  • It's really nice. The numbers are generated on the basis of the @max value; in my case I add some where conditions to pick the data, so I need to duplicate this condition to get the max count and the random pick. Is there any other way to avoid this? – Manikandan Oct 10 '17 at 10:44
  • how many rows do you have after the where condition? 1'000, 100K, 500K, 1M, 1.8M? – MtwStark Oct 10 '17 at 10:58
  • I will get 100k rows after the where condition, and one more thing: the above code rarely picks the first and second rows. – Manikandan Oct 11 '17 at 02:08
  • for the where condition, check the new version.. and what do you mean about picking the 1st and 2nd rows? – MtwStark Oct 11 '17 at 09:48
  • Hey, it was great. I will check your code against my table. My question was about how it picks random data: consider that my row numbers are a sequence from 1 to 100. The above code picks random rows, and that's fine, but I executed this query more than 50 times and I never saw my 1st or 2nd row; it mostly picks row number 10 or more – Manikandan Oct 12 '17 at 07:33
  • this is normal, they call it the Gaussian distribution. It is very difficult to pick the boundaries when you pick 30 numbers out of 2M rows (or 100K rows). Also, be sure your where condition includes them or you'll never get them! :) – MtwStark Oct 12 '17 at 07:46
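As a sketch of the "more efficient function" idea mentioned in the comments above, an approximate row count can be read from the catalog views instead of scanning the table (the table name is an assumption, and this only covers the unfiltered table, so it does not replace MAX(n) when a WHERE filter is in play):

SELECT SUM(p.rows) AS ApproxRowCount
FROM sys.partitions AS p
WHERE p.object_id = OBJECT_ID(N'dbo.Table1')
  AND p.index_id IN (0, 1);  -- 0 = heap, 1 = clustered index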
-1

If your uniqueidentifier key is a random GUID (not generated with NEWSEQUENTIALID() or UuidCreateSequential), you can use the method below. This will use the clustered primary key index without sorting all rows.

SELECT t1.*
FROM (VALUES(
      NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID())
    ,(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID())
    ,(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID())) AS ThirtyKeys(ID)
CROSS APPLY(SELECT TOP (1) * FROM dbo.Table1 WHERE ID >= ThirtyKeys.ID) AS t1;
Dan Guzman