
I want to return rows in random order from a table where a large number of rows must be scanned.

Tried:

1) select * from table order by rand() limit 1

2) select * from table where id in (select id from table order by rand() limit 1)

Query 2 is faster than query 1, but it is still too slow on a table with a large number of rows.

Update: The query is used in a real-time app. Inserts, selects and updates run at roughly 10/sec, so caching will not be the ideal solution. The number of rows required for this specific case is 1, but I am also looking for a general solution where the query is fast and the number of rows required is > 1.

Shubham
  • I am not sure why someone voted to close this question; to me it is very clear and well written. – We Are All Monica Aug 27 '18 at 06:02
  • Are you using something else? PHP, Python, Ruby? – Erubiel Aug 27 '18 at 07:02
  • @Erubiel the expected result should be purely MySQL-based – Shubham Aug 27 '18 at 07:58
  • Are you sure that query #2 is faster than query #1? Have you run both queries multiple times in different order? Are the execution plans different from each other? This would mean the optimizer is doing a very bad job here. – Thorsten Kettner Aug 27 '18 at 07:59
  • @ThorstenKettner yes, 2 is faster. The reason is clear as well: in 1 it needs to randomize the whole row, whereas in 2 it only needs to randomize the primary key (maybe it is using indexes, not tested) – Shubham Aug 27 '18 at 08:00
  • I guess that Erubiel has in mind that if you want this several times, it may make sense to select more rows at a time from the database and cache them in your app for the next access. – Thorsten Kettner Aug 27 '18 at 08:01
  • but after caching, the results will not be real-time randomization – Shubham Aug 27 '18 at 08:02
  • @Shubham: 1. No, the DBMS is free to choose whatever way to achieve the result. Why should it read all rows when it only needs to read one result row? Have you looked at the execution plans? Are they different? 2. That's true. Does the table content change so often? You could use some timestamp; if the last query was submitted more than, say, five minutes ago, then query again, else use the cache. – Thorsten Kettner Aug 27 '18 at 08:05
  • How are you using the query. Do you want one row only? How often will you call it? With what frequency? How often does the table content change? – Thorsten Kettner Aug 27 '18 at 08:05
  • @ThorstenKettner updated question – Shubham Aug 27 '18 at 08:09

4 Answers


The fastest way is using a prepared statement in MySQL with LIMIT and a random offset:

select @offset:=floor(rand()*total_rows_in_table);
PREPARE STMT FROM 'select id from table limit ?,1';
EXECUTE STMT USING @offset; 

total_rows_in_table = the total number of rows in the table.

It is much faster compared to the two queries above.

Limitation: fetching more than 1 row this way is not truly random, since the extra rows come from consecutive offsets.
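As a sketch of the same offset technique driven from application code (SQLite stands in for MySQL here, and the table name `t` is made up for the demo):

```python
import random
import sqlite3

# Demo table; SQLite stands in for MySQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO t (payload) VALUES (?)",
                 [(f"row-{i}",) for i in range(1000)])

# Count once (this could be cached), then jump to a random offset.
total_rows = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
offset = random.randrange(total_rows)

# The parameterized OFFSET plays the role of the MySQL prepared statement.
row = conn.execute("SELECT id FROM t LIMIT 1 OFFSET ?", (offset,)).fetchone()
```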

Shubham

Generate a random set of IDs before doing the query (you can also get MAX(id) very quickly if you need it). Then run the query with id IN (your, list). This will use the index to look only at the IDs you requested, so it will be very fast.

Limitation: if some of your randomly chosen IDs don't exist, the query will return fewer results, so you'll need to do these operations in a loop until you have enough results.
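A minimal sketch of this loop, using Python with SQLite as a stand-in for MySQL (the table and helper names are illustrative; the table is created with deliberate ID gaps so the retry path is exercised):

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, payload TEXT)")
# Leave gaps on purpose: only even ids exist.
conn.executemany("INSERT INTO t (id, payload) VALUES (?, ?)",
                 [(i, f"row-{i}") for i in range(0, 2000, 2)])

def random_rows(conn, n):
    max_id = conn.execute("SELECT MAX(id) FROM t").fetchone()[0]
    rows = {}
    while len(rows) < n:
        # Draw n candidate ids; some may not exist, so keep looping
        # until enough of them hit real rows.
        candidates = random.sample(range(1, max_id + 1), n)
        placeholders = ",".join("?" * len(candidates))
        for row in conn.execute(
                f"SELECT id, payload FROM t WHERE id IN ({placeholders})",
                candidates):
            rows[row[0]] = row
    return list(rows.values())[:n]
```

The IN list hits the primary-key index, so each iteration touches only the candidate ids rather than scanning the table.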

We Are All Monica
  • this has another limitation: it depends on the language the code is written in and its power to generate random numbers efficiently – Shubham Aug 27 '18 at 07:58

If you can run two queries in the same "call", you can do something like this. Sadly, this assumes there are no deleted records in your database; if there were, some queries would not return anything.

I tested with some local records and the fastest I could do was this; that said, I tested it on a table with no deleted rows.

SET @randy = CAST(rand()*(SELECT MAX(id) FROM yourtable) as UNSIGNED);

SELECT *
FROM yourtable
WHERE id = @randy;

Another solution came from modifying the answer to this question a little, and from your own solution: Using variables as OFFSET in SELECT statements inside MySQL's stored functions

SET @randy = CAST(rand()*(SELECT MAX(id) FROM yourtable) as UNSIGNED);
SET @q1 = CONCAT('SELECT * FROM yourtable LIMIT 1 OFFSET ', @randy);
PREPARE stmt1 FROM @q1;
EXECUTE stmt1;
Erubiel
  • The idea is good, but I would never rely on IDs being contiguous. You don't even have to delete records in order to create gaps. Any error on insert would roll back the transaction and dismiss the IDs that were supposed to be used. – Thorsten Kettner Aug 27 '18 at 07:39
  • You are right, sir. Maybe wrapped with MAX(ID) and a WHILE this result is NULL? I don't know how I would write that, by the way... for the meantime I'll correct my answer with MAX(id). I also removed the SELECT * explanation; although it lowers the time, it is not that relevant when you are retrieving just one column – Erubiel Aug 27 '18 at 07:41
  • how is it faster? Both use the same statements, just syntactically different – Shubham Aug 27 '18 at 08:17
  • Let me do a couple more runs – Erubiel Aug 27 '18 at 08:17
  • do add sql_no_cache in the query to disable cached results before testing – Shubham Aug 27 '18 at 08:18
  • it's in the same range... although, mine calculates the max id hehehe ;) but still no answer to the multiple-rows problem... – Erubiel Aug 27 '18 at 08:20
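Following up on the gap problem discussed in these comments: a common variant (not from the answer above, so treat it as a sketch) replaces the equality with a range scan, so the nearest existing id at or above the random value is returned even when that value falls into a gap. The trade-off is a slight bias toward ids that sit after large gaps. Demonstrated here with Python and SQLite standing in for MySQL:

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE yourtable (id INTEGER PRIMARY KEY, payload TEXT)")
# Only every third id exists, simulating deleted rows.
conn.executemany("INSERT INTO yourtable (id, payload) VALUES (?, ?)",
                 [(i, f"row-{i}") for i in range(1, 3000, 3)])

max_id = conn.execute("SELECT MAX(id) FROM yourtable").fetchone()[0]
randy = random.randrange(1, max_id + 1)

# Range scan instead of equality: the nearest existing id at or above
# randy is returned, so gaps never produce an empty result.
row = conn.execute(
    "SELECT id FROM yourtable WHERE id >= ? ORDER BY id LIMIT 1",
    (randy,)).fetchone()
```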

Imagine a table with, say, a million entries. You want to pick a row randomly, so you generate one random number per row, i.e. a million random numbers, and then seek the row with the minimum generated number. There are two tasks involved:

  1. generating all those numbers
  2. finding the minimum number

and then accessing the record of course.

If you wanted more than one row, the DBMS could sort all records and then return n records, but hopefully it would rather apply some partial-sort operation where it only detects the n minimum numbers. Quite a task anyway.

There is no thorough way to circumvent this, I guess. If you want random access, this is the way to go.

If you would be ready to live with a less random result, however, I'd suggest making ID buckets. Imagine ID buckets 000000-099999, 100000-199999, ... Then randomly choose one bucket and pick your random rows from it. Well, admittedly, this doesn't look very random, and you would get either only old or only new records with such buckets; but it illustrates the technique.

Instead of creating the buckets by value, you'd create them with a modulo function. id % 1000 would give you 1000 buckets. The first with IDs xxx000, the second with IDs xxx001. This would solve the new/old records thing and get the buckets balanced. As IDs are a mere technical thing, it doesn't matter at all that the drawn IDs look so similar. And even if that bothers you, then don't make 1000 buckets, but say 997.

Now create a computed column:

alter table mytable add column bucket int generated always as (id % 997) stored;

Add an index:

create index idx on mytable(bucket);

And query the data:

select *
from mytable
where bucket = floor(rand() * 997)
order by rand()
limit 10;

Only about 0.1% of the table makes it into the sort here, so this should be rather fast. But I suppose it only pays off with a very large table and a high number of buckets.

Disadvantages of the technique:

  • It can happen that you don't get as many rows as you want and you'd have to query again then.
  • You must choose the modulo number wisely. If there are just two thousand records in the table, you wouldn't make 1000 buckets of course, but maybe 100 and never demand more than, say, ten rows at a time.
  • If the table grows and grows, a once chosen number may no longer be optimal and you might want to alter it.
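As a rough sketch of the bucket technique (SQLite standing in for MySQL, so the bucket column is filled by hand instead of being a generated column; the names and the bucket count of 97 are made up for the demo):

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t (id INTEGER PRIMARY KEY, bucket INTEGER, payload TEXT)")
conn.execute("CREATE INDEX idx_bucket ON t(bucket)")

BUCKETS = 97  # choose relative to table size
# Fill the bucket column manually (MySQL would use a generated column).
conn.executemany(
    "INSERT INTO t (id, bucket, payload) VALUES (?, ?, ?)",
    [(i, i % BUCKETS, f"row-{i}") for i in range(1, 10001)])

# Pick one bucket at random, then sort only that bucket's ~1% of rows.
bucket = random.randrange(BUCKETS)
rows = conn.execute(
    "SELECT id FROM t WHERE bucket = ? ORDER BY random() LIMIT 10",
    (bucket,)).fetchall()
```

The index on bucket restricts the ORDER BY random() sort to one bucket's worth of rows instead of the whole table.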

Rextester link: http://rextester.com/VDPIU7354

UPDATE: It just dawned on me that the buckets would be truly random if the generated column were based not on a modulo of the ID but on a RAND() value instead:

alter table mytable add column bucket int generated always as (floor(rand() * 1000)) stored;

but MySQL throws an error "Expression of generated column 'bucket' contains a disallowed function". This doesn't seem to make sense, as a non-deterministic function should be okay with the STORED option, but at least in version 5.7.12 this doesn't work. Maybe in some later version?

Thorsten Kettner