Any flaws in this method of selecting random records from the db?

Question

I have a db table filled with about ~30k records.

I want to randomly select a record one at a time (when demanded by users), delete the record from the table, and insert it in another table.

I've heard/found that doing ORDER BY RAND() can be quite slow. So I'm using this algorithm (pseudo code):

lowest = getLowestId(); //get lowest primary key id from table
highest = getHighestId(); //get highest primary key id from table

do
{
    id = rand(lowest, highest); //get random number between a range of lowest id and highest id
    idExists = checkIfRandomIdExists( id );
}
while (! idExists);

row = getRow (id);
process(row);
delete(id);

Right now, with 30k records, I seem to get random ids very quickly. However as the table size decreases to 15k, 10k, 5k, 100, etc, (can be months) I'm concerned that this might begin to get slower.

Can I do anything to make this method more effective, or is there a row count at which point I should start doing ORDER BY RAND() instead of this method? (e.g when 5k rows are left, start doing ORDER BY RAND() ?)

When dealing with random numbers, it's usually best not to iterate. Instead, try to get an array full of all possible IDs, and randomly select from that. — Madara's Ghost, May 11 '12 at 21:05
possible duplicate of [Linq Orderby random](http://stackoverflow.com/questions/3339192/linq-orderby-random) — Ian Mercer, May 11 '12 at 21:08

Dan · Answer 1 · 2012-05-11T21:10:32.127

3

You could get a random ID using that method, but instead of checking to see if it exists, just try and get the closest one?

SELECT * FROM table WHERE id >= $randomId ORDER BY id LIMIT 0,1

Then if that fails go for a lower one.

edited May 11 '12 at 21:10

answered May 11 '12 at 21:08

Dan

379
2
9

That will only be uniform if the id's can be guaranteed to be sequential *(which is rare, unless no rows are **ever** deleted...)* – BlueRaja - Danny Pflughoeft May 11 '12 at 21:31
1

This would be efficient, however it doesn't select the IDs in a uniform fashion (if that is important to the OP). When there is a larger gap between IDs, the larger ID has a higher probability of being selected. Look at the extreme example with two IDs in the table of values 1 and 100. This method will choose ID 100 99% of the time (instead of 50% as a uniform selection method would). – Mark Wilkins May 11 '12 at 21:31
what will `limit 0` do in this context? – Ali May 11 '12 at 21:49
I suppose it could be evened out by generating an operator in the code outside. Something like: `operator = rand(0, 1) ? '>=' : '<=';`, then stick that in the query, and do the opposite if it fails. – Dan May 14 '12 at 21:07

score 3 · Answer 2 · answered May 11 '12 at 21:11

3

One way to do it might be to determine number of records and choose by record:

select floor(count(*) * rand()) from thetable;

Use the resulting record number (e.g., chosenrec) in the limit:

select * from thetable limit chosenrec, 1;

answered May 11 '12 at 21:11

Mark Wilkins

40,729
5
57
110

Can you do sub-queries in a `limit` expression in MySQL? `select * from thetable limit (select floor(count(*) * rand()) from thetable), 1;` – BlueRaja - Danny Pflughoeft May 11 '12 at 22:04
@BlueRaja-DannyPflughoeft: I'm not sure; I actually tried that before posting the answer because it would make the solution much nicer. But I failed to come up with the proper syntax (if it is allowed). – Mark Wilkins May 11 '12 at 22:06

score 2 · Accepted Answer · answered May 11 '12 at 21:13

2

I might recommend a Fisher-Yates Shuffle instead in a separate table. To generate this, create a table like:

CREATE TABLE Shuffle
(
    SequentialId INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    OtherTableId INT NOT NULL
)

Notably, don't bother with the foreign key constraint. In SQL Server, for instance, I would say to add a foreign key constraint with ON DELETE CASCADE; if you have a storage engine for which that would be workable in MySQL, go for it.

Now, in the language of your choice:

Get an array of all the IDs in the other table (as @Truth suggested in comments).
Shuffle these ids using Fisher-Yates (takes linear time).
Insert them into the Shuffle table in order.

Now, you have a random order, so you can just INNER JOIN to the Shuffle table, then ORDER BY Shuffle.SequentialId to find the first record. You can delete the record from Shuffle manually if you have no way to do ON DELETE CASCADE.

answered May 11 '12 at 21:13

Andrew

14,325
4
43
64

Thanks, a very elegant solution. – Ali May 11 '12 at 21:33
That would take even more resources . If it's a permanent table , the oder will allways be the same . If it's a memory table i don't see the point of doing that (the owner of the question sugested same thing in a server side language ) . Anyway in my opinion it's not good to mix app logic with db logic . Just let the dbms take care of things . – Cata Cata May 11 '12 at 22:05
1

@CataCata More resources than what? In this case, it appears to be fine if the random order is created once. This solution only orders records that actually exist, then deletes the Shuffle records when the corresponding records in the other table are deleted. If the problem were instead to select one random record at a time, and the same record could be selected multiple times, this solution would be inappropriate. – Andrew May 11 '12 at 22:11

Any flaws in this method of selecting random records from the db?

3 Answers3