
I'm trying to grab a random row from a table where the data doesn't change. I've read that people try ORDER BY RAND() which is terrible for large datasets and doesn't scale well.

I've also seen the solution where you get SQL to pick a random row between a minimum/maximum range, like so: FLOOR(MAX(needed_id) * RAND()), but this only works when the IDs are sequential: 1,2,3,4,5,6,7,8,9,10.

The data I need to pull out isn't sequential, for example: 1,2,3,4,10,11,12,13

So I'm thinking there are two solutions:

1st Solution: Keep running FLOOR(MAX(needed_id) * RAND()) until I receive a row of the right type (a 1/6 chance)

2nd Solution: Create a duplicate table (as my data never changes) like so:

temp_id | needed_id | type
      1 |         1 |    1
      2 |         4 |    1
      3 |         7 |    2
      4 |         8 |    2

So I can pull out a random temp_id using the same method, FLOOR(MAX(temp_id) * RAND()), combined with WHERE type = 1

What do you think? With the 1st solution I'd probably have to run the query about 6 times before getting a row of the correct type, whereas the 2nd solution would work straight away but requires an extra table.
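For reference, here is a rough sketch of how the 2nd solution's lookup might be written (the table name needed_map is just a placeholder, and it assumes the temp_id values for a given type form one contiguous block, as in the example above):

-- Sketch only: `needed_map` is a placeholder name for the duplicate table.
-- Assumes the temp_id values for type = 1 are contiguous (1 and 2 in the example).
SET @pick = (SELECT FLOOR(MIN(temp_id) + RAND() * (MAX(temp_id) - MIN(temp_id) + 1))
             FROM needed_map
             WHERE type = 1);

SELECT needed_id
FROM needed_map
WHERE type = 1
  AND temp_id = @pick;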

creamcheese

3 Answers


Your statement

but this would only work when the rows are sequential:

is not completely correct: the FLOOR() and MAX() examples do work on non-sequential rows, because you would do something like

WHERE id >= FLOOR(RAND()*MAX(id)) LIMIT 1

So you take the closest ID to the random hit you're getting.

This does have a slight preference for hits that are directly after a big gap in your sequence, but that might not be too bad, depending on your dataset.

So depending on how much of a problem this slight preference is for you, what your dataset looks like, etc., this could still be the best solution.

Because it seems to be unclear to some: the usage of these functions is not a problem.

MAX is quick on an indexed field. You don't need to count all the rows (which is slow on InnoDB); you just need to traverse the B-tree index, so you'll find this value in logarithmic time. This is near-instant.

FLOOR is just a mathematical function that runs in constant time, as does RAND(). Mind you, ORDER BY RAND() is not slow because of RAND(), but because you need to sort the complete table! That is not a problem with RAND(), but with ORDER BY.

Now you have a query that does something like:

WHERE id >= 48 LIMIT 1

Which is of course very quick on an indexed field. Remember that you got that 48 (just an example value) without doing any sort of table scan.
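Put together, a minimal runnable sketch of this approach might look like the following (the items table and id column are placeholder names; the random boundary is computed into a variable first, since an aggregate like MAX() can't sit directly in the outer WHERE clause):

-- Sketch only: `items` / `id` are placeholder names for your table and its indexed primary key.
-- MAX(id) is resolved from the index, so computing the boundary is near-instant.
SET @boundary = (SELECT FLOOR(RAND() * MAX(id)) FROM items);

-- Grab the first row at or above the random boundary (an index range scan).
SELECT *
FROM items
WHERE id >= @boundary
ORDER BY id
LIMIT 1;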

Nanne
  • Hi, sorry I guess it's my fault that I didn't quite specify what I meant. I need rows that meet certain criteria (type=1 for instance). So when it grabs a random number between 1 and 250,000 rows, it may pick a random row that doesn't have type 1. Although I guess you're right... it would pick the next thing closest to what I'm after :) – creamcheese Jun 19 '11 at 11:39
  • 3 functions used, and RAND() still not removed from the query. I don't think this query will be much faster. – OZ_ Jun 19 '11 at 11:42
  • Remember that you could just add the `WHERE type = 1`. This would only make your result set smaller (so the gaps bigger), but it will still work, and you don't need to re-do your query multiple times! – Nanne Jun 19 '11 at 11:43
  • one thing is ORDER BY rand(). Another is calling RAND() like that. – dynamic Jun 19 '11 at 11:44
  • @OZ_ The functions aren't slow at all, those are lightning quick. `RAND` and `FLOOR` are constant-time, and `MAX` on an indexed column will perform in (guessing) something like `log` time. The problem with rand is when you use `ORDER BY rand()`, because you have to do a full table scan. You don't in this query. You'll get a random number and then find the closest id (indexed) for that number. – Nanne Jun 19 '11 at 11:46

$cnt = the number of rows. This value can be cached (which is highly recommended if you work with InnoDB).

$rnd = mt_rand(0, $cnt - 1); // the offset must stay below the row count

Query:

SELECT * FROM `table` WHERE `where_cond`='some_value' LIMIT $rnd,1

Of course, you can select any value with any WHERE clause; the whole trick is in the LIMIT $rnd, 1 part.
I like this method because it doesn't need any JOINs.
Also, it can be used with both sequential and non-sequential rows, even without an ID column.
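Here is a rough pure-SQL sketch of the same idea, assuming a placeholder table items with a type column; MySQL's LIMIT can't take a subquery directly, so a prepared statement stands in for the application-side $rnd:

-- Sketch only: `items` / `type` are placeholder names.
-- The count must use the same WHERE clause as the final query,
-- otherwise the random offset could point past the end of the result set.
SET @cnt = (SELECT COUNT(*) FROM items WHERE type = 1);
SET @rnd = CAST(FLOOR(RAND() * @cnt) AS UNSIGNED);

PREPARE pick FROM 'SELECT * FROM items WHERE type = 1 LIMIT ?, 1';
EXECUTE pick USING @rnd;
DEALLOCATE PREPARE pick;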

OZ_
  • 2 queries instead of one? (the `$cnt` needs a query) Not good sorry. – dynamic Jun 19 '11 at 11:11
  • @yes123, LOL. 1) count(*) query works VERY VERY fast; 2) count value can be and should be cached, especially when `data doesn't change`. – OZ_ Jun 19 '11 at 11:13
  • COUNT(*) without any WHERE conditions is only blazingly fast on MyISAM; for InnoDB it's not as fast as you would think – Tobias P. Jun 19 '11 at 11:22
  • @Tobias P., anyway it's much faster than RAND() even in InnoDB. And this value can (and should) be cached, so just forget about this second query. – OZ_ Jun 19 '11 at 11:28
  • @OZ_ I edited your answer just to put `$cnt` in a code tag, and you rolled it back. Are you a child or what? – dynamic Jun 19 '11 at 11:29
  • @yes123 you can see my age in my profile. You downvoted the answer just because I rolled back your silly edit? Are you a kid or just selfish? – OZ_ Jun 19 '11 at 11:32
  • @yes123, it's a very good solution. Not so elegant, but mathematically much much better than @Nanne's solution. – Karolis Jun 19 '11 at 11:43
  • @karolis: please explain. Mine works with gaps in the sequence and does not need to count the rows and cache that value. It will do some math functions quickly, and then find a row based on an index. Much quicker. – Nanne Jun 19 '11 at 11:48
  • @Nanne but this method will work even when the table has no ID column (and MAX(id) can't be calculated). – OZ_ Jun 19 '11 at 11:56
  • @Nanne I see you know the problem: the gaps. One gap doubles the preference for the next row; two consecutive gaps triple it. As you can see it is not a slight problem; it has a huge impact for a professional application from a mathematical point of view. – Karolis Jun 19 '11 at 12:19
  • That depends on the location and possible changes of the gaps. You could also go into a rant about how `rand` is only pseudo-random, but I don't think that's a problem for any real-life situation. If you want a solution for "showing a random article on my homepage", I'd say the speed is pretty much more important than a preference. If you have 250K rows, a tripled preference will still be 1/250000 versus 3/250000, which does not have to be a problem. – Nanne Jun 19 '11 at 12:23
  • @Nanne From a practical point of view, if you have more rows, that means you have more gaps. If you have 250K rows, usually that means you have more rows with a probability of 3/250000. The row count is not important in this case; the situation is almost the same. – Karolis Jun 19 '11 at 12:36
  • @Nanne By the way, I am not saying that your solution is absolutely wrong, not at all! I just wanted to say that this one looks much better, at least to me. – Karolis Jun 19 '11 at 12:40

You should read the following blog post by Jan Kneschke: ORDER BY RAND()

He lists a few possible solutions and their performance behaviour.

Tobias P.