1

I have heard that I should avoid using 'order by rand()', but I really need to use it. Unlike what I have been hearing, the following query comes up very fast.

select 

    cp1.img_id as left_id,
    cp1.img_filename as left_filename,
    cp1.facebook_name as left_facebook_name,
    cp2.img_id as right_id, 
    cp2.img_filename as right_filename,
    cp2.facebook_name as right_facebook_name    

from 
challenge_photos as cp1
cross join 
challenge_photos as cp2
where
(cp1.img_id < cp2.img_id) 
and 
(cp1.img_id,cp2.img_id) not in ((0,0)) 
and 
(cp1.img_status = 1 and cp2.img_status = 1) 
order by rand() limit 1

is this query considered 'okay'? or should I use queries that I can find by searching "alternative to rand()" ?

Moon
  • 22,195
  • 68
  • 188
  • 269
  • 1
    I'm sorry to say I don't know the specific answer to your question, but here's some info you might or might not know: The big reason why `ORDER BY RAND()` is discouraged is because such a query requires the SQL server to generate a random number for each row in your resultset (disregarding the limit) and sort only to take the smallest. – 3Doubloons Mar 30 '11 at 03:58
  • 'very fast' for how many records? 100? 100K? – David Gelhar Mar 30 '11 at 04:02
  • Alex Brault // yes that is what I thought, but I am a little bit confused because the query comes up pretty fast. – Moon Mar 30 '11 at 04:02
  • @David Gelhar // for now, 10K... – Moon Mar 30 '11 at 04:03
  • See [Rand() query and performance](http://stackoverflow.com/questions/5445055/rand-query-and-performance) for some alternatives. – David Gelhar Mar 30 '11 at 04:09

1 Answers1

2

It's usually a performance thing. You should avoid, as much as possible, per-row functions since they slow down your queries.

That means things like uppercase(name), salary * 1.1 and so on. It also includes rand(). It may not be an immediate problem (at 10,000 rows) but, if you ever want your database to scale, you should keep it in mind.

The two main issues are the fact that you're performing a per-row function and then having to do a full sort on the output before selecting the first row. The DBMS cannot use an index if you sort on a random value.

But, if you need to do it (and I'm not making judgement calls there), then you need to do it. Pragmatism often overcomes dogmatism in the real world :-)


A possibility, if performance ever becomes an issue, is to get a count of the records with something like:

select count(*) from ...

then choose a random value on the client side and use a:

limit <start>, <count>

clause in another select, adjusting for the syntax used by your particular DBMS. This should remove the sorting issue and the transmission of unneeded data across the wire.

paxdiablo
  • 854,327
  • 234
  • 1,573
  • 1,953