
I'm looking for an efficient way of randomly selecting 100 rows satisfying certain conditions from a MySQL table with potentially millions of rows.

Almost everything I've found suggests avoiding the use of ORDER BY RAND(), because of poor performance and scalability.

However, this article suggests ORDER BY RAND() may still be used as a "nice and fast way" to fetch random data.

Based on this article, below is some example code showing what I'm trying to accomplish. My questions are:

  1. Is this an efficient way of randomly selecting 100 (or up to several hundred) rows from a table with potentially millions of rows?

  2. When will performance become an issue?

    SELECT  user.*
    FROM    (
            SELECT  id
            FROM    user
            WHERE   is_active = 1
            AND     deleted = 0
            AND     expiretime > '.time().'  -- Unix timestamp interpolated by PHP
            AND     id NOT IN (10, 13, 15)
            AND     id NOT IN (20, 30, 50)
            AND     id NOT IN (103, 140, 250)
        ORDER BY    RAND()
            LIMIT   100
            ) AS random_users
    STRAIGHT_JOIN   user
    ON      user.id = random_users.id
  • It makes sense to select random values on an indexed field. – Kayser Mar 28 '12 at 16:52
  • @Kayser, I'm concerned that we still have to scan ALL rows for the WHERE conditions. Is that going to affect performance with a huge table (potentially millions of rows)? – user1298692 Mar 28 '12 at 17:05
  • The method with the PK subselect is likely to reduce execution time only marginally. That's because, with or without this technique, RAND() is called for all matching rows and the number of rows to sort is the same. Presumably it helps only if "user" has many columns or large ones, and MySQL is not clever enough to defer materializing user.* until after the LIMIT takes place (that should be tested). – Daniel Vérité Mar 28 '12 at 19:49
  • Also read http://stackoverflow.com/questions/249301/simple-random-samples-from-a-mysql-database – Daniel Vérité Mar 28 '12 at 19:52

3 Answers


I strongly urge you to read this article. Its last section covers selecting multiple random rows, and the SELECT statement in the PROCEDURE described there is the spot where you add your specific WHERE conditions.

The problem with ORDER BY RAND() is that the sort has a complexity of n*log2(n), while the method described in the article I linked has almost constant complexity.

Let's assume that selecting a random row from a table containing 10 entries using ORDER BY RAND() takes 1 time unit:

  entries  |  time units
-------------------------
       10  |         1     /* if this takes 0.001s */
      100  |        20
    1'000  |       300
   10'000  |     4'000
  100'000  |    50'000
1'000'000  |   600'000     /* then this will need 10 minutes */

And you wrote that you are dealing with a table on the scale of millions of rows.
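For reference, here is a minimal single-row sketch of the constant-time idea (my own illustration, not the article's exact code): pick a random id between MIN(id) and MAX(id), then grab the first row at or above it. It assumes a mostly gapless AUTO_INCREMENT id column; the article's PROCEDURE essentially loops this to collect several rows, and its SELECT is where your WHERE conditions would go.

    -- Sketch only: assumes id is AUTO_INCREMENT with few gaps.
    SELECT  u.*
    FROM    user AS u
    JOIN    (
            SELECT FLOOR(MIN(id) + RAND() * (MAX(id) - MIN(id))) AS rid
            FROM   user
            ) AS r ON u.id >= r.rid
    ORDER BY u.id
    LIMIT 1;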

tereško

Preprocess as much as possible. Try something like this (VB-like example):

Dim oRandom As New Random()
Dim iMaxID As Integer = 0   ' ** put your MAX(id) here **
' Ids the question wants excluded.
Dim oExcluded As New HashSet(Of Integer) From {10, 13, 15, 20, 30, 50, 103, 140, 250}
' Sets avoid duplicate ids and the substring false matches of a "10,13,..." string check.
Dim oPicked As New HashSet(Of Integer)()
While oPicked.Count < 100
    Dim iRndVal As Integer = oRandom.Next(1, iMaxID + 1)
    If Not oExcluded.Contains(iRndVal) Then oPicked.Add(iRndVal)
End While
' ToUnixTimeSeconds() stands in for the PHP time() used in the question.
Dim sSql As String = String.Format(
    "SELECT * FROM user WHERE is_active = 1 AND deleted = 0 " &
    "AND expiretime > {0} AND id IN ({1}) LIMIT 100",
    DateTimeOffset.UtcNow.ToUnixTimeSeconds(), String.Join(",", oPicked))

I didn't check the syntax, but you'll get my drift, I hope. This way MySQL only has to fetch the rows whose ids appear in the IN list, instead of evaluating and sorting every matching row first. One caveat: ids that fall in gaps or fail the WHERE conditions are simply absent from the result, so the query can return fewer than 100 rows and you may need a second pass to top up.

Please let me know the elapsed-time difference if you try it. (I'm curious.)

Ton

I'm afraid no one will be able to answer your question with any accuracy. If you really want to know, you'll need to run some benchmarks against your system (ideally an exact copy rather than the live one). Benchmark this solution against a different one (getting the random rows using PHP, for example) and compare the numbers to what you/your client consider "good performance". Then ramp up your data, keeping the distribution of column values as close to real as you can, and see where performance starts to drop off. To be honest, if it works for you now with a bit of headroom, I'd go for it. When (if!) it becomes a bottleneck, you can look at it again, or just chuck extra iron at your database...
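If you want a quick way to compare candidate queries on a copy of the data, MySQL's session profiler is handy. A minimal sketch, assuming the question's table (SHOW PROFILES exists in MySQL 5.x, though it is deprecated in favour of the Performance Schema):

    SET profiling = 1;

    -- Approach A: ORDER BY RAND()
    SELECT id FROM user
    WHERE  is_active = 1 AND deleted = 0
    ORDER BY RAND() LIMIT 100;

    -- Approach B: ids pre-generated in application code
    SELECT id FROM user
    WHERE  is_active = 1 AND deleted = 0
      AND  id IN (17, 42, 913);  -- substitute the 100 pre-generated ids

    SHOW PROFILES;  -- compare the Duration column of the two queries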

liquorvicar