52

What is a fast way to select a random row from a large mysql table?

I'm working in php, but I'm interested in any solution even if it's in another language.

caitriona
  • 8,569
  • 4
  • 32
  • 36
lajos
  • 25,525
  • 19
  • 65
  • 75
  • 2
    possible duplicate of [MySQL select 10 random rows from 600K rows fast](http://stackoverflow.com/questions/4329396/mysql-select-10-random-rows-from-600k-rows-fast) – Peter O. Sep 15 '14 at 10:58

24 Answers

53

Grab all the id's, pick a random one from it, and retrieve the full row.

If you know the id's are sequential without holes, you can just grab the max and calculate a random id.

If there are holes here and there but mostly sequential values, and you don't care about a slightly skewed randomness, grab the max value, calculate an id, and select the first row with an id equal to or above the one you calculated. The reason for the skewing is that id's following such holes will have a higher chance of being picked than ones that follow another id.
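To see the skew concretely, here is a small Python simulation (illustrative only, not from the answer itself): with just two ids, 1 and 10, the row sitting after the hole is picked roughly 90% of the time instead of 50%.

```python
import random

def first_id_at_or_above(ids, rng):
    # Mimics: SELECT ... WHERE id >= <random id up to MAX(id)> ORDER BY id LIMIT 1
    target = rng.randint(1, max(ids))
    return min(i for i in ids if i >= target)

ids = [1, 10]            # one row, then a big hole, then another row
rng = random.Random(42)  # seeded for reproducibility
draws = 10_000
hits = sum(1 for _ in range(draws) if first_id_at_or_above(ids, rng) == 10)
print(f"id 10 picked in {100 * hits / draws:.1f}% of draws")  # roughly 90%, not 50%
```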

If you order by random, you're going to have a terrible table scan on your hands, and the word quick doesn't apply to such a solution.

Don't do that, and don't order by a GUID either; it has the same problem.

Lasse V. Karlsen
  • 380,855
  • 102
  • 628
  • 825
  • As of now, `ORDER BY RAND()` only takes a few seconds for a table that has a million records. – wbzy00 Dec 10 '22 at 01:26
40

I knew there had to be a fast way to do it in a single query. And here it is:

A fast way that involves no external code, with kudos to

http://jan.kneschke.de/projects/mysql/order-by-rand/

SELECT name
  FROM random AS r1 JOIN
       (SELECT (RAND() *
                     (SELECT MAX(id)
                        FROM random)) AS id)
        AS r2
 WHERE r1.id >= r2.id
 ORDER BY r1.id ASC
 LIMIT 1;
Vinko Vrsalovic
  • 330,807
  • 53
  • 334
  • 373
  • 2
    Note the tradeoff here in that, to be assured of getting a result on the first try, any keys which are preceded by gaps will be more likely to be selected. e.g., Given two records with keys 1 and 10, the record with 10 as its key will be selected 90% of the time. – Dave Sherohman Oct 17 '08 at 10:49
  • 1
    Yes, you can get a better distribution if the keys are without gaps and avoiding the WHERE and ORDER BY clauses. Check the article, it's all pretty well explained there. I didn't want to steal all of it, thus didn't put the other queries, pros and cons of each. – Vinko Vrsalovic Oct 17 '08 at 12:21
  • 1
    This query somehow does not return data at some time when you specify some extra parameter like WHERE r1.id >= r2.id AND r1.some_field=1 while some_field contains data=1. Any idea about how to solve this? – lomse Feb 18 '15 at 13:09
31

MediaWiki uses an interesting trick (for Wikipedia's Special:Random feature): the table with the articles has an extra column with a random number (generated when the article is created). To get a random article, generate a random number and get the article with the next larger or smaller (don't recall which) value in the random number column. With an index, this can be very fast. (And MediaWiki is written in PHP and developed for MySQL.)

This approach can cause a problem if the resulting numbers are badly distributed; IIRC, this has been fixed on MediaWiki, so if you decide to do it this way you should take a look at the code to see how it's currently done (probably they periodically regenerate the random number column).
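A sketch of the same trick in Python against SQLite (the table and column names are made up for illustration; MediaWiki's actual schema differs): store a random value per row, index it, then do an indexed range lookup instead of a sort.

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page (id INTEGER PRIMARY KEY, title TEXT, page_random REAL)")
conn.executemany(
    "INSERT INTO page (title, page_random) VALUES (?, ?)",
    [(f"Article {n}", random.random()) for n in range(1000)],
)
# The index is what makes the lookup fast: a range scan instead of a full sort.
conn.execute("CREATE INDEX idx_page_random ON page (page_random)")

r = random.random()
row = conn.execute(
    "SELECT title FROM page WHERE page_random >= ? ORDER BY page_random LIMIT 1", (r,)
).fetchone()
if row is None:  # r landed past the largest stored value; wrap around
    row = conn.execute("SELECT title FROM page ORDER BY page_random LIMIT 1").fetchone()
print(row[0])
```

The wrap-around branch handles the edge case raised in the comments below, where the generated value exceeds every stored one.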

CesarB
  • 43,947
  • 7
  • 63
  • 86
  • 1
    This is a beautiful idea. Is there an article or other resource detailing this? – Agnel Kurian May 10 '11 at 19:22
  • It's a nice idea, but for N desired results it may not work, I guess, because you might get fewer results or the order might be the same. – GorillaApe Oct 21 '12 at 19:25
  • It's a nice idea. But on the query we still have to sort by the random column, right? Suppose the random column is random_number, then the query is like: "SELECT * FROM mytable WHERE random_number>$rand ORDER BY random_number LIMIT 1". Is it much faster than ORDER BY RAND()? – haibuihoang Apr 21 '14 at 14:20
  • You'd need to place a degree of limitation on the maximum your random number with regard to the current number of entries. Then progress this limit with a degree of correlation to the number of rows in the table as it grows. Example is when there are not many entries. Say you have 3. Without a limit on the random number you can have say 2 very small number and one large one. The smallest of the 3 will almost never be called up when the gap between the min, itself, and the middle number is so small. What if min=0, max=100 with 3 entries & rand #'s assigned was 49, 50, 51? – 1'' Jun 11 '15 at 00:46
  • I don't understand this. How is this different from just randomizing a number between 1 and max(id) and pick the entry with that ID? Why do you need an extra column? – user2906759 Sep 01 '15 at 07:42
  • This is a really great idea, love it! – Kai Burghardt Jan 31 '17 at 16:08
  • Create an index on the column containing the random number. Then `SELECT id FROM table WHERE randNumb >= RAND() ORDER BY randNumb LIMIT 5` will be index-accelerated. If you're using MyISAM you may need a compound index on `(randNumb, id)`. – O. Jones May 05 '17 at 20:36
  • er. this is _extremely_ biased. consider the initial set with three pages: one random ended up at 0.1, another at 0.3, and the last at 0.9. this means that one has a 0.1 chance; one has a 0.2 chance; one has a 0.6 chance; and the last 0.1 chance is distributed however it handles the top-line loss. best case it's 6x distorted; worst it's 7x. "but it'll even out over time!" a look at the index shows that after almost two decades as one of the largest sites on earth, it hasn't. it actually gets *worse* over time as near-matches shave other pages out of existence. terrible idea – John Haugeland Aug 06 '17 at 17:36
  • This will give you rows according to some non-uniform distribution which itself has been sampled uniformly from the set of all possible distributions. That is, if you fetch a single article, from your perspective, it will be completely random (even if you add some WHERE condition, which is not possible with the `rand()*max(id)` trick). If you fetch articles multiple times, their cross-correlations will be nothing like the cross-correlation from two unformly random picks (e.g. the chance of getting the same article twice will be much larger). Depends on your use case whether that is a problem. – Tgr Sep 27 '21 at 03:53
14

Here's a solution that runs fairly quickly, and it gets a better random distribution without depending on id values being contiguous or starting at 1.

SET @r := (SELECT ROUND(RAND() * (SELECT COUNT(*) FROM mytable)));
SET @sql := CONCAT('SELECT * FROM mytable LIMIT ', @r, ', 1');
PREPARE stmt1 FROM @sql;
EXECUTE stmt1;
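The same two-step idea can be driven from application code instead of a MySQL prepared statement. A sketch in Python against SQLite (illustrative only; note that `random.randrange(count)` keeps the offset in 0 to count-1, whereas `ROUND(RAND() * COUNT(*))` can occasionally equal the row count and return nothing):

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, val TEXT)")
conn.executemany("INSERT INTO mytable (val) VALUES (?)",
                 [(f"row {n}",) for n in range(100)])

(count,) = conn.execute("SELECT COUNT(*) FROM mytable").fetchone()
offset = random.randrange(count)  # always a valid offset: 0 .. count-1
row = conn.execute("SELECT * FROM mytable LIMIT 1 OFFSET ?", (offset,)).fetchone()
print(row)
```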
Bill Karwin
  • 538,548
  • 86
  • 673
  • 828
  • 1
    How do you get the row returned by this SQL query using PHP? Setting `$query` equal to the above and then doing the usual `mysql_query($query)` is not returning any results. Thanks. – ProgrammerGirl Apr 11 '12 at 23:00
  • 1
    That's 1.5 table scans -- 1 for the `COUNT(*)` (assuming InnoDB), something less than a full scan for the `OFFSET @r`. But it is excellent at being random and not depending on the properties of an id. – Rick James Aug 06 '15 at 21:41
  • @RickJames, Right. Another solution would be to enumerate the rows with a new column that is filled with serial integers. Then one can get the greatest with MAX() instead of COUNT(), and then choose it by index without coping with gaps. Though that solution requires renumbering as rows come and go. – Bill Karwin Aug 07 '15 at 02:29
4

Maybe you could do something like:

SELECT * FROM table 
  WHERE id=
    (FLOOR(RAND() * 
           (SELECT COUNT(*) FROM table)
          )
    );

This is assuming your ID numbers are all sequential with no gaps. Note also that FLOOR(RAND() * COUNT(*)) ranges from 0 to COUNT(*) - 1, so add 1 if your ids start at 1.
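One caveat worth flagging: in MySQL, RAND() in a WHERE clause is re-evaluated for every row, so a query like this can match zero rows or several. Computing the random id in application code first sidesteps that. A Python/SQLite sketch of that variant (names are illustrative):

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, val TEXT)")
conn.executemany("INSERT INTO t (val) VALUES (?)", [(f"v{n}",) for n in range(50)])

(max_id,) = conn.execute("SELECT MAX(id) FROM t").fetchone()
rand_id = random.randint(1, max_id)  # fixed once, before the query runs
row = conn.execute("SELECT * FROM t WHERE id = ?", (rand_id,)).fetchone()
print(row)
```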

davr
  • 18,877
  • 17
  • 76
  • 99
3

Add a column containing a calculated random value to each row, and use that in the ordering clause, limiting to one result upon selection. This works out faster than the table scan that ORDER BY RAND() causes.

Update: You still need to calculate some random value prior to issuing the SELECT statement upon retrieval, of course, e.g.

SELECT * FROM `foo` WHERE `foo_rand` >= {some random value} LIMIT 1
Rob
  • 47,999
  • 5
  • 74
  • 91
  • I thought about that. Add a new indexed column and on row creation, assign a random int to it. But the problem with that is I'm storing unnecessary data and you would still have to do something else to actually get a random row out of it, since the random column data is static. – David Sep 26 '08 at 22:31
  • 1
    How come this is -2, yet Cesar B's one is +17? They seem pretty much the same to me. – Jarrod Mosen May 13 '12 at 22:24
  • Should it be "SELECT * FROM `foo` WHERE `foo_rand` >= {some random value} ORDER BY foo_rand LIMIT 1"? – haibuihoang Apr 21 '14 at 14:22
  • What if your {some random value} is greater than the highest pre-generated random number in the table though. You'll return an empty recordset. – Codemonkey Jun 25 '18 at 12:19
1

There is another way to produce random rows using only a query and without order by rand(). It involves User Defined Variables. See how to produce random rows from a table

Ilan Hazan
  • 11
  • 1
1

I see a lot of solutions here. One or two seem OK, but the other solutions have constraints. The following solution will work in all situations:

select a.* from random_data a, (select max(id)*rand() randid  from random_data) b
     where a.id >= b.randid limit 1;

Here, id does not need to be sequential. It could be any primary key, unique, or auto-increment column. Please see the following: Fastest way to select a random row from a big MySQL table

Thanks Zillur - www.techinfobest.com

Jhanvi
  • 5,069
  • 8
  • 32
  • 41
Zillur
  • 11
  • 2
1

If you don't delete rows from this table, the most efficient way is:

(if you already know the minimum id, just skip the first query)

SELECT MIN(id) AS minId, MAX(id) AS maxId FROM table WHERE 1

$randId=mt_rand((int)$row['minId'], (int)$row['maxId']);

SELECT id,name,... FROM table WHERE id=$randId LIMIT 1
parm.95
  • 13
  • 4
1

In order to find random rows from a table, don't use ORDER BY RAND(), because it forces MySQL to do a full file sort and only then retrieve the number of rows required by LIMIT. To avoid this full file sort, use the RAND() function in the WHERE clause only. It will stop as soon as it reaches the required number of rows. See http://www.rndblog.com/how-to-select-random-rows-in-mysql/

Sagi Bron
  • 27
  • 3
1

For selecting multiple random rows from a given table (say 'words'), our team came up with this beauty:

SELECT * FROM
`words` AS r1 JOIN 
(SELECT  MAX(`WordID`) as wid_c FROM `words`) as tmp1
WHERE r1.WordID >= (SELECT (RAND() * tmp1.wid_c) AS id) LIMIT n
Marek Sebera
  • 39,650
  • 37
  • 158
  • 244
0

I ran into the problem where my IDs were not sequential. Here is what I came up with.

SELECT * FROM products WHERE RAND()<=(5/(SELECT COUNT(*) FROM products)) LIMIT 1

The rows returned are approximately 5, but I limit it to 1.

If you want to add another WHERE clause it becomes a bit more interesting. Say you want to search for products on discount.

SELECT * FROM products WHERE RAND()<=(100/(SELECT COUNT(*) FROM products)) AND discount<.2 LIMIT 1

What you have to do is make sure you are returning enough results, which is why I have it set to 100. Having the WHERE discount<.2 clause in the subquery was 10x slower, so it's better to return more results and limit.
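Why "enough results" matters can be checked with a quick simulation (pure Python, illustrative only): WHERE RAND() <= k/N keeps each row independently with probability k/N, so about k rows survive on average, but the actual count fluctuates and can even be zero, which is why over-sampling and then applying LIMIT is the safer move.

```python
import random

n_rows, k = 10_000, 5
rng = random.Random(1)  # seeded for reproducibility
# Each "row" survives the filter independently with probability k/n_rows.
kept = sum(1 for _ in range(n_rows) if rng.random() <= k / n_rows)
print(kept)  # around 5 on average, but it varies from run to run
```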

bikedorkseattle
  • 951
  • 3
  • 10
  • 24
0

The classic "SELECT id FROM table ORDER BY RAND() LIMIT 1" is actually OK.

See the follow excerpt from the MySQL manual:

If you use LIMIT row_count with ORDER BY, MySQL ends the sorting as soon as it has found the first row_count rows of the sorted result, rather than sorting the entire result.

igelkott
  • 1,287
  • 8
  • 9
  • 2
    But it still has to assign a random number to each and every record, doesn't it? I ask because that explanation doesn't make much sense to me: how it is going to return first N sorted rows if the whole resultset is not sorted :S – Damir Zekić Oct 25 '08 at 23:46
  • @igelkott, there's still performance issue, I guess it's not OK – Unreality Nov 03 '09 at 02:26
0

An easy but slow way would be (good for smallish tables)

SELECT * from TABLE order by RAND() LIMIT 1
Vinko Vrsalovic
  • 330,807
  • 53
  • 334
  • 373
  • 3
    This will produce a random value for all the rows in the table, a sort, and then grabbing one row. This is not quick. – Lasse V. Karlsen Oct 17 '08 at 07:43
  • 1
    True. It's quick in development time though. (and in answer time :-) ). I'll leave it here for non big table users who might need it – Vinko Vrsalovic Oct 17 '08 at 07:49
  • "smallish" can be surprisingly small (I've run into problems with a 20k entry table on a virtual host), and tracking this kind of problem down can be a *royal* pain in the back. Do yourself a favour and use a proper algorithm from the start. – Creshal Jun 21 '13 at 12:03
  • This is going to cause a big performance drain for large tables. Check this similar question http://stackoverflow.com/questions/1244555/how-can-i-optimize-mysqls-order-by-rand-function – iankit Jan 27 '14 at 06:26
0

With an ORDER BY you will do a full table scan. It's best to do a SELECT COUNT(*) first, and then get a random row by choosing rownum between 0 and the last record.

MazarD
  • 2,759
  • 2
  • 22
  • 33
0

In pseudo code:

sql "select id from table"
store result in list
n = random(size of list)
sql "select * from table where id=" + list[n]

This assumes that id is a unique (primary) key.
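The pseudocode above translates almost directly to Python; here is a runnable sketch against SQLite (table and column names are made up). Because the pick is made over the actual list of ids, gaps in the key space introduce no skew.

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO t (id, name) VALUES (?, ?)",
                 [(i, f"name {i}") for i in (1, 5, 9, 42)])  # gaps don't matter here

ids = [r[0] for r in conn.execute("SELECT id FROM t")]   # "select id from table"
chosen = random.choice(ids)                              # n = random(size of list)
row = conn.execute("SELECT * FROM t WHERE id = ?", (chosen,)).fetchone()
print(row)
```

The trade-off, as other answers note, is pulling every id into memory, which gets expensive on very large tables.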

Anders Sandvig
  • 20,720
  • 16
  • 59
  • 73
0

Use the query below to get a random row:

SELECT user_firstname ,
COUNT(DISTINCT usr_fk_id) cnt
FROM userdetails 
GROUP BY usr_fk_id 
ORDER BY cnt ASC  
LIMIT 1
Dane I
  • 742
  • 9
  • 22
MANOJ
  • 177
  • 5
  • 14
0

In my case my table has an id as primary key, auto-increment with no gaps, so I can use COUNT(*) or MAX(id) to get the number of rows.

I made this script to test the fastest operation:

logTime();
query("SELECT COUNT(id) FROM tbl");
logTime();
query("SELECT MAX(id) FROM tbl");
logTime();
query("SELECT id FROM tbl ORDER BY id DESC LIMIT 1");
logTime();

The results are:

  • Count: 36.8418693542479 ms
  • Max: 0.241041183472 ms
  • Order: 0.216960906982 ms

Answer with the order method:

SELECT FLOOR(RAND() * (
    SELECT id FROM tbl ORDER BY id DESC LIMIT 1
)) AS n;

...
SELECT * FROM tbl WHERE id = $result;
António Almeida
  • 9,620
  • 8
  • 59
  • 66
0

I have used this and it did the job; the reference is from here:

SELECT * FROM myTable WHERE RAND()<(SELECT ((30/COUNT(*))*10) FROM myTable) ORDER BY RAND() LIMIT 30;
Yousef Altaf
  • 2,631
  • 4
  • 46
  • 71
0

I'm a bit new to SQL, but how about generating a random number in PHP and using

SELECT * FROM the_table WHERE primary_key >= $randNr

This doesn't solve the problem of holes in the table.

But here's a twist on lassevk's suggestion:

SELECT primary_key FROM the_table

Then use mysql_num_rows() in PHP to create a random number based on the above result:

SELECT * FROM the_table WHERE primary_key = rand_number

On a side note, just how slow is SELECT * FROM the_table, creating a random number based on mysql_num_rows(), and then moving the data pointer to that point with mysql_data_seek()? How slow will this be on large tables with, say, a million rows?

LPL
  • 16,827
  • 6
  • 51
  • 95
0

Creating a function to do this is most likely the best and fastest answer here!

Pros: it works even with gaps and is extremely fast.

<?php

$sqlConnect = mysqli_connect('localhost', 'username', 'password', 'database');

function rando($data = '', $find = 0, $max = 0) {
    global $sqlConnect; // mysqli connection variable defined outside the function

    if ($data == 's1') {
        $query = mysqli_query($sqlConnect, "SELECT * FROM `yourtable` ORDER BY `id` DESC LIMIT {$find},1");
        if ($query && mysqli_num_rows($query) > 0) {
            return mysqli_fetch_assoc($query);
        }
        return rando('', 0, $max); // Start over, the results returned nothing
    }

    if ($max != 0) {
        $irand = rand(0, $max);
        return rando('s1', $irand, $max); // Retry with a new random ID to fetch
    }

    $query = mysqli_query($sqlConnect, "SELECT `id` FROM `yourtable` ORDER BY `id` DESC LIMIT 0,1");
    $fetched_data = mysqli_fetch_assoc($query);
    $max = $fetched_data['id'];
    $irand = rand(1, $max);
    return rando('s1', $irand, $max); // Runs rando against the random ID we selected; returns data if the row exists
}

$your_data = rando(); // Returns a random entry as an associative array
?>

Please keep in mind this code has not been tested, but it is a working concept for returning random entries even with gaps, as long as the gaps are not large enough to cause load-time issues.
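The retry idea in this answer amounts to rejection sampling, which stays unbiased even with gaps in the id space. A simplified variant in Python with SQLite that retries by id rather than by offset (names are illustrative):

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO t (id, name) VALUES (?, ?)",
                 [(i, f"n{i}") for i in range(1, 101) if i % 7 != 0])  # leave gaps

(max_id,) = conn.execute("SELECT MAX(id) FROM t").fetchone()
row = None
while row is None:  # re-roll whenever the random id falls into a gap
    guess = random.randint(1, max_id)
    row = conn.execute("SELECT * FROM t WHERE id = ?", (guess,)).fetchone()
print(row)
```

Every existing row is equally likely; the cost is an unpredictable number of retries, which grows with the fraction of the id range that is holes.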

0

Take a look at this link by Jan Kneschke or this SO answer as they both discuss the same question. The SO answer goes over various options also and has some good suggestions depending on your needs. Jan goes over all the various options and the performance characteristics of each. He ends up with the following for the most optimized method by which to do this within a MySQL select:

SELECT name
  FROM random AS r1 JOIN
       (SELECT (RAND() *
                     (SELECT MAX(id)
                        FROM random)) AS id)
        AS r2
 WHERE r1.id >= r2.id
 ORDER BY r1.id ASC
 LIMIT 1;

HTH,

-Dipin

Community
  • 1
  • 1
Dipin
  • 1,174
  • 9
  • 8
-1

Quick and dirty method:

SET @COUNTER = (SELECT COUNT(*) FROM your_table);

-- MySQL does not allow expressions in LIMIT/OFFSET, so use a prepared statement:
SET @sql = CONCAT('SELECT PrimaryKey FROM your_table LIMIT 1 OFFSET ',
                  FLOOR(RAND() * @COUNTER));
PREPARE stmt FROM @sql;
EXECUTE stmt;

The complexity of the first query is O(1) for MyISAM tables.

The second query requires a full table scan. Complexity = O(n)

Dirty and quick method:

Keep a separate table for this purpose only. You should also insert the same rows to this table whenever inserting to the original table. Assumption: No DELETEs.

CREATE TABLE Aux(
  MyPK INT AUTO_INCREMENT,
  PrimaryKey INT
);

SET @MaxPK = (SELECT MAX(MyPK) FROM Aux);
SET @RandPK = CAST(RAND() * @MaxPK AS SIGNED);
SET @PrimaryKey = (SELECT PrimaryKey FROM Aux WHERE MyPK = @RandPK);

If DELETEs are allowed,

SET @delta = CAST(@RandPK / 10 AS SIGNED);

SET @PrimaryKey = (SELECT PrimaryKey
                   FROM Aux
                   WHERE MyPK BETWEEN @RandPK - @delta AND @RandPK + @delta
                   LIMIT 1);

The overall complexity is O(1).

Community
  • 1
  • 1
yogman
  • 4,021
  • 4
  • 24
  • 27
-2

SELECT DISTINCT * FROM yourTable WHERE 4 = 4 LIMIT 1;

JNYRanger
  • 6,829
  • 12
  • 53
  • 81
proggL
  • 1
  • 1