19

Well, this is a very old question never gotten real solution. We want 3 random rows from a table with about 30k records. The table is not so big in point of view MySQL, but if it represents products of a store, it's representative. The random selection is useful when one presents 3 random products in a webpage for example. We would like a single SQL string solution that meets these conditions:

  1. In PHP, the recordset by PDO or MySQLi must have exactly 3 rows.
  2. They have to be obtained by a single MySQL query without Stored Procedure used.
  3. The solution must be quick as for example a busy apache2 server, MySQL query is in many situations the bottleneck. So it has to avoid temporary table creation, etc.
  4. The 3 records must be not contiguous, ie, they must not to be at the vicinity one to another.

The table has the following fields:

CREATE TABLE Products (
  ID INT(8) NOT NULL AUTO_INCREMENT,
  Name VARCHAR(255) default NULL,
  HasImages INT default 0,
  ...
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

The WHERE constraint is Products.HasImages=1 permitting to fetch only records that have images available to show on the webpage. About one-third of records meet the condition of HasImages=1.

Searching for a Perfection, we first let aside the existent Solutions that have drawbacks:


I. This basic solution using ORDER BY RAND(),

is too slow but guarantees 3 really random records at each query:

SELECT ID, Name FROM Products WHERE HasImages=1 ORDER BY RAND() LIMIT 3;

*CPU about 0.10s, scanning 9690 rows because of WHERE clause, Using where; Using temporary; Using filesort, on Debian Squeeze Double-Core Linux box, not so bad but

not so scalable to a bigger table as temporary table and filesort are used, and takes me 8.52s for the first query on the test Windows7::MySQL system. With such a poor performance, to avoid for a webpage isn't-it ?


II. The bright solution of riedsio using JOIN ... RAND(),

from MySQL select 10 random rows from 600K rows fast, adapted here is only valid for a single random record, as the following query results in an almost always contiguous records. In effect it gets only a random set of 3 continuous records in IDs:

SELECT Products.ID, Products.Name
FROM Products
INNER JOIN (SELECT (RAND() * (SELECT MAX(ID) FROM Products)) AS ID)
  AS t ON Products.ID >= t.ID
WHERE (Products.HasImages=1)
ORDER BY Products.ID ASC
LIMIT 3;

*CPU about 0.01 - 0.19s, scanning 3200, 9690, 12000 rows or so randomly, but mostly 9690 records, Using where.


III. The best solution seems the following with WHERE ... RAND(),

seen on MySQL select 10 random rows from 600K rows fast proposed by bernardo-siu:

SELECT Products.ID, Products.Name FROM Products
WHERE ((Products.Hasimages=1) AND RAND() < 16 * 3/30000) LIMIT 3;

*CPU about 0.01 - 0.03s, scanning 9690 rows, Using where.

Here 3 is the number of wished rows, 30000 is the RecordCount of the table Products, 16 is the experimental coefficient to enlarge the selection in order to warrant the 3 records selection. I don't know on what basis the factor 16 is an acceptable approximation.

We so get at the majority of cases 3 random records and it's very quick, but it's not warranted: sometimes the query returns only 2 rows, sometimes even no record at all.

The three above methods scan all records of the table meeting WHERE clause, here 9690 rows.

A better SQL String?

Community
  • 1
  • 1
jacouh
  • 8,473
  • 5
  • 32
  • 43
  • If the table is small, what's wrong with http://davidwalsh.name/mysql-random – Anthony Sep 22 '13 at 11:46
  • 3
    Not really the solution you are looking for, but to increase performance, you could have a cronjob choose 3 random products every 5 minutes and cache those in a file, then include that file in your pages. – Sumurai8 Sep 22 '13 at 11:46
  • @Anthony Read the whole question... – Sumurai8 Sep 22 '13 at 11:49
  • @Sumurai8 - I did. Maybe you could tell me what I didn't get... – Anthony Sep 22 '13 at 11:50
  • @Anthony OP's first solution is the query your article suggests. OP is concerned about the performance of that query and wants a query that runs faster, and reliably returns 3 records. – Sumurai8 Sep 22 '13 at 11:53
  • 2
    Oh I see. If it takes 8 seconds, that means there probably no index. A full table scan should not take that long for that row count. – Anthony Sep 22 '13 at 11:54
  • @Anthony, I'm impressed by the slowness of such a query launched first time on the Windows system. Although better for queries that follow - Good Sunday. – jacouh Sep 22 '13 at 12:11
  • @Sumurai8, the cronjob solution will offer 3 random rows for all queries during 5 min, at minimum 1min. But a good random selection permits different RecordSet at each query. – jacouh Sep 22 '13 at 12:24
  • 1
    maybe faster if you get 3 random ID's in php and then fetch rows with this ID's ? – joschua011 Sep 22 '13 at 12:29
  • @joschua011, that's doable if ID is continuous, and more difficult to satisfy is that if there is no WHERE constraints without additional MySQL queries, as PHP does not know which ID has images available. – jacouh Sep 22 '13 at 12:58
  • 1
    I tend to agree with @Sumurai8. Rather than whacking your products list table on every new request, then complain about performance, why don't you pre-load your random choices into a file or a secondary table. You could pre-fetch 20 or 100 combinations in advance (updated on a hourly/daily basis perhaps), and then randomly select from these options instead. – cartbeforehorse Sep 22 '13 at 16:36
  • @cartbeforehorse, It's more attractive to have a one single step solution rather than a two steps one. Nothing to do if you have a miraculous random SELECT SQL string. – jacouh Sep 22 '13 at 16:46
  • Another question that immediately springs to mind is whether you *really* want your featured products to be randomly selected. Surely there must be external reasons that you want some articles to be given a preferential billing. This may not be a valid comment as I don't have all the information about your site. However, there is something about this question that makes my hair stand on end, and leads me to wonder whether you're approaching the whole thing from the wrong direction. – cartbeforehorse Sep 22 '13 at 16:51
  • @jacouh. Why is it more attractive? If it delivers you the results you want and eliminates the performance issues, in what way is that less attractive? It may be more complicated to program, I grant you that. But if that's your only reasoning, then you're just being lazy. It's for situations like these that several RDBMS systems give you the option of creating TEMP or in-memory tables. – cartbeforehorse Sep 22 '13 at 16:55
  • @cartbeforehorse, sorry, my question is how to do, not why. – jacouh Sep 22 '13 at 16:56
  • @cartbeforehorse, I studied your last idea, it may be an alternative solution, but needs more mastering of the Database. – jacouh Sep 22 '13 at 17:41
  • 1
    Test results for Solution III: I am not retrieving random results. i'm having the first 10000 rows set to HasImages=0 (id 1 to 10000), and the proceding 20000 rows to HasImages=1 (id 10001 to 30000). I'm never receiving results for id's above 15000. – Tobias Strebitzer Sep 24 '13 at 10:39
  • @Tobias Strebitzer, IDs HasImages are normally distributed randomly, as we take a photo of an item as we receive it and then SET HasImages = 1. As we had no reflex to do it before. Solution III is far away from a universal solution. It must be adapted to number of .RecordCount and images density, especially the enlarging factor 16. – jacouh Sep 24 '13 at 10:51

6 Answers6

5

Ugly, but quick and random. Can become very ugly very fast, especially with tuning described below, so make sure you really want it this way.

(SELECT Products.ID, Products.Name
FROM Products
    INNER JOIN (SELECT RAND()*(SELECT MAX(ID) FROM Products) AS ID) AS t ON Products.ID >= t.ID
WHERE Products.HasImages=1
ORDER BY Products.ID
LIMIT 1)

UNION ALL

(SELECT Products.ID, Products.Name
FROM Products
    INNER JOIN (SELECT RAND()*(SELECT MAX(ID) FROM Products) AS ID) AS t ON Products.ID >= t.ID
WHERE Products.HasImages=1
ORDER BY Products.ID
LIMIT 1)

UNION ALL

(SELECT Products.ID, Products.Name
FROM Products
    INNER JOIN (SELECT RAND()*(SELECT MAX(ID) FROM Products) AS ID) AS t ON Products.ID >= t.ID
WHERE Products.HasImages=1
ORDER BY Products.ID
LIMIT 1)

First row appears more often than it should

If you have big gaps between IDs in your table, rows right after such gaps will have bigger chance to be fetched by this query. In some cases, they will appear significatnly more often than they should. This can not be solved in general, but there's a fix for a common particular case: when there's a gap between 0 and the first existing ID in a table.

Instead of subquery (SELECT RAND()*<max_id> AS ID) use something like (SELECT <min_id> + RAND()*(<max_id> - <min_id>) AS ID)

Remove duplicates

The query, if used as is, may return duplicate rows. It is possible to avoid that by using UNION instead of UNION ALL. This way duplicates will be merged, but the query no longer guarantees to return exactly 3 rows. You can work around that too, by fetching more rows than you need and limiting the outer result like this:

(SELECT ... LIMIT 1)
UNION (SELECT ... LIMIT 1)
UNION (SELECT ... LIMIT 1)
...
UNION (SELECT ... LIMIT 1)
LIMIT 3

There's still no guarantee that 3 rows will be fetched, though. It just makes it more likely.

Leo Nyx
  • 696
  • 3
  • 5
  • 2
    This solution assures really 3 times of one random row. The drawback is that the three ID's can be mathematically the same, eg, 10088, 10088, 10088. But this may never occur in the real world. It uses up CPU 0.01 to 0.20s. – jacouh Sep 22 '13 at 16:20
  • @jacouh do you have an index on `(HasImages)`? This should be much faster than 0.20seconds. – ypercubeᵀᴹ Sep 22 '13 at 23:18
  • @ypercube, yes this table has PK .ID and .HasImages is also indexed. – jacouh Sep 23 '13 at 06:46
  • i just tested this method, and i'm receiving a specific record significantly often (the lowest-id product with HasImages=1). If i change the filter to HasImages=0, i'm not receiving 3 rows most of the time. – Tobias Strebitzer Sep 24 '13 at 06:18
  • @TobiasStrebitzer Your IDs are not uniformly distributed. You have a big hole before your lowest ID. You could try to fix that by changing the way random ID is generated. Try: ` + RAND()*( - )`. – Leo Nyx Sep 24 '13 at 08:36
  • As days passed, no better solution has been proposed. I select this solution as accepted, more, SELECT MAX(ID) FROM Products can be cached as constant, ie, no MySQL query is needed. It'll be faster. More, three records can be garanteed. – jacouh Sep 30 '13 at 19:43
  • @LeoNyx I tested this same method for my problem but I returns same products twice. How can one stop such repetition? – OmniPotens Oct 04 '13 at 09:30
  • 1
    @OmniPotens I expanded the post, check it out. – Leo Nyx Oct 04 '13 at 12:44
2
SELECT Products.ID, Products.Name
FROM Products
INNER JOIN (SELECT (RAND() * (SELECT MAX(ID) FROM Products)) AS ID) AS t ON Products.ID     >= t.ID
WHERE (Products.HasImages=1)
ORDER BY Products.ID ASC
LIMIT 3;

Of course the above is given "near" contiguous records you are feeding it the same ID every time without much regard to the seed of the rand function.

This should give more "randomness"

SELECT Products.ID, Products.Name
FROM Products
INNER JOIN (SELECT (ROUND((RAND() * (max-min))+min)) AS ID) AS t ON Products.ID     >= t.ID
WHERE (Products.HasImages=1)
ORDER BY Products.ID ASC
LIMIT 3;

Where max and min are two values you choose, lets say for example sake:

max = select max(id)
min = 225
Woot4Moo
  • 23,987
  • 16
  • 94
  • 151
  • I mean that the query fetched 3 contiguous records. I tested your solution, it gets always continuous ID values such as: 10485, 10487, 10488. ID=10486 is filtered out as the corresponding product has no available images, ie, HasImages=0. – jacouh Sep 22 '13 at 14:12
1

This statement executes really fast (19 ms on a 30k records table):

$db = new PDO('mysql:host=localhost;dbname=database;charset=utf8', 'username', 'password');
$stmt = $db->query("SELECT p.ID, p.Name, p.HasImages
                    FROM (SELECT @count := COUNT(*) + 1, @limit := 3 FROM Products WHERE HasImages = 1) vars
                    STRAIGHT_JOIN (SELECT t.*, @limit := @limit - 1 FROM Products t WHERE t.HasImages = 1 AND (@count := @count -1) AND RAND() < @limit / @count) p");
$products = $stmt->fetchAll(PDO::FETCH_ASSOC);

The Idea is to "inject" a new column with randomized values, and then sort by this column. The generation of and sorting by this injected column is way faster than the "ORDER BY RAND()" command.

There "might" be one caveat: You have to include the WHERE query twice.

  • I would seriously test it. – jacouh Sep 22 '13 at 13:10
  • I ran it on Debian Squeeze (mysqld Ver 5.1.63-0+squeeze1-log), it takes 0.18s. With "16" factor approximation, <= 0.01s. Using somes indexes without any temporary table creation according to EXPLAIN... – jacouh Sep 22 '13 at 13:40
  • As a benchmark, "SELECT * FROM PRODUCTS WHERE HasImages = 1 ORDER BY RAND() LIMIT 3;" takes 28ms on my machine, which means that the above statement is ~33% faster. There might be something wrong with your table/mysql as @Anthony mentioned previously. – Tobias Strebitzer Sep 22 '13 at 14:37
  • 1
    I mean that your solution in my server takes 0.18s, the solution III with "SELECT Products.ID, Products.Name FROM Products WHERE ((Products.Hasimages=1) AND RAND() < 16 * 3/30000) LIMIT 3;" takes <= 0.01s. Not ORDER BY RAND() solution I. – jacouh Sep 22 '13 at 14:51
1

I've been testing the following bunch of SQLs on a 10M-record, poorly designed database.

SELECT COUNT(ID)
INTO @count
FROM Products
WHERE HasImages = 1;

PREPARE random_records FROM
'(
    SELECT * FROM Products WHERE HasImages = 1 LIMIT ?, 1
) UNION (
    SELECT * FROM Products WHERE HasImages = 1 LIMIT ?, 1
) UNION (
    SELECT * FROM Products WHERE HasImages = 1 LIMIT ?, 1
)';

SET @l1 = ROUND(RAND() * @count);
SET @l2 = ROUND(RAND() * @count);
SET @l3 = ROUND(RAND() * @count);

EXECUTE random_records USING @l1
    , @l2
    , @l3;
DEALLOCATE PREPARE random_records;

It took almost 7 minutes to get the three results. But I'm sure its performance will be much better in your case. Yet if you are looking for a better performance I suggest the following ones as they took less than 30 seconds for me to get the job done (on the same database).

SELECT COUNT(ID)
INTO @count
FROM Products
WHERE HasImages = 1;

PREPARE random_records FROM
'SELECT * FROM Products WHERE HasImages = 1 LIMIT ?, 1';

SET @l1 = ROUND(RAND() * @count);
SET @l2 = ROUND(RAND() * @count);
SET @l3 = ROUND(RAND() * @count);

EXECUTE random_records USING @l1;
EXECUTE random_records USING @l2;
EXECUTE random_records USING @l3;

DEALLOCATE PREPARE random_records;

Bear in mind that both these commands require MySQLi driver in PHP if you want to execute them in one go. And their only difference is that the later one requires calling MySQLi's next_result method to retrieve all three results.

My personal belief is that this is the fastest way to do this.

Mehran
  • 15,593
  • 27
  • 122
  • 221
  • I tested on Linux box that gives a 0.10 to 0.20s. At the end it gives only one record. – jacouh Sep 22 '13 at 15:45
  • I don't know what was I thinking, sorry. I've updated my answer and I hope this one is the one that you were looking for. – Mehran Sep 24 '13 at 07:26
  • I tested your first solution, with WHERE clause added: It gets merely 0, 1, 2 records, 3 is rare. – jacouh Sep 24 '13 at 08:52
  • Did you also add the condition to first `SELECT COUNT(ID)` query? – Mehran Sep 24 '13 at 08:56
  • Adding a WHERE clause to SELECT COUNT(ID) will have no effect, as the random offsets of l1, l2, l3 are not warranted with a row meeting HasImages=1 condition. – jacouh Sep 24 '13 at 09:13
  • To eliminate any misunderstanding I'll update the answer to include `HasImages=1`. – Mehran Sep 24 '13 at 09:15
1

What about creating another table containing only items with image ? This table will be much lighter as it will contain only one-third of the items the original table has !

------------------------------------------
|ID     | Item ID (on the original table)|
------------------------------------------
|0      | 0                              |
------------------------------------------
|1      | 123                            |
------------------------------------------
            .
            .
            .
------------------------------------------
|10 000 | 30 000                         |
------------------------------------------

You can then generate three random IDs in the PHP part of the code and just fetch'em the from the database.

ahmed
  • 313
  • 3
  • 15
  • This would be a good idea if your MySQLd is installed in a very powerful computer, so that it can ignore the work needed to do a JOIN: SELECT Products.ID, Products.Name FROM Products INNER JOIN tblIdsHasImages ON Products.ID=tblIdsHasImages.ItemID; For a mean server this additional work is considerable, it may cause CPU stuck. Copy all fields Products to tblIdsHasImages would be a bad idea. – jacouh Sep 23 '13 at 06:59
  • @jacouh Really? You think that the amount of work that a database-engine does to `UNION` 3 queries together, is smaller than that which it takes `JOIN` a PK to a FK? I'm willing to bet otherwise. – cartbeforehorse Sep 24 '13 at 13:32
1

On the off-chance that you're willing to accept an 'outside the box' type of answer, I'm going to repeat what I said in some of the comments.

The best way to approach your problem is to cache your data in advance (be that in an external JSON or XML file, or in a separate database table, possibly even an in-memory table).

This way you can schedule your performance-hit on the products table to times when you know the server will be quiet, and reduce your worry about creating a performance hit at "random" times when the visitor arrives to your site.

I'm not going to suggest an explicit solution, because there are far too many possibilities on how to build a solution. However, the answer suggested by @ahmed is not silly. If you don't want to create a join in your query, then simply load more of the data that you require into the new table instead.

cartbeforehorse
  • 3,045
  • 1
  • 34
  • 49
  • Using query random selection opens possibility to add other conditions such as to JOIN to the origin countries tables of the products, religions related information table, etc... that can not be easily covered by relatively static cache method. – jacouh Sep 24 '13 at 15:04
  • @jacouh Again, that depends how you design your cache. It – cartbeforehorse Sep 24 '13 at 15:12