Optimize a JOIN ORDER BY RAND() mysql query in a large database

Question

I am working on a project which has a large Question Bank, and for Tests added to the System, 20 questions are fetched on Run-Time dynamically based on the following query:

SELECT Question.* from Question JOIN Test 
ON Question.Subject_ID = Test.Subject_ID 
AND Question.Question_Level = Test.Test_Level 
ORDER BY RAND() 
LIMIT 20;

However, as it is known that the RAND() function the MySQL kills your server I have been looking for better solutions.

Result of EXPLAIN [above query]:

+----+-------------+----------+------+---------------+------+---------+------+------+----------------------------------------------------+
| id | select_type | table    | type | possible_keys | key  | key_len | ref  | rows | Extra                                              |
+----+-------------+----------+------+---------------+------+---------+------+------+----------------------------------------------------+
|  1 | SIMPLE      | Test     | ALL  | NULL          | NULL | NULL    | NULL |    5 | Using temporary; Using filesort                    |
|  1 | SIMPLE      | Question | ALL  | NULL          | NULL | NULL    | NULL |    7 | Using where; Using join buffer (Block Nested Loop) |
+----+-------------+----------+------+---------------+------+---------+------+------+----------------------------------------------------+

Result of EXPLAIN Question:

+-------------------+------------------------------------------+------+-----+---------+----------------+
| Field             | Type                                     | Null | Key | Default | Extra          |
+-------------------+------------------------------------------+------+-----+---------+----------------+
| Question_ID       | int(11)                                  | NO   | PRI | NULL    | auto_increment |
| Questions         | varchar(100)                             | NO   |     | NULL    |                |
| Available_Options | varchar(200)                             | NO   |     | NULL    |                |
| Correct_Answer    | varchar(50)                              | NO   |     | NULL    |                |
| Subject_ID        | int(11)                                  | NO   |     | NULL    |                |
| Question_Level    | enum('Beginner','Intermediate','Expert') | NO   |     | NULL    |                |
| Created_By        | int(11)                                  | NO   |     | NULL    |                |
+-------------------+------------------------------------------+------+-----+---------+----------------+

Result of EXPLAIN Test:

+----------------+------------------------------------------+------+-----+---------+----------------+
| Field          | Type                                     | Null | Key | Default | Extra          |
+----------------+------------------------------------------+------+-----+---------+----------------+
| Test_ID        | int(11)                                  | NO   | PRI | NULL    | auto_increment |
| Test_Name      | varchar(50)                              | NO   |     | NULL    |                |
| Test_Level     | enum('Beginner','Intermediate','Expert') | NO   |     | NULL    |                |
| Subject_ID     | int(11)                                  | NO   |     | NULL    |                |
| Question_Count | int(11)                                  | NO   |     | NULL    |                |
| Created_By     | int(11)                                  | NO   |     | NULL    |                |
+----------------+------------------------------------------+------+-----+---------+----------------+

Any help would be appreciated to optimize the query to reduce server load and execution time.

P.S. The system has the capability of Deletion too so the AUTO_INCREMENT PRIMARY KEY of the QUESTION and TEST table can have large gaps.

Do you have indexes on Question.Subject_ID, Test.Subject_ID, Question.Question_Level, Test.Test_Level? — Nick, Aug 20 '16 at 14:28
No, I'm doing join on non-key Columns, as thats what the system demands. — Ayush Gupta, Aug 20 '16 at 14:31
If I could I would, but there is no Index association between the 2 tables. The same question can be used by 2 tests if they have the Same subject and level — Ayush Gupta, Aug 20 '16 at 15:19

score 2 · Answer 1 · answered Aug 21 '16 at 01:19

I like this question. It's a very good optimization puzzle, and let's assume for the moment that performance is very important for this query, and that you cannot use any dynamically inserted values (e.g. from PHP).

One high performance solution would be to add column with random values (say called "Rand"), order the table by this value, and periodically regenerate and re-order the table. You could then use a query like this one:

SELECT Question.* from Question 
JOIN Test 
ON Question.Subject_ID = Test.Subject_ID 
AND Question.Question_Level = Test.Test_Level  
WHERE Question.Rand > RAND() 
LIMIT 20

This would perform at O(n), requiring only one scan of the table, but it would come with the risk of returning fewer than 20 results if a value very close to 1 was generated. If this was an acceptable risk (e.g. you could programmatically check for an inadequate result and re-query), you would end up with nice runtime performance.

The periodic re-generating and re-ordering of the numbers is necessary because rows early in the table with high Rand values would be favored and show up disproportionately frequently in the results. (Imagine if the first row was lucky enough to receive a Rand value of .95)

Even better would be to create a column with contiguous integers, index on this column, and then randomly choose an insertion point to grab 20 results. Such a query might look like this:

SELECT Question.* from Question 
JOIN Test 
ON Question.Subject_ID = Test.Subject_ID 
AND Question.Question_Level = Test.Test_Level  
CROSS JOIN (SELECT MAX(Rand_id) AS max_id FROM Question)
WHERE Question.Rand_Id > ROUND(RAND() * max_id)
LIMIT 20

But what if you can't alter your table in any way? If it doesn't matter how messy your SQL is, and there is a relatively low proportion of missing ids (say roughly 1/10th). You could achieve your 20 random questions with a good degree of probability with the following SQL:

SELECT Question.* from Question JOIN Test 
  ON Question.Subject_ID = Test.Subject_ID 
  AND Question.Question_Level = Test.Test_Level 
  WHERE Question.Question_ID IN (
    SELECT DISTINCT(ROUND(rand * max_id)) AS rand_id 
    FROM ( --generate 30 random numbers to make sure we get 20 results
      SELECT RAND() AS rand UNION ALL
      SELECT RAND() AS rand UNION ALL
      SELECT RAND() AS rand UNION ALL
      SELECT RAND() AS rand UNION ALL
      ...
      SELECT RAND() AS rand UNION ALL
      SELECT RAND() AS rand UNION ALL
      SELECT RAND() AS rand
    ) a 
    CROSS JOIN ( --get the max possible id from the Question table
      SELECT MAX(id) AS max_id FROM Question
    ) b
  )
LIMIT 20 --finally pare our results down to 20 in case we got too many

However, this will cause problems in your use case, because you effectively can't know how many results (and their IDs) will be in the result set after the join. After joining on subject and difficulty, the proportion of missing IDs might be very high and you might end up with far fewer than 20 results, even with several hundred random guesses of what IDs might be in a table.

If you're able to use logic from PHP (sounds like you are), a lot of high performance solutions open up. You could, for example, create in PHP an object whose job it was to store arrays of all the IDs of Questions with a particular subject and difficulty level. You could then pick 20 random array indexes and get back 20 valid IDs, allowing you to run a very simple query.

SELECT Question.* from Question WHERE Question_ID IN ($dynamically_inserted_ids)

Anyway, I hope this gets your imagination going with some possibilities.

Linesofcode · Accepted Answer · 2016-08-20T22:23:22.280

1

Why don't you rand the numbers in PHP and then select the questions by id? Here's the logic of my point:

$MIN       = 1;
$MAX       = 50000; // You may want to get the MAX from your database
$questions = '';

for($i = 0; $i < 20; $i++)
   $questions .= mt_rand($MIN, $MAX) . ',';

// Removes last comma
$questions = rtrim($questions, ',');

$query = "SELECT * FROM Question WHERE Question.id IN ($questions)";

Edit 1:

I was thinking about the problem, and it ocurred me that you can select all the ID's from your db and then pick 20 items using the array_rand() function.

$values    = array(1, 5, 10000, 102021, 1000000); // Your database ID's
$questions = array_rand($values, 20);

$questions[0];
$questions[1];
$questions[2]; // etc

edited Aug 20 '16 at 22:23

answered Aug 20 '16 at 14:27

Linesofcode

5,327
13
62
116

1

Because, as I mentioned, my question ID WILL have gaps. So if PHP picked the ID 2, It might not even be in my database. – Ayush Gupta Aug 20 '16 at 14:29
What about returning your question id's into a simple array and then verify in the loop if the value exists within the array? – Linesofcode Aug 20 '16 at 14:33
Because, then I will have to run the `SELECT Question_ID from Question` query on a DB with over a million records. – Ayush Gupta Aug 20 '16 at 14:35
Not to mention the increase in execution time of the `in_array()` function in PHP, because i an array of 1 MILLION Question IDs, for 20 questions it will have to traverse the array 20*1M = 20 million times – Ayush Gupta Aug 20 '16 at 14:37
@AyushGupta a select on a million record DB is no problem if you index it. – wtm Aug 20 '16 at 17:28
@wtm yeah but the problem is the `in_array()` function. It has to verify 20 times within 1M records if the ID exists or not.. – Linesofcode Aug 20 '16 at 22:16

score 1 · Answer 3 · answered Aug 20 '16 at 17:25

Create the following indexes:

CREATE INDEX Question_Subject_ID_idx ON Question (Subject_ID);
CREATE INDEX Test_Subject_ID_idx ON Test (Subject_ID);
CREATE INDEX Question_Question_Level_idx ON Question (Question_Level);
CREATE INDEX Test_Test_Level_idx ON Test (Test_Level);

score 0 · Answer 4 · edited May 23 '17 at 12:23

I investigated on the same issue a while ago and my first approach was to load all IDs first, pick random ones in PHP (see: Efficiently pick n random elements from PHP array (without shuffle)) then query for these IDs directly in MySQL.

This was an improvement but memory-consuming for large data sets. On further investigation I found a better way: Pick random IDs in one query without any other fields or JOINs, then do your real query by these IDs:

SELECT Question.* from Question JOIN Test 
ON Question.Subject_ID = Test.Subject_ID 
AND Question.Question_Level = Test.Test_Level 
WHERE Question_ID IN (
    SELECT Question_ID from Question
    ORDER BY RAND() 
    LIMIT 20
);

Here's a blog post with benchmarks for my concrete case: Show random products in Magento.

Relevant parts:

Besides the memory issues, could it be that ORDER BY RAND() by itself is not the problem, but using it together with all the table joins of Magento? What if I preselect the random IDs with ORDER BY RAND()?

[...]

It was slightly slower than the PHP preselect approach, but still clearly in favor of the pure order by rand and without the increased memory usage in PHP.

[...]

The problem of the pure MySQL approach with ORDER BY RAND() became even more evident. While monitoring MySQL with mytop I noticed that besides for sorting, lots of time is spent for copying. The problem here seems to be, that sorting without an index, as with ORDER BY RAND() copies the data to a temporary table and orders that. With the flat index, all product attributes are fetched from a single table, which increases the amount of data copied to and from the temporary table for sorting. I might be missing something else here, but the performance dropped from bad to horrible, and it even caused my Vagrantbox to crash at first try because its disk got full (40 GB). So while PHP uses less memory with this approach, MySQL is all the more resource hungry.

I don't know how big your questions table is, at some point this approach is still flawed:

Second, as stated above, for big catalogs you should look for something different. The problem with ORDER BY RAND() is that even though we minimized the data to be copied, it still copies all rows to a temporary table and generates a random number for each. The sorting itself is optimized to not sort all rows (See LIMIT Optimization), but copying takes its time.

There is another famous blog post on selecting random rows in MySQL written by Jan Kneschke. He suggests using an index table with all ids, that has its own primary key without gaps. This index table would be updated automatically with triggers, and random rows can be selected by the index table, using random keys between min(key) and max(key).

If you don't use any additional conditions and query random entries from all questions this should work for you.

Optimize a JOIN ORDER BY RAND() mysql query in a large database

4 Answers4

Relevant parts: