Bad performance with order by random field

Question

I've quite simple query

SELECT id,url,submitted FROM links where  submit_after<CURDATE() and submitted<4 order by ord limit 1438;

Table links has 10 000 000 rows and indexes on submitted,submit_after,ord. The ord field is changing very often because it's used as random field to get always random rows. The limit for query is also rand(1000,4000). Query takes 10 sec to execute

id  select_type table   type    possible_keys   key key_len ref rows    Extra   
1   SIMPLE  links   index   submitted,submit_after  ord 4   NULL    3849    Using where

Any suggestions how to get better performance?

It's a terible option using rand() see http://stackoverflow.com/questions/4329396/mysql-select-10-random-rows-from-600k-rows-fast — Alex, Aug 25 '15 at 13:05
consider creating a pre-calculated random number table with a sliding window of use, one-use only, primed/re-freshed by a separate process than the consumer of it — Drew, Aug 25 '15 at 13:09

score 3 · Answer 1 · answered Aug 25 '15 at 13:09

3

MySQL only can use ONE index per Query. You can see it in the EXPLAIN output. it will use "ord" index.

So create a index over all Feld in the right order (submit_after, submitted, ord) to speed up the Query

answered Aug 25 '15 at 13:09

Bernd Buffen

14,525
2
24
39

While I agree with the basic idea, I am not sure that the index is going to help that much due to the way that the values overlap. For example every row that matches the submit_after WHERE clause will need checking for submitted, as it is using < rather than = on the checks, so by the time it gets to using ORD the values are all over the place. If submitted has a limited number of values then it might well be better to have an index of submitted THEN submit_after – Kickstart Aug 25 '15 at 13:16
Now query takes 5 sec to execte is also svery slow. Explain shows "Using index condition; Using MRR; Using filesort" – john93230 Aug 25 '15 at 13:19
@Kickstart can You explain what you exatcly mean index on submitted? – john93230 Aug 25 '15 at 13:23
@john93230 - the field called submitted. As Bernd correctly points out MySQL will only use one index on a table in a query, but the index can cover multiple columns. Problem with your query is that it isn't specifically checking any values, rather checking that a value is less than a value. So the index is not going to narrow things down much. But if the value of submitted can be from (say) 0 to 1000, narrowing it down to less than 4 is probably going to limit the number of rows more effectively than checking against the current date first. – Kickstart Aug 25 '15 at 13:31
Checked your solution but is also doesn't speed much - still 4 sec to execute – john93230 Aug 25 '15 at 13:33

score 1 · Accepted Answer · answered Aug 25 '15 at 14:09

A major problem with this query is that indexes can't help much.

It is checking that any returned row has submitted < 4 . But one it has done that it doesn't narrow things down well for the next indexed field to check (ie submit_after < CURDATE() ).

This is because it is bring back a range of rows, but within that range it is checking for another independent range, so the 2nd part of the index is not that helpful.

Same if it does things the other way round, but probably worse (as there will be a lot more submit_after values each with their own range of values for submitted ).

As such the index covering multiple columns is probably not massively helping to reduce the number of records. Let alone having the 3rd column on the index to order the results.

This is probably why when presented with 3 separate indexes MySQL chose to use the index that allowed it to sort the results, and just checked the rows without an index for the WHERE clause.

I presume this query is for some kind of site link submit checking system, with links being valid to be checked after a certain date if they have not yet been checked. If so it might improve things if one of the values of submitted is specifically for rows that have submit_after < CURDATE() and are ready to be checked. This could then be updated by a separate query, and this SELECT could then just check for that value of submitted . An index covering both submitted and ord would then be useful.

score 0 · Answer 3 · answered Aug 25 '15 at 13:24

0

You better use something like

$id = rand(1,10000000); SELECT from TABLE WHERE id = $id

answered Aug 25 '15 at 13:24

Bernhard

1,852
11
19

**bold**The best idea – Bernd Buffen Aug 25 '15 at 13:29
Assuming that the random row id is one that is returned and that only a single row is required (rather than 1438 rows) – Kickstart Aug 25 '15 at 13:33
Thats not an impossible issue to solve imho - a rather easy issue, isnt it? – Bernhard Aug 25 '15 at 13:36
You can use WHERE id >= $id to ensure you get a row. The LIMIT is far harder to fix, but the original posts current code has that issue as well. – Kickstart Aug 25 '15 at 13:44

O. Jones · Answer 4 · 2015-08-26T11:12:38.347

You didn't say what performance you require. But you have a challenging query to satisfy.

Here's a suggestion for you. Do a so-called deferred join. As it is you're grabbing a lot of rows, ordering them, and then throwing out most of them. Let's order and throw out less stuff.

A subquery gets the id values you need to retrieve, like so.

         SELECT id
           FROM links 
          WHERE submit_after < CURDATE()
            AND submitted < 4
          ORDER BY ord 
          LIMIT 1438

This subquery can be optimized by a compound index, but (as others have pointed out) not perfectly.

Which of your WHERE criteria is more selective? Which cuts down the number of resulting rows more aggressively? If it's the CURDATE criterion, use this index: (submit_after, submitted, ord). If it's the submitted < 4 criterion, use this index. (submitted, submit_after, ord). These are called compound covering indexes.

Now, running that query gives you a list of ID values. All that has to be sorted is id and ord, a lesser workload than sorting the whole mess.

Edit. It's quite true that the number of rows to be sorted is the same in the subquery as in the main query. So, you might ask why this saves time. Simply because there are fewer, and shorter, columns, to sort in the subquery. Sorting has to manipulate a lot of data (EXPLAIN calls it "filesort"). Cutting down the volume of data to be manipulated can be very helpful.

To finish your job you use a JOIN like so:

SELECT a.id, a.url, a.submitted
  FROM links AS a
  JOIN (
         SELECT id
           FROM links 
          WHERE submit_after < CURDATE()
            AND submitted < 4
          ORDER BY ord 
          LIMIT 1438
       ) AS b ON a.id = b.id
 ORDER BY a.ord

This retrieves a relatively much smaller number of rows from your main table once you have selected the ids you want. You end up sorting far less data: the ORDER BY this LIMIT that pattern is wasteful because it discards a lot of already-sorted data.

Notice that you have to re-sort the main query. You may, upon experimenting, find that the main query is already in the right order without the second ORDER BY. But you cannot rely on that behavior of the MySQL server.

Am I missing something here? Your sub query is exactly the same as the original query that is taking too long, except for only returning the id. With the same overhead for the SELECT and also sorting on ord. I can see that there is less data in the set to be sorted, but it is still the same number of rows to be sorted. — Kickstart, Aug 26 '15 at 08:25
Please see my edit. This sort of deferred-join approach has actually worked for me, even if it's a little counterintuitive. — O. Jones, Aug 26 '15 at 11:13
Interesting. Would like to have a play with this on this query (although I suspect the big issue is the WHERE clause). — Kickstart, Aug 26 '15 at 11:23

Bad performance with order by random field

4 Answers4