0

Lots of thread already on web, just trying to understand some nuances which had me confused!

Quoting the doc reference

If you combine LIMIT row_count with ORDER BY, MySQL stops sorting as soon as it has found the first row_count rows of the sorted result, rather than sorting the entire result. If ordering is done by using an index, this is very fast.

and a SO thread

It will order first, then get the first 20. A database will also process anything in the WHERE clause before ORDER BY.

Taking the same query from the question :

SELECT article
FROM table1
ORDER BY publish_date
LIMIT 20

lets say table has 2000 rows, of which query is expected to return 20 rows, now, looking at mysql ref ....stops sorting as soon as it has found the first row_count rows.... confuses me as i find it little ambiguous!!

Why does it say stops sorting? isn't the limit clause being applied on an already sorted data returned via order by clause ( assuming its a non-indexed column ) or is my understanding wrong and SQL is limiting first and then sorting!!??

NoobEditor
  • 15,563
  • 19
  • 81
  • 112

2 Answers2

1

The optimization mentioned in the documentation generally only works if there's an index on the publish_date column. The values are stored in the index in order, so the engine simply iterates through the index of the column, fetching the associated rows, until it has fetched 20 rows.

If the column isn't indexed, the engine will generally need to fetch all the rows, sort them, and then return the first 20 of these.

It's also useful to understand how this interacts with WHERE conditions. Suppose the query is:

SELECT article
FROM table1
WHERE last_read_date > '2018-11-01'
ORDER BY publish_date
LIMIT 20

If publish_date is indexed and last_read_date is not, it will scan the publish_date index in order, test the associated last_read_date against the condition, and add article to the result set if the test succeeds. When there are 20 rows in the result set it will stop and return it.

If last_read_date is indexed and publish_date is not, it will use the last_read_date index to find the subset of all the rows that meet the condition. It will then sort these rows using the publish_date column, and return the first 20 rows from that.

If neither column is indexed it will do a full table scan to test last_read_date, sort all the rows that match the condition, and return the first 20 rows of this.

Barmar
  • 741,623
  • 53
  • 500
  • 612
  • Hi Barmar, thank you for the explanation. Couple of doubts : **1**. Depending on the `index`ed column, `mysql` might actually end up returning me different set of rows in all 3 cases you outlined? **2**. `If publish_date is indexed and last_read_date is not` -> since default `order by` is in `ASC` this query will not actually return me the oldest `article`, as filter criteria is on `order by` and not `where` clause? ( _meaning oldest `publish date` combined with there `last read` is returned in this case_!!! ) – NoobEditor Dec 09 '18 at 03:41
  • 1
    If you have multiple rows with the same publish_date, such that the 20 row cut off point ends in a group of rows with the same publish date, you might get different results from your queries. For example, say you publish 6 a day and they are all read. You'll always get the same 18 from the first 3 days, but you'll only get 2 from the fourth day, and which 2 you get may differ depending on which query variant you use. If you had a different publish_date on each row, you'll always get the same 20, no matter which index there is. The index changes how mysql gets the data, not what you get. – DancingFool Dec 10 '18 at 00:33
  • 1
    @NoobEditor Unless there are duplicate `publish_date` values, you should get the same results. These are just performance optimizations, they don't change the meaning of the query. – Barmar Dec 10 '18 at 04:51
  • 1
    if there are duplicates, and more than 20 rows match the criteria, it's unpredictable which 20 it will select, and different optimizations may result in it selecting different onces. – Barmar Dec 10 '18 at 04:53
1

MySQL stops sorting as soon as it has found the first row_count rows of the sorted result, rather than sorting the entire result

This is actually a very sensible optimisation within mysql. If you use limit to return 20 rows and mysql knows it already found them, then why would mysql (or you) care how exactly the rest of the records are sorted? It does not matter, therefore mysql stops sorting the rest of the rows.

If the order by is done on an indexed column, then mysql can tell pretty quickly, if it found the top n records.

Shadow
  • 33,525
  • 10
  • 51
  • 64