-1

Im not good in sql, help me please understand this situation. We have mysql 5.6 and table with ~30k rows. I need to find record with some conditions and max date. This is a query: (cant paste the original query cause of nda)

    SELECT * FROM records r1 
    WHERE column1='X' AND column2='Y' AND
    column3 = (SELECT MAX(column3) FROM records r2 
               WHERE r2.column1=r1.column1 AND r2.column2=r1.column2 AND r2.column3 <= '<some_date>');

Duration: 14.094 sec
Explain:

select_type table type possible_keys key key_len ref rows extra
PRIMARY r1 all 30k using where
DEPENDENT SUBQUERY r2 all 30k using where

I guess it is so called correlative subquery

And there is another one using derived table:

    SELECT * FROM records r1 
    WHERE column1='X' AND column2='Y' AND
    column3 = (SELECT MAX(column3) FROM 
                  (SELECT * FROM records 
                   WHERE column1='X' AND column2='Y' AND column3 <= '<some_date>') r2);    

Duration: 0.047 sec
Explain:

select_type table type possible_keys key key_len ref rows extra
PRIMARY r1 all 30k using where
SUBQUERY derived all 30k
DEPENDENT SUBQUERY r2 all 30k using where

Both queries return same result, why is there such big difference in performance? Does it mysql 5.6 thing? I dont see any clue in explain, maybe i just dont understand it well.

Nik
  • 3
  • 2
  • If you only ran the queries once, it's possible this is simply disk caching. The first query had to load the data from the disk and cache it, and the second one used the very fast disk cache. Run each query a few times to be sure. More importantly: you appear to have no indexes, that's going to be the real performance problem. See [Use The Index, Luke](https://use-the-index-luke.com/) for more. – Schwern Oct 14 '22 at 18:04
  • Note that [MySQL 5.6 reached the end of its life last year](https://dev.mysql.com/doc/relnotes/mysql/5.6/en/news-5-6-51.html). Consider upgrading to at least MySQL 5.7, but MySQL 8 would be better as it introduces a lot of missing features. – Schwern Oct 14 '22 at 18:15
  • Where are complete CREATE TABLE scripts? – Akina Oct 14 '22 at 18:19
  • PS. Your queries are not equivalent from the formal looking point. The optimizer is not so smart for to understand that these queries performs the same action (moreover, it must take into account that dirty reads may be allowed now). – Akina Oct 14 '22 at 18:22
  • *why is there such big difference in performance?* It seems that your tables have no any indices except PK maybe.. – Akina Oct 14 '22 at 18:24
  • @Schwern Thanks for replying. Actually, there is a part of more complex query, translated from hibernate. I've just tried to simplify it. And it seems that cache isn't the reason, i ran it many times. About indexes - this is not relevant for these columns, but indexes are used on more significant columns. – Nik Oct 14 '22 at 18:33
  • Here's a guess (if it doesn't turn out to be just disk caching): the first subquery depends on each row. It has to run the subquery for each row. ***Because you lack indexes***, for each row it has to scan the whole table again. It is [O(n^2)](https://stackoverflow.com/a/107197/14660). For your 30,000 rows it must scan them 900 million times. The second subquery has fixed values. MySQL only has to run the subquery once and cache the value. It only has to scan the 30,000 rows twice (once for the subquery, once for the main query) making it O(1). – Schwern Oct 14 '22 at 18:33
  • @Akina thanks for replying, actually table has 13 columns and 3 of it are indexes. – Nik Oct 14 '22 at 18:41
  • @Nik As your explain shows, none of those indexes are relevant to this query. If you want this query to perform well at any sort of scale, you'll need to apply indexes to one or all of the relevant columns. – Schwern Oct 14 '22 at 18:44
  • @Schwern yes. do you think second query is not a good idea for optimizing performance and its better try to add additional indices? upd: ok, see the answer in your edited comment, thanks – Nik Oct 14 '22 at 18:50
  • *actually table has 13 columns and 3 of it are indexes.* Your words tells nothing. Show complete CREATE TABLE scripts. – Akina Oct 14 '22 at 18:53
  • @Schwern but if these three columns used together only in one place, are they worth indices? There is no problems with another queries at this table – Nik Oct 14 '22 at 19:12
  • @Nik Well, apparently you're having a problem with this query. ;) I can't say without seeing your other queries and your ***full table schema***. What I can say is table scans will become an increasing performance problem as the table grows. And (if my guess is correct) part of the reason the first query is so slow is because it has to do 30,000 table scans. – Schwern Oct 14 '22 at 19:23
  • @Schwern But the second query also does full table scan, and its explain looks worse, am I wrong? – Nik Oct 14 '22 at 19:40
  • Apart from indices. What is the difference? First query has correlated subquery and based on already filtered r1 table, am i right? And second query has derived table. Is it right to say that correlated subquery is slower than derived table, at least in mysql 5.6, or i'm completely wrong? – Nik Oct 14 '22 at 19:50
  • @Nik Running an [optimizer trace](https://bobcares.com/blog/mysql-optimizer-trace/) would shine light on what's going on. Also see https://dev.mysql.com/doc/dev/mysql-server/latest/PAGE_OPT_TRACE.html – Schwern Oct 14 '22 at 23:17
  • Were you hoping for one row of output or several? – Rick James Oct 15 '22 at 19:41
  • "is a part of more complex query" -- If you want to fix _that_, we need to see _it_, plus `SHOW CREATE TABLE`. (I don't think you 'simplified' it correctly.) – Rick James Oct 15 '22 at 19:48

1 Answers1

0

First, and most important, none of the columns are indexed. Each search of the table will result in at least one table scan. And the subqueries might have to scan the table for each match. That's bad. A composite index (column1, column2) would probably help immensely.

Second, MySQL 5.6 was end-of-lifed in 2021. Consider upgrading to at least MySQL 5.7 and probably MySQL 8.0 which adds many important features and may improve the query optimizer. In particular, MySQL 5.6 and 5.7 both say...

For certain cases, a correlated subquery is optimized... Otherwise, they are inefficient and likely to be slow. Rewriting the query as a join might improve performance.

Whereas MySQL 8 no longer has that warning instead saying...

Beginning with MySQL 8.0.24, the optimizer can transform a correlated scalar subquery to a derived table when the subquery_to_derived flag of the optimizer_switch variable is enabled.

I'd suggest trying your query on MySQL 8.0.


An optimizer trace would provide more information, but here's what I think is happening.

SELECT *
FROM records r1 
WHERE column1='X'
  AND column2='Y'
  AND column3 = (
    SELECT MAX(column3)
    FROM (
      SELECT *
      FROM records 
      WHERE column1='X'
        AND column2='Y'
        AND column3 <= '<some_date>'
    ) r2
  ); 

Note that the subquery where clause uses all hard coded values. That means it only needs to be run once and the value can be cached. That's a pretty simple optimization that the optimizer could be doing. We can think of it like so:

-- One table scan
SELECT @subquery:=MAX(column3)
FROM (
  SELECT *
  FROM records 
  WHERE column1='X'
    AND column2='Y'
    AND column3 <= '<some_date>'
  ) r2
);

-- One table scan
SELECT *
FROM records r1 
WHERE column1='X'
  AND column2='Y'
  AND column3 = @subquery;

With 30,000 rows, two table scans is just 60,000 rows.

That's one possibility. An optimizer trace would tell the whole story.


Now let's look at the first query.

SELECT *
FROM records r1 
WHERE column1='X'
  AND column2='Y'
  AND column3 = (
    SELECT MAX(column3)
    FROM records r2 
    WHERE r2.column1=r1.column1
      AND r2.column2=r1.column2
      AND r2.column3 <= '<some_date>'
    );

Again, because of the lack of indexes the outer query has to do a full table scan. The subquery must also do a full table scan, but because it is correlated it must do so for each matching row.

Let's assume the optimizer is smart and it checks that column1 and column2 match before checking column3. It must run the subquery for each matching row. Each run is a full table scan. Let's say there are 300 rows with matching column1 and column2. That's 301 table scans or 9+ million rows. That would explain the performance difference.

The optimizer could be smarter and only run the subquery once on the subset of rows which match on column1 and column2, but it doesn't seem to be doing that. MySQL 8 might be able to better optimize the query.


If you don't want to repeat the where conditions in the subquery, try a self-join.

SELECT *
FROM records r1 
JOIN (
  SELECT column1, column2, MAX(column3) max3
  FROM records
  WHERE column3 <= '<some_date>'
  GROUP BY column1, column2
) r2 ON
  r1.column1 = r2.column1 AND
  r1.column2 = r2.column2 AND
  r1.column3 = r2.max3
Schwern
  • 153,029
  • 25
  • 195
  • 336
  • @Nik Did the self-join perform well? – Schwern Oct 15 '22 at 19:18
  • `INDEX(column1, column2, column3)` will help even more. – Rick James Oct 15 '22 at 19:41
  • @RickJames 3 column indexes are often overkill and bloat index sizes. It depends on how the data is distributed. If `column1='X' and column2='Y'` filter down to a handful of rows there's no need to also index column3. – Schwern Oct 15 '22 at 21:44
  • @Schwern - I think that, in this case, it would provide a huge benefit -- simply locate the "last" row to get the `MAX`. – Rick James Oct 15 '22 at 22:57
  • @RickJames If the constraints on column1 and column2 reduce the set down to a few hundred rows, there's likely to be little difference between sorting that small set and using the indexes. Indexes have overhead, and for small data can be slower. Also, the 3rd column may slow down using the index for just 2 columns; the extra layer can make walking the tree slower. – Schwern Oct 16 '22 at 18:24
  • @RickJames Put another way, O(nlogn) can beat O(logn) for small data sets. And optimizing a small data set is unlikely to have a large effect. If a 2 column index brings it from 10000ms to 10ms, the most you can shave off now is 10ms. – Schwern Oct 16 '22 at 18:36
  • I do agree that _some_ distributions of data will lead to the 2-column index being slightly better. However (in this example), the index would also be "covering" -- this eliminates the bouncing between index BTree and the data BTree. Furthermore, if col3 is really a `DATE`, that is only 3 extra bytes (plus overhead). – Rick James Oct 16 '22 at 18:46
  • 1
    @Schwern Hi! Yes, self-join performs excellent! – Nik Oct 31 '22 at 08:40