1

I wanted to find all hourly records that have a successor in a ~5m row table.

I tried :

SELECT DISTINCT (date_time)
FROM my_table
JOIN (SELECT DISTINCT (DATE_ADD( date_time, INTERVAL 1 HOUR)) date_offset
      FROM my_table) offset_dates
ON date_time = date_offset

and

SELECT DISTINCT(date_time)
FROM my_table
WHERE date_time IN (SELECT DISTINCT(DATE_ADD(date_time, INTERVAL 1 HOUR))
                    FROM my_table)

The first one completes in a few seconds, the seconds hangs for hours. I can understand that the sooner is better but why such a huge performance gap?

-------- EDIT ---------------

Here are the EXPLAIN for both queries

id  select_type table       type    possible_keys   key     key_len ref                         rows    Extra
1   PRIMARY     <derived2>  ALL     NULL            NULL    NULL    NULL                        1710    Using temporary
1   PRIMARY     my_table    ref     PRIMARY         PRIMARY 8       offset_dates.date_offset    555     Using index
2   DERIVED     my_table    index   NULL            PRIMARY 13      NULL                        5644204 Using index; Using temporary


id  select_type        table    type    possible_keys   key     key_len ref     rows    Extra
1   PRIMARY            my_table range   NULL            PRIMARY 8       NULL    9244    Using where; Using index for group-by
2   DEPENDENT SUBQUERY my_table index   NULL            PRIMARY 13      NULL    5129983 Using where; Using index; Using temporary
Mad Echet
  • 3,723
  • 7
  • 28
  • 44
  • I might be wrong, but it is the subquery that is very heavy. So if you execute it with a `JOIN` and `ON` clause, it would be faster because it does not retreive every result of the subselect. In the second query, you first retreive all results of the subselect and then look in it. – Brewal Aug 06 '13 at 16:10

4 Answers4

2

I would prefix both queries by explain, and then compare the difference in the access plans. You will probably find that the first query looks at far fewer rows than the second.

But my hunch is that the JOIN is applied more immediately than the WHERE clause. So, in the WHERE clause you are getting every record from my_table, applying an arithmetic function, and then sorting them because select distinct usually requires a sort and sometimes it creates a temporary table in memory or on disk. The # of rows examined is probably the product of the size of each table.

But in the JOIN clause, a lot of the rows that are being examined and sorted in the WHERE clause are probably eliminated beforehand. You probably end up looking at far fewer rows... and the database probably takes easier measures to accomplish it.

But I think this post answers your question best: SQL fixed-value IN() vs. INNER JOIN performance

Community
  • 1
  • 1
ktm5124
  • 11,861
  • 21
  • 74
  • 119
  • Thanks, I added `EXPLAIN` in my question. I am not fluent in this kind of query plan but obviously they are completely different. – Mad Echet Aug 06 '13 at 16:23
2

In general, a query using a join will perform better than an equivalent query using IN (...), because the former can take advantage of indexes while the latter can't; the entire IN list must be scanned for each row which might be returned.

(Do note that some database engines perform better than others in this case; for example, SQL Server can produce equivalent performance for both types of queries.)

You can see what the MySQL query optimizer intends to do with a given SELECT query by prepending EXPLAIN to the query and running it. This will give you, among other things, a count of rows the engine will have to examine for each step in a query; multiply these counts to get the overall number of rows the engine will have to visit, which can serve as a rough estimate of likely performance.

Aaron Miller
  • 3,692
  • 1
  • 19
  • 26
1

'IN' clause is usually slow for huge tables. As far as I remember, for the second statement you printed out - it will simply loop through all rows of my_table (unless you have index there) checking each row for match of WHERE clause. In general IN is treated as a set of OR clauses with all set elements in it. That's why, I think, using temporary tables that are created in background of JOIN query is faster.

Here are some helpful links about that:

MySQL Query IN() Clause Slow on Indexed Column

inner join and where in() clause performance?

http://explainextended.com/2009/08/18/passing-parameters-in-mysql-in-list-vs-temporary-table/

Community
  • 1
  • 1
Anton M
  • 100
  • 6
  • Thanks, all things point towards favoring `JOIN` but I am still puzzled with the magnitude of the difference. – Mad Echet Aug 06 '13 at 16:48
  • 1
    Taking into consideration Aaron Miller's answer for my one - IN clause doesn't use index at all and as I said simply goes looping through table checking each row on matching the set of values of IN cluase; while JOIN builds temp table first (and you're using index for this JOIN, right?) and afterwards gets rows. – Anton M Aug 06 '13 at 17:00
  • 1
    This way, iterating through 5 million rows will take a lot of time (well, actually depends on some DB set up and environment). Hope that helps. – Anton M Aug 06 '13 at 17:01
1

Another things to consider is that with your IN style, very little future optimization is possible compared to the JOIN. With the join you can possibly add an index, which, who knows, it depends on the data set, it might speed things up by a 2, 5, 10 times. With the IN, it's going to run that query.

MotoRidingMelon
  • 2,347
  • 2
  • 21
  • 28