0

I'm not a SQL expert, so am trying to understand why two queries have substantially different execution times.

common_stats is a large table (3 million rows). I'm just trying to understand why the huge difference in performance.

The below query takes ~15 seconds:

select distinct cs.basesalary_id
from common_stats AS cs
LEFT JOIN basesalary AS b ON b.id = cs.basesalary_id
WHERE (b.employee_id= 100 AND cs.amount > 0 AND cs.pay_id is null );

This query takes ~.1 seconds:

select distinct basesalary_id from (
    select cs.basesalary_id, cs.pay_id
    from common_stats AS cs
    LEFT JOIN basesalary AS b ON b.id = cs.basesalary_id
    WHERE (b.employee_id= 100 AND cs.amount > 0)
) as temp
where pay_id is null;
Phylogenesis
  • 7,775
  • 19
  • 27
cdm
  • 719
  • 2
  • 10
  • 22
  • 1
    Have you tried putting `explain` in front of the query and comparing the execution plans? – Phylogenesis Oct 07 '15 at 14:53
  • @e4c5 But this example is a subquery significantly outperforming a join. – Phylogenesis Oct 07 '15 at 15:05
  • Bit of a duplicate question: http://stackoverflow.com/questions/2577174/join-vs-sub-query – Dwight Spencer Oct 07 '15 at 15:05
  • @Phylogenesis oh sorry I misread that one! – e4c5 Oct 07 '15 at 15:09
  • Always use `EXPLAIN` (and do post its output here). Always ensure you're not pulling from query cache by issuing `RESET QUERY CACHE;` if you have two queries that operate on the same data. Another very, very important thing to bear in mind - you can optimize SQL to infinity and beyond but hardware plays a hugely important role. It's important to know which engine you're using and whether it uses your hardware optimally. – Mjh Oct 07 '15 at 15:21
  • Could you provide sqlfiddle please? – Alex Oct 07 '15 at 15:22

1 Answers1

2

As a general rule:

  • If the subquery 'shrinks' the number of rows (say, via GROUP BY or LIMIT), the subquery approach is better.
  • JOIN often creates more rows than the original tables. DISTINCT is sometime used to deflate that row count.
  • If the outer query has a GROUP BY, JOIN may create more rows than you realize and "inflate" any aggregates (SUM, COUNT, etc), thereby voting for subquery instead.
  • More than one subquery would lead to poor optimization. (Thereby voting for JOIN.)

All of those assume optimal indexes.

Your first query would probably benefit from

INDEX(pay_id, amount, basesalary_id) -- both "covering" and optimal for `WHERE`

Check out EXPLAIN SELECT ... for both queries. Probably the faster one starts with basesalary and there is INDEX(employee_id) and that is very selective.

I may have more comments after seeing SHOW CREATE TABLE.

Rick James
  • 135,179
  • 13
  • 127
  • 222
  • When using a subquery, the explain shows the following sequence: 1 - PRIMARY - NO KEYS - 667154 rows - Extra: "Using where" 2 - DERIVED - NO KEYS - 667154 rows - Extra: none 3 - DERIVED - PRIMARY KEY - 1 row - Extra: none When using a join, explain shows this sequence: 1 - SIMPLE - NO KEYS - 667171 rows - Extra: none 2 - SIMPLE - PRIMARY KEY - 1 row - Extra: using where – cdm Nov 12 '15 at 18:04
  • I'm sorry; that is too hard to read; please edit your Question to add it. – Rick James Apr 21 '21 at 23:11