3

I have two tables table1 and table2 their definitions are:

CREATE `table1` (
    'table1_id' int(11) NOT NULL AUTO_INCREMENT,
    'table1_name' VARCHAR(256),
     PRIMARY KEY ('table1_id')
)

CREATE `table2` (
    'table2_id' int(11) NOT NULL AUTO_INCREMENT,
    'table1_id' int(11) NOT NULL,
    'table1_name' VARCHAR(256),
     PRIMARY KEY ('table2_id'),
     FOREIGN KEY ('table1_id') REFERENCES 'table1' ('table1_id')
)

I want to know the number of rows in table1 that are NOT referenced in table2, that can be done with:

SELECT COUNT(t1.table1_id) FROM table1 t1 
WHERE t1.table1_id NOT IN (SELECT t2.table1_id FROM table2 t2)

Is there a more efficient way of performing this query?

A.O.
  • 3,733
  • 6
  • 30
  • 49

2 Answers2

3

try using EXISTS.. its generally more efficient than IN

SELECT COUNT(t1.table1_id) 
FROM table1 t1 
WHERE EXISTS
(   SELECT 1 
    FROM table2 t2
    WHERE t2.table1_id <=> t1.table1_id
)

you can do it with NOT EXISTS as well

SELECT COUNT(t1.table1_id) 
FROM table1 t1 
WHERE NOT EXISTS
(   SELECT 1 
    FROM table2 t2
    WHERE t2.table1_id = t1.table1_id
)

EXISTS is generally faster because the execution plan is once it finds a hit, it will quit searching since the condition has proved true. The problem with IN is it will collect all the results from the subquery before further processing... and that takes longer

As @billkarwin noted in the comments EXISTS is using a dependent subquery.. Here is the explain on my two queries and also the OP's query.. http://sqlfiddle.com/#!2/53199d/5

John Ruddell
  • 25,283
  • 6
  • 57
  • 86
  • 2
    Your example shows a *correlated* subquery, which will be executed for each distinct value in the outer query. Try it with EXPLAIN, it shows DEPENDENT SUBQUERY. – Bill Karwin Sep 09 '14 at 23:05
  • @BillKarwin hmm interesting.. so how is `EXISTS` faster than `IN()`? – John Ruddell Sep 09 '14 at 23:08
  • 2
    @BillKarwin Only in mysql. Normal brands of SQL handle `NOT EXISTS` as expected. – wildplasser Sep 09 '14 at 23:50
  • 1
    The OP's question was tagged [tag:mysql] so I assume they want to know about MySQL behavior. Besides, can you name any brand of SQL database that can optimize out a correlated subquery? I'm not arguing against the possibility, I've just never encountered one. – Bill Karwin Sep 10 '14 at 00:00
  • @BillKarwin I just posted the explain for all three queries... I have a few questions as I'm not extreemely familiar with this.. how is WHERE NOT EXISTS not have Using where in the extras? and how are the differences between exists and IN making it faster? should I post it as a question? – John Ruddell Sep 10 '14 at 01:20
  • 3
    @BillKarwin `exists` and `in` get exactly the same plan in Sql server with a semi join operator. Similarly with `not exists` and `not in` get an anti semi join operator (though not in can get additional baggage if either column is nullable). Generally SQL Server [can de-correlate sub queries in a number of cases](http://blogs.msdn.com/b/craigfr/archive/2006/10/04/decorrelating-subqueries.aspx) – Martin Smith Sep 14 '14 at 21:00
  • @MartinSmith, thanks for the confirmation and the link! – Bill Karwin Sep 15 '14 at 00:11
3

Upgrade to MySQL 5.6, which optimizes semi-joins against subqueries better.

See http://dev.mysql.com/doc/refman/5.6/en/subquery-optimization.html

Or else use an exclusion join:

SELECT COUNT(t1.table1_id) FROM table1 t1 
LEFT OUTER JOIN table2 t2 USING (table1_id)
WHERE t2.table1_id IS NULL

Also, make sure table2.table1_id has an index on it.

Bill Karwin
  • 538,548
  • 86
  • 673
  • 828
  • So after reading through the link you provided, **NOT IN** resorts to Materialization or Exists, in my case Exists....so should I use the exclusion JOIN or the NOT EXISTS strategy? – A.O. Sep 09 '14 at 23:10
  • 1
    I'd stick with the exclusion join, because the materialization creates a temporary table. You should learn to use EXPLAIN to examine the optimization plan so you can test both query forms yourself. – Bill Karwin Sep 10 '14 at 00:04