An anti-join pattern is usually the most efficient approach, although there are several ways to get the same result.
SELECT a.id
, a.name
FROM table_a a
LEFT
JOIN table_b b
ON b.id = a.id
WHERE b.id IS NULL
Let's unpack that a bit.
The LEFT [OUTER] JOIN
operation gets us all rows from a, along with matching rows from b. The "trick" is to filter out all the rows that had a match; to do that, we use a predicate in the WHERE clause, that checks for a NULL value from b that we know won't be NULL if a match was found.
In this case, if we found a match, we know b.id is not null, since b.id = a.id
wouldn't return TRUE if b.id was NULL.
The anti-join won't create any "duplicate" rows from a (like a regular join can do). If you need to eliminate "duplicates" that already exist in a, adding a GROUP BY clause or adding the DISTINCT keyword before the select list is the way to go.
There are other approaches, like using a NOT EXISTS
predicate with a correlated subquery, or a NOT IN
with a subquery, but those forms are usually not as efficient.
FOLLOWUP
Actual performance of the queries is going to depend on several factors; having suitable indexes available is probably the biggest factor. The nullability of columns involved in predicates plays a role in the execution plan, as does cardinality, distribution of values, etc., MySQL version, and configuration of the server (e.g. innodb pool size)
As a test case:
SHOW VARIABLES LIKE 'version'
-- Variable_name Value
-- ------------- -----------------------------
-- version 5.5.35-0ubuntu0.12.04.2-log
CREATE TABLE `table_a` (
`id` INT(10) UNSIGNED NOT NULL AUTO_INCREMENT,
`name_` VARCHAR(20) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=INNODB;
CREATE TABLE `table_b` (
`a_id` INT(10) UNSIGNED NOT NULL DEFAULT '0',
`item` INT(10) UNSIGNED NOT NULL DEFAULT '0',
PRIMARY KEY (`a_id`,`item`)
) ENGINE=INNODB;
-- table_a 1,000,000 rows, id values 1 through 1000000
-- table_b 990,000 rows, a_id values 1 through 1000000 (less a_id MOD 100 = 0)
left-join-where
-- EXPLAIN
SELECT /*! SQL_NO_CACHE */ a.id
, a.name_
FROM table_a a
LEFT
JOIN table_b b
ON b.a_id = a.id
WHERE b.a_id IS NULL
not-in
-- EXPLAIN
SELECT /*! SQL_NO_CACHE */ a.id
, a.name_
FROM table_a a
WHERE a.id NOT IN (SELECT b.a_id FROM table_b b)
not-exists
-- EXPLAIN
SELECT /*! SQL_NO_CACHE */ a.id
, a.name_
FROM table_a a
WHERE NOT EXISTS
(SELECT 1
FROM table_b b
WHERE b.a_id = a.id)
Performance results (in seconds):
run 2 run 3 run 4 run 5 avg
----- ----- ----- ----- -----
left-join-where 0.227 0.227 0.227 0.227 0.227
not-in 0.233 0.233 0.234 0.233 0.233
not-exists 1.031 1.029 1.032 1.031 1.031
EXPLAIN output for the three queries:
left-join-where
id select_type table type possible_ key key_len ref rows Extra
-- ----------- ----- -------------- --------- ------- ------- ------ ------- ------------------------------------
1 SIMPLE a ALL 1000392
1 SIMPLE b ref PRIMARY PRIMARY 4 a.id 1 Using where; Using index; Not exists
not-in
id select_type table type possible_ key key_len ref rows Extra
-- ------------------ ----- -------------- --------- ------- ------- ------ ------- ------------------------------------
1 PRIMARY a ALL 1000392 Using where
2 DEPENDENT SUBQUERY b index_subquery PRIMARY PRIMARY 4 func 1 Using index
not-exists
id select_type table type possible_ key key_len ref rows Extra
-- ------------------ ----- ------ --------- ------- ------- ------ ------- ------------------------------------
1 PRIMARY a ALL 1000392 Using where
2 DEPENDENT SUBQUERY b ref PRIMARY PRIMARY 4 a.id 1 Using index