0

i have two tables:

table 'A'
| id | name  |
| 1  | Larry |
| 2  | Maria |
| 3  | Ponyo |
| 4  | Panda |


table 'B'
| m_id | items |
| 1    |  7    |
| 2    |  9    |

I just want to display the records from table 'A' that are NOT on table 'B'. So that would be

| 3  | Ponyo |
| 4  | Panda |

only.

potashin
  • 44,205
  • 11
  • 83
  • 107
hugseirvak
  • 315
  • 1
  • 2
  • 11

1 Answers1

2

An anti-join pattern is usually the most efficient approach, although there are several ways to get the same result.

SELECT a.id
     , a.name
  FROM table_a a
  LEFT
  JOIN table_b b
    ON b.id = a.id
 WHERE b.id IS NULL

Let's unpack that a bit.

The LEFT [OUTER] JOIN operation gets us all rows from a, along with matching rows from b. The "trick" is to filter out all the rows that had a match; to do that, we use a predicate in the WHERE clause, that checks for a NULL value from b that we know won't be NULL if a match was found.

In this case, if we found a match, we know b.id is not null, since b.id = a.id wouldn't return TRUE if b.id was NULL.

The anti-join won't create any "duplicate" rows from a (like a regular join can do). If you need to eliminate "duplicates" that already exist in a, adding a GROUP BY clause or adding the DISTINCT keyword before the select list is the way to go.


There are other approaches, like using a NOT EXISTS predicate with a correlated subquery, or a NOT IN with a subquery, but those forms are usually not as efficient.


FOLLOWUP

Actual performance of the queries is going to depend on several factors; having suitable indexes available is probably the biggest factor. The nullability of columns involved in predicates plays a role in the execution plan, as does cardinality, distribution of values, etc., MySQL version, and configuration of the server (e.g. innodb pool size)

As a test case:

SHOW VARIABLES LIKE 'version'
-- Variable_name  Value                        
-- -------------  -----------------------------
-- version        5.5.35-0ubuntu0.12.04.2-log  

CREATE TABLE `table_a` (
  `id` INT(10) UNSIGNED NOT NULL AUTO_INCREMENT,
  `name_` VARCHAR(20) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=INNODB;

CREATE TABLE `table_b` (
  `a_id` INT(10) UNSIGNED NOT NULL DEFAULT '0',
  `item` INT(10) UNSIGNED NOT NULL DEFAULT '0',
  PRIMARY KEY (`a_id`,`item`)
) ENGINE=INNODB;

-- table_a  1,000,000 rows, id values   1 through 1000000
-- table_b    990,000 rows, a_id values 1 through 1000000 (less a_id MOD 100 = 0)

left-join-where

-- EXPLAIN
SELECT /*! SQL_NO_CACHE */ a.id
     , a.name_
  FROM table_a a
  LEFT
  JOIN table_b b
    ON b.a_id = a.id
 WHERE b.a_id IS NULL

not-in

-- EXPLAIN 
SELECT /*! SQL_NO_CACHE */ a.id
     , a.name_
  FROM table_a a
 WHERE a.id NOT IN (SELECT b.a_id FROM table_b b)

not-exists

-- EXPLAIN 
SELECT /*! SQL_NO_CACHE */ a.id
     , a.name_
  FROM table_a a
 WHERE NOT EXISTS 
       (SELECT 1
          FROM table_b b
         WHERE b.a_id = a.id)

Performance results (in seconds):

                  run 2  run 3  run 4  run 5  avg
                  -----  -----  -----  -----  -----
left-join-where   0.227  0.227  0.227  0.227  0.227 
not-in            0.233  0.233  0.234  0.233  0.233
not-exists        1.031  1.029  1.032  1.031  1.031

EXPLAIN output for the three queries:

left-join-where  
id  select_type        table type           possible_ key     key_len ref       rows Extra                                 
--  -----------        ----- -------------- --------- ------- ------- ------ ------- ------------------------------------
 1  SIMPLE             a     ALL                                             1000392                                       
 1  SIMPLE             b     ref            PRIMARY   PRIMARY 4       a.id         1 Using where; Using index; Not exists

not-in
id  select_type        table type           possible_ key     key_len ref       rows Extra        
--  ------------------ ----- -------------- --------- ------- ------- ------ ------- ------------------------------------
 1  PRIMARY            a     ALL                                             1000392 Using where  
 2  DEPENDENT SUBQUERY b     index_subquery PRIMARY   PRIMARY 4       func         1 Using index 

not-exists
id  select_type        table type           possible_ key     key_len ref       rows Extra        
--  ------------------ ----- ------         --------- ------- ------- ------ ------- ------------------------------------
 1  PRIMARY            a     ALL                                             1000392 Using where  
 2  DEPENDENT SUBQUERY b     ref            PRIMARY   PRIMARY 4       a.id         1 Using index
spencer7593
  • 106,611
  • 15
  • 112
  • 140
  • Please have a look at [***LEFT JOIN / IS NULL vs. NOT IN vs. NOT EXISTS: nullable columns***](http://planet.mysql.com/entry/?id=24888) – M Khalid Junaid May 01 '14 at 22:46
  • @M Khalid Junaid: Performance of queries depends on several factors. Availability of suitable indexes for efficient access plans, nullability of columns involved in predicates, cardinality, distribution of values, storage engine, MySQL version, server configuration, and so on. In some cases, a `NOT IN` will give the best performance, sometimes a `LEFT JOIN WHERE NULL` will give the best performance, and there's even cases where a `NOT EXISTS` will outperform both. – spencer7593 May 01 '14 at 23:50