After reading the many threads on ways to find the values in one table but not in another like (SQL - find records from one table which don't exist in another) I am trying to decide on the most efficient query to use.
Unfortunately the two tables I am working with do not have unique keys, are 30+ columns wide, 2-30 million rows long (usually on the 2-6 million side), and for a row to match, all values in the row must match.
So far after countless hours of searching I have found basically the following 4 queries in one form or another. The first 3 queries and tables below are referenced from the linked thread and its accepted answer:
Phone_book
+----+------+--------------+
| id | name | phone_number |
+----+------+--------------+
| 1 | John | 111111111111 |
+----+------+--------------+
| 2 | Jane | 222222222222 |
+----+------+--------------+
Call
+----+------+--------------+
| id | date | phone_number |
+----+------+--------------+
| 1 | 0945 | 111111111111 |
+----+------+--------------+
| 2 | 0950 | 222222222222 |
+----+------+--------------+
| 3 | 1045 | 333333333333 |
+----+------+--------------+
Query 1
SELECT *
FROM Call
WHERE phone_number NOT IN (SELECT phone_number FROM Phone_book)
Query 2
SELECT *
FROM Call
WHERE NOT EXISTS
(SELECT *
FROM Phone_book
WHERE Phone_book.phone_number = Call.phone_number)
Query 3
SELECT *
FROM Call
LEFT OUTER JOIN Phone_Book
ON (Call.phone_number = Phone_book.phone_number)
WHERE Phone_book.phone_number IS NULL
In addition to these 3 queries I have also tried this:
Query 4
SELECT a.* FROM CALL a NATURAL LEFT JOIN Phone_book b WHERE b.id IS NULL
Query 4 is unacceptably slow but here for reference. I am in the process of testing the others, however this is taking some time.
In my specific application, I will also be adding the results (rows found in Call but not in Phone_book) to their own table. This leads to three questions:
Which query is fastest?
Which query is fastest when forced to compare all column values in a row instead of just phone_number?
Which query when combined with an insert statement will be fastest and not take up huge amounts of results buffer space?
Edit: As requested, here is some info related to my specific setup. I am starting from scratch and have removed the indexes I put in previously. For testing the queries above I had indexed F0 in OldImportResults and in NewImportResults.
Table creation statement for OldImportResults
CREATE TABLE `OldImportResults` (
`F0` varchar(9) DEFAULT NULL,
`F1` varchar(1) DEFAULT NULL,
`F2` varchar(3) DEFAULT NULL,
`F3` varchar(1) DEFAULT NULL,
`F4` bigint(11) DEFAULT NULL,
`F5` varchar(3) DEFAULT NULL,
`F6` varchar(3) DEFAULT NULL,
`F7` varchar(118) DEFAULT NULL,
`F8` varchar(30) DEFAULT NULL,
`F9` varchar(2) DEFAULT NULL,
`F10` varchar(9) DEFAULT NULL,
`F11` varchar(38) DEFAULT NULL,
`F12` varchar(38) DEFAULT NULL,
`F13` varchar(8) DEFAULT NULL,
`F14` int(8) DEFAULT NULL,
`F15` varchar(9) DEFAULT NULL,
`F16` varchar(25) DEFAULT NULL,
`F17` varchar(8) DEFAULT NULL,
`F18` varchar(1) DEFAULT NULL,
`F19` varchar(100) DEFAULT NULL,
`F20` bigint(19) DEFAULT NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
Table creation statement for NewImportResults:
CREATE TABLE `NewImportResults` (
`F0` varchar(11) DEFAULT NULL,
`F1` varchar(1) DEFAULT NULL,
`F2` varchar(3) DEFAULT NULL,
`F3` varchar(1) DEFAULT NULL,
`F4` bigint(11) DEFAULT NULL,
`F5` varchar(3) DEFAULT NULL,
`F6` varchar(3) DEFAULT NULL,
`F7` varchar(110) DEFAULT NULL,
`F8` varchar(30) DEFAULT NULL,
`F9` varchar(2) DEFAULT NULL,
`F10` varchar(9) DEFAULT NULL,
`F11` varchar(33) DEFAULT NULL,
`F12` varchar(34) DEFAULT NULL,
`F13` varchar(8) DEFAULT NULL,
`F14` int(8) DEFAULT NULL,
`F15` varchar(9) DEFAULT NULL,
`F16` varchar(25) DEFAULT NULL,
`F17` varchar(8) DEFAULT NULL,
`F18` varchar(1) DEFAULT NULL,
`F19` varchar(100) DEFAULT NULL,
`F20` bigint(19) DEFAULT NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
Table creation statement for HistoricImportResults
CREATE TABLE `HistoricImportResults` (
`F0` varchar(9) DEFAULT NULL,
`F1` varchar(1) DEFAULT NULL,
`F2` varchar(3) DEFAULT NULL,
`F3` varchar(1) DEFAULT NULL,
`F4` bigint(11) DEFAULT NULL,
`F5` varchar(3) DEFAULT NULL,
`F6` varchar(3) DEFAULT NULL,
`F7` varchar(118) DEFAULT NULL,
`F8` varchar(30) DEFAULT NULL,
`F9` varchar(2) DEFAULT NULL,
`F10` varchar(9) DEFAULT NULL,
`F11` varchar(38) DEFAULT NULL,
`F12` varchar(38) DEFAULT NULL,
`F13` varchar(8) DEFAULT NULL,
`F14` int(8) DEFAULT NULL,
`F15` varchar(9) DEFAULT NULL,
`F16` varchar(25) DEFAULT NULL,
`F17` varchar(8) DEFAULT NULL,
`F18` varchar(1) DEFAULT NULL,
`F19` varchar(100) DEFAULT NULL,
`F20` bigint(19) DEFAULT NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
I would like to get all rows that are in OldImportResults but not in NewImportResults and put them in HistoricImportResults.
Here are some explains and their results:
EXPLAIN
SELECT * FROM OldImportResults WHERE NOT EXISTS (
SELECT * FROM NewImportResults
WHERE OldImportResults.F0 = NewImportResults.F0
);
+----+--------------------+---------------------------------+------+---------------+------+---------+------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+---------------------------------+------+---------------+------+---------+------+---------+-------------+
| 1 | PRIMARY | OldImportResults | ALL | NULL | NULL | NULL | NULL | 2074378 | Using where |
| 2 | DEPENDENT SUBQUERY | NewImportResults | ALL | NULL | NULL | NULL | NULL | 2074378 | Using where |
+----+--------------------+---------------------------------+------+---------------+------+---------+------+---------+-------------+
EXPLAIN
SELECT *
FROM OldImportResults
LEFT OUTER JOIN NewImportResults
ON (OldImportResults.F0 = NewImportResults.F0)
WHERE NewImportResults.F0 IS NULL;
+----+-------------+---------------------------------+------+---------------+------+---------+------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------------------------------+------+---------------+------+---------+------+---------+-------------+
| 1 | SIMPLE | OldImportResults | ALL | NULL | NULL | NULL | NULL | 2074378 | |
| 1 | SIMPLE | NewImportResults | ALL | NULL | NULL | NULL | NULL | 2074378 | Using where |
+----+-------------+---------------------------------+------+---------------+------+---------+------+---------+-------------+
EXPLAIN
SELECT a.*
FROM OldImportResults a
NATURAL LEFT JOIN NewImportResults b
WHERE b.F0 IS NULL;
+----+-------------+-------+------+---------------+------+---------+------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+------+---------+------+---------+-------------+
| 1 | SIMPLE | a | ALL | NULL | NULL | NULL | NULL | 2074378 | |
| 1 | SIMPLE | b | ALL | NULL | NULL | NULL | NULL | 2074378 | Using where |
+----+-------------+-------+------+---------------+------+---------+------+---------+-------------+
EDIT:
I decided to index F0 on each table as illustrated below.
SHOW INDEX FROM OldImportResults;
+----------------------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+----------------------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| OldImportResults | 1 | F0 | 1 | F0 | A | 6627 | NULL | NULL | YES | BTREE | |
+----------------------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
SHOW INDEX FROM NewImportResults;
+---------------------------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+---------------------------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| NewImportResults | 1 | F0 | 1 | F0 | A | 6627 | NULL | NULL | YES | BTREE | |
+---------------------------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
When doing an EXPLAIN on each query I got this:
EXPLAIN
SELECT * FROM OldImportResults WHERE NOT EXISTS (
SELECT * FROM NewImportResults
WHERE OldImportResults.F0 = NewImportResults.F0
);
+----+--------------------+---------------------------------+------+---------------+------+---------+------------------------------------------------+---------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+---------------------------------+------+---------------+------+---------+------------------------------------------------+---------+--------------------------+
| 1 | PRIMARY | OldImportResults | ALL | NULL | NULL | NULL | NULL | 2074378 | Using where |
| 2 | DEPENDENT SUBQUERY | NewImportResults | ref | F0 | F0 | 12 | DBName.OldImportResults.F0 | 313 | Using where; Using index |
+----+--------------------+---------------------------------+------+---------------+------+---------+------------------------------------------------+---------+--------------------------+
EXPLAIN
SELECT *
FROM OldImportResults
LEFT OUTER JOIN NewImportResults
ON (OldImportResults.F0 = NewImportResults.F0)
WHERE NewImportResults.F0 IS NULL;
+----+-------------+---------------------------------+------+---------------+------+---------+------------------------------------------------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------------------------------+------+---------------+------+---------+------------------------------------------------+---------+-------------+
| 1 | SIMPLE | OldImportResults | ALL | NULL | NULL | NULL | NULL | 2074378 | |
| 1 | SIMPLE | NewImportResults | ref | F0 | F0 | 12 | DBName.OldImportResults.F0 | 313 | Using where |
+----+-------------+---------------------------------+------+---------------+------+---------+------------------------------------------------+---------+-------------+
EXPLAIN
SELECT a.*
FROM OldImportResults a
NATURAL LEFT JOIN NewImportResults b
WHERE b.F0 IS NULL;
+----+-------------+-------+------+---------------+------+---------+-----------------------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+------+---------+-----------------------+---------+-------------+
| 1 | SIMPLE | a | ALL | NULL | NULL | NULL | NULL | 2074378 | |
| 1 | SIMPLE | b | ref | F0 | F0 | 12 | DBName.a.F0 | 313 | Using where |
+----+-------------+-------+------+---------------+------+---------+-----------------------+---------+-------------+
In light of the EXPLAIN results it appears that the WHERE NOT EXISTS statement will execute much more quickly as it is using an index. Execution takes 41 seconds which for my purposes is not terrible. The others aren't worth testing at this point. Does anyone think they can in any way improve execution time?