1

After reading the many threads on ways to find the values in one table but not in another like (SQL - find records from one table which don't exist in another) I am trying to decide on the most efficient query to use.

Unfortunately the two tables I am working with do not have unique keys, are 30+ columns wide, 2-30 million rows long (usually on the 2-6 million side), and for a row to match, all values in the row must match.

So far after countless hours of searching I have found basically the following 4 queries in one form or another. The first 3 queries and tables below are referenced from the linked thread and its accepted answer:

Phone_book
+----+------+--------------+
| id | name | phone_number |
+----+------+--------------+
| 1  | John | 111111111111 |
+----+------+--------------+
| 2  | Jane | 222222222222 |
+----+------+--------------+

Call
+----+------+--------------+
| id | date | phone_number |
+----+------+--------------+
| 1  | 0945 | 111111111111 |
+----+------+--------------+
| 2  | 0950 | 222222222222 |
+----+------+--------------+
| 3  | 1045 | 333333333333 |
+----+------+--------------+

Query 1

SELECT  *
FROM    Call
WHERE   phone_number NOT IN (SELECT phone_number FROM Phone_book)

Query 2

SELECT *
FROM   Call
WHERE  NOT EXISTS
  (SELECT *
   FROM   Phone_book
   WHERE  Phone_book.phone_number = Call.phone_number)

Query 3

SELECT * 
FROM   Call
LEFT OUTER JOIN Phone_Book
  ON (Call.phone_number = Phone_book.phone_number)
  WHERE Phone_book.phone_number IS NULL

In addition to these 3 queries I have also tried this:

Query 4

SELECT a.* FROM CALL a NATURAL LEFT JOIN Phone_book b WHERE b.id IS NULL

Query 4 is unacceptably slow but here for reference. I am in the process of testing the others, however this is taking some time.

In my specific application, I will also be adding the results (rows found in Call but not in Phone_book) to their own table. This leads to three questions:

  1. Which query is fastest?

  2. Which query is fastest when forced to compare all column values in a row instead of just phone_number?

  3. Which query when combined with an insert statement will be fastest and not take up huge amounts of results buffer space?

Edit: As requested, here is some info related to my specific setup. I am starting from scratch and have removed the indexes I put in previously. For testing the queries above I had indexed F0 in OldImportResults and in NewImportResults.

Table creation statement for OldImportResults

CREATE TABLE `OldImportResults` (
  `F0` varchar(9) DEFAULT NULL,
  `F1` varchar(1) DEFAULT NULL,
  `F2` varchar(3) DEFAULT NULL,
  `F3` varchar(1) DEFAULT NULL,
  `F4` bigint(11) DEFAULT NULL,
  `F5` varchar(3) DEFAULT NULL,
  `F6` varchar(3) DEFAULT NULL,
  `F7` varchar(118) DEFAULT NULL,
  `F8` varchar(30) DEFAULT NULL,
  `F9` varchar(2) DEFAULT NULL,
  `F10` varchar(9) DEFAULT NULL,
  `F11` varchar(38) DEFAULT NULL,
  `F12` varchar(38) DEFAULT NULL,
  `F13` varchar(8) DEFAULT NULL,
  `F14` int(8) DEFAULT NULL,
  `F15` varchar(9) DEFAULT NULL,
  `F16` varchar(25) DEFAULT NULL,
  `F17` varchar(8) DEFAULT NULL,
  `F18` varchar(1) DEFAULT NULL,
  `F19` varchar(100) DEFAULT NULL,
  `F20` bigint(19) DEFAULT NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

Table creation statement for NewImportResults:

CREATE TABLE `NewImportResults` (
  `F0` varchar(11) DEFAULT NULL,
  `F1` varchar(1) DEFAULT NULL,
  `F2` varchar(3) DEFAULT NULL,
  `F3` varchar(1) DEFAULT NULL,
  `F4` bigint(11) DEFAULT NULL,
  `F5` varchar(3) DEFAULT NULL,
  `F6` varchar(3) DEFAULT NULL,
  `F7` varchar(110) DEFAULT NULL,
  `F8` varchar(30) DEFAULT NULL,
  `F9` varchar(2) DEFAULT NULL,
  `F10` varchar(9) DEFAULT NULL,
  `F11` varchar(33) DEFAULT NULL,
  `F12` varchar(34) DEFAULT NULL,
  `F13` varchar(8) DEFAULT NULL,
  `F14` int(8) DEFAULT NULL,
  `F15` varchar(9) DEFAULT NULL,
  `F16` varchar(25) DEFAULT NULL,
  `F17` varchar(8) DEFAULT NULL,
  `F18` varchar(1) DEFAULT NULL,
  `F19` varchar(100) DEFAULT NULL,
  `F20` bigint(19) DEFAULT NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

Table creation statement for HistoricImportResults

CREATE TABLE `HistoricImportResults` (
  `F0` varchar(9) DEFAULT NULL,
  `F1` varchar(1) DEFAULT NULL,
  `F2` varchar(3) DEFAULT NULL,
  `F3` varchar(1) DEFAULT NULL,
  `F4` bigint(11) DEFAULT NULL,
  `F5` varchar(3) DEFAULT NULL,
  `F6` varchar(3) DEFAULT NULL,
  `F7` varchar(118) DEFAULT NULL,
  `F8` varchar(30) DEFAULT NULL,
  `F9` varchar(2) DEFAULT NULL,
  `F10` varchar(9) DEFAULT NULL,
  `F11` varchar(38) DEFAULT NULL,
  `F12` varchar(38) DEFAULT NULL,
  `F13` varchar(8) DEFAULT NULL,
  `F14` int(8) DEFAULT NULL,
  `F15` varchar(9) DEFAULT NULL,
  `F16` varchar(25) DEFAULT NULL,
  `F17` varchar(8) DEFAULT NULL,
  `F18` varchar(1) DEFAULT NULL,
  `F19` varchar(100) DEFAULT NULL,
  `F20` bigint(19) DEFAULT NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

I would like to get all rows that are in OldImportResults but not in NewImportResults and put them in HistoricImportResults.

Here are some explains and their results:

EXPLAIN 
SELECT * FROM OldImportResults WHERE NOT EXISTS (
SELECT * FROM NewImportResults 
WHERE OldImportResults.F0 = NewImportResults.F0
);

+----+--------------------+---------------------------------+------+---------------+------+---------+------+---------+-------------+
| id | select_type        | table                           | type | possible_keys | key  | key_len | ref  | rows    | Extra       |
+----+--------------------+---------------------------------+------+---------------+------+---------+------+---------+-------------+
|  1 | PRIMARY            | OldImportResults                | ALL  | NULL          | NULL | NULL    | NULL | 2074378 | Using where |
|  2 | DEPENDENT SUBQUERY | NewImportResults                | ALL  | NULL          | NULL | NULL    | NULL | 2074378 | Using where |
+----+--------------------+---------------------------------+------+---------------+------+---------+------+---------+-------------+



EXPLAIN
SELECT * 
FROM OldImportResults
LEFT OUTER JOIN NewImportResults
  ON (OldImportResults.F0 = NewImportResults.F0)
  WHERE NewImportResults.F0 IS NULL;

+----+-------------+---------------------------------+------+---------------+------+---------+------+---------+-------------+
| id | select_type | table                           | type | possible_keys | key  | key_len | ref  | rows    | Extra       |
+----+-------------+---------------------------------+------+---------------+------+---------+------+---------+-------------+
|  1 | SIMPLE      | OldImportResults                | ALL  | NULL          | NULL | NULL    | NULL | 2074378 |             |
|  1 | SIMPLE      | NewImportResults                | ALL  | NULL          | NULL | NULL    | NULL | 2074378 | Using where |
+----+-------------+---------------------------------+------+---------------+------+---------+------+---------+-------------+



EXPLAIN
SELECT a.*
FROM OldImportResults a
NATURAL LEFT JOIN NewImportResults b
WHERE b.F0 IS NULL;

+----+-------------+-------+------+---------------+------+---------+------+---------+-------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows    | Extra       |
+----+-------------+-------+------+---------------+------+---------+------+---------+-------------+
|  1 | SIMPLE      | a     | ALL  | NULL          | NULL | NULL    | NULL | 2074378 |             |
|  1 | SIMPLE      | b     | ALL  | NULL          | NULL | NULL    | NULL | 2074378 | Using where |
+----+-------------+-------+------+---------------+------+---------+------+---------+-------------+

EDIT:

I decided to index F0 on each table as illustrated below.

SHOW INDEX FROM OldImportResults;

+----------------------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| Table                      | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+----------------------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| OldImportResults           |          1 | F0       |            1 | F0          | A         |        6627 |     NULL | NULL   | YES  | BTREE      |         |
+----------------------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+


SHOW INDEX FROM NewImportResults;

+---------------------------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| Table                           | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+---------------------------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| NewImportResults                |          1 | F0       |            1 | F0          | A         |        6627 |     NULL | NULL   | YES  | BTREE      |         |
+---------------------------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+

When doing an EXPLAIN on each query I got this:

EXPLAIN 
SELECT * FROM OldImportResults WHERE NOT EXISTS (
SELECT * FROM NewImportResults 
WHERE OldImportResults.F0 = NewImportResults.F0
);

+----+--------------------+---------------------------------+------+---------------+------+---------+------------------------------------------------+---------+--------------------------+
| id | select_type        | table                           | type | possible_keys | key  | key_len | ref                                            | rows    | Extra                    |
+----+--------------------+---------------------------------+------+---------------+------+---------+------------------------------------------------+---------+--------------------------+
|  1 | PRIMARY            | OldImportResults                | ALL  | NULL          | NULL | NULL    | NULL                                           | 2074378 | Using where              |
|  2 | DEPENDENT SUBQUERY | NewImportResults                | ref  | F0            | F0   | 12      | DBName.OldImportResults.F0                     |     313 | Using where; Using index |
+----+--------------------+---------------------------------+------+---------------+------+---------+------------------------------------------------+---------+--------------------------+


EXPLAIN
SELECT * 
FROM OldImportResults
LEFT OUTER JOIN NewImportResults
  ON (OldImportResults.F0 = NewImportResults.F0)
  WHERE NewImportResults.F0 IS NULL;

+----+-------------+---------------------------------+------+---------------+------+---------+------------------------------------------------+---------+-------------+
| id | select_type | table                           | type | possible_keys | key  | key_len | ref                                            | rows    | Extra       |
+----+-------------+---------------------------------+------+---------------+------+---------+------------------------------------------------+---------+-------------+
|  1 | SIMPLE      | OldImportResults                | ALL  | NULL          | NULL | NULL    | NULL                                           | 2074378 |             |
|  1 | SIMPLE      | NewImportResults                | ref  | F0            | F0   | 12      | DBName.OldImportResults.F0                     |     313 | Using where |
+----+-------------+---------------------------------+------+---------------+------+---------+------------------------------------------------+---------+-------------+


EXPLAIN
SELECT a.*
FROM OldImportResults a
NATURAL LEFT JOIN NewImportResults b
WHERE b.F0 IS NULL;

+----+-------------+-------+------+---------------+------+---------+-----------------------+---------+-------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref                   | rows    | Extra       |
+----+-------------+-------+------+---------------+------+---------+-----------------------+---------+-------------+
|  1 | SIMPLE      | a     | ALL  | NULL          | NULL | NULL    | NULL                  | 2074378 |             |
|  1 | SIMPLE      | b     | ref  | F0            | F0   | 12      | DBName.a.F0           |     313 | Using where |
+----+-------------+-------+------+---------------+------+---------+-----------------------+---------+-------------+

In light of the EXPLAIN results it appears that the WHERE NOT EXISTS statement will execute much more quickly as it is using an index. Execution takes 41 seconds which for my purposes is not terrible. The others aren't worth testing at this point. Does anyone think they can in any way improve execution time?

Community
  • 1
  • 1
user2395126
  • 526
  • 1
  • 7
  • 20
  • 2
    Please show the explain plan. Which query was faster? – Marcus Adams Feb 03 '15 at 16:32
  • I would be willing to bet that query #3 will be the fastest, provided that the phone number field is indexed. (Does not matter if it is not unique.) Queries #1 and #2 will only execute as fast as query #3 if the optimizer of mysql internally converts them to something akin to #3. – Mike Nakis Feb 03 '15 at 16:35
  • I will post the results after I am able to finish the tests. Preliminary tests for queries 2, 3, and 4 are running now as I don't think query 1 will do a good job comparing all columns in a row. Once I know which 2 are fastest or if two are close, I will run them each separately with nothing else running on the server. If anyone knows how the results cache will be affected by enclosing each in an insert statement, I would really appreciate some input. – user2395126 Feb 03 '15 at 17:07
  • Have you given thought to creating a surrogate hash column on each table which can be indexed and reflect the uniqueness of the row for purposes of comparison to other rows? – Mike Brant Feb 03 '15 at 17:10
  • To optimize a query, **we need to see the table and index definitions**, as well as row counts for each of the tables. Maybe your tables are defined poorly. Maybe the indexes aren't created correctly. Maybe you don't have an index on that column you thought you did. Without seeing the table and index definitions, we can't tell. We also need row counts because that can affect query optimization greatly. If you know how to do an `EXPLAIN` or get an execution plan, put the results in the question as well. If you have no indexes, visit http://use-the-index-luke.com ASAP. – Andy Lester Feb 03 '15 at 17:12
  • @MikeBrant yes, I did consider doing this. The problem is that there could be repeats of rows. – user2395126 Feb 03 '15 at 17:17
  • @AndyLester I am trying to keep things at the hypothetical stage so that the question is useful for as many people as possible. For my own situation I would say average 30 columns per row, 4-6 million rows per table, maximum 30 million rows per table. Possible duplicate entries, and no keys. Also, no connections to other tables. Indexing column(s) will be a major factor. For my tests, I am just indexing my equivalent of phone_number and id for query 4 for now. Will post EXPLAINs after the tests are done. Also, very nice link - should be required reading for all MySQL developers. – user2395126 Feb 03 '15 at 17:30
  • Why would that the rows not being unique be a problem. At least you could have a hash that you could index and use for `SELECT DISTINCT` as well as your joins. – Mike Brant Feb 03 '15 at 17:31
  • Hypothetical is not necessarily useful when it comes to query optimizations. The specific of your schema (column types, indexes, relationships between tables) and your data (number of rows, cardinality, etc.) will greatly impact how to optimize a query. – Mike Brant Feb 03 '15 at 17:33
  • @MikeBrant ah I see! That is a very good idea. Let me throw something together really quick to test the hash and SELECT DISTINCT method. I will also post more data relevant to my particular situation. – user2395126 Feb 03 '15 at 17:37
  • @MikeBrant I added the table creation statements as well as the results of some explains. I removed the previous indexes. Please let me know if you would like to see anything else. – user2395126 Feb 03 '15 at 18:58

0 Answers0