1

I have two tables, one with about 700 rows, and another with just over 100,000 rows.

My query is to find all of the rows in the first table where the email address doesn't exist in the 2nd table.

SELECT a.* 
FROM `csv_import_temp` a FORCE INDEX FOR JOIN (`imp_email`)
LEFT JOIN `xlefz_mailer_Contacts` b FORCE INDEX FOR JOIN (`idx_contact_email`)
ON a.`contact_email` like b.`contact_email`
WHERE b.`contact_email` is null

There is an index on contact_email in both tables. The explain for the query shows that only one of the indexes is used, even after I added the FORCE INDEX clauses.

id  | select_type  | table  | type   | possible_keys  | key                | key_len  | ref  | rows    | Extra                    | 
 1  | SIMPLE       | a      | ALL    | NULL           | NULL               | NULL     | NULL | 292     | 
 1  | SIMPLE       | b      | index  | NULL           | idx_contact_email  | 228      | NULL | 106149  | Using where; Using index

Currently the query takes just over 6 seconds to run.

How can I get the query to use both of the indexes?

1st Table:

CREATE TABLE `csv_import_temp` (
 `id` int(10) unsigned NOT NULL AUTO_INCREMENT COMMENT 'Primary Key',
 `contact_lname` varchar(50) COLLATE utf8_unicode_ci NOT NULL DEFAULT '' COMMENT 'Last or Family name',
 `contact_email` varchar(75) COLLATE utf8_unicode_ci DEFAULT '' COMMENT 'Email Address',
 `contact_state` varchar(20) COLLATE utf8_unicode_ci DEFAULT '' COMMENT 'State or provence',
 `contact_country` varchar(25) CHARACTER SET utf8 DEFAULT 'USA' COMMENT 'Country Name or Code',
 `contact_oktoemail` enum('Yes','No') COLLATE utf8_unicode_ci DEFAULT 'Yes' COMMENT 'Yes if OK to send Email',
 `contact_oktofax` enum('Yes','No') COLLATE utf8_unicode_ci DEFAULT 'Yes' COMMENT 'Yes of Ok to send Fax',
 `contact_oktomail` enum('Yes','No') COLLATE utf8_unicode_ci DEFAULT 'Yes' COMMENT 'Yes of OK to send mail',
 `contact_status` enum('Ok','Unsub','Bounced','BouncedUpdated','isAOL') COLLATE utf8_unicode_ci DEFAULT 'Ok' COMMENT 'Ok - can contact via email, Unsub if unsubscribe request has been received. Bounced = Mail sent to email address bounced.',
 PRIMARY KEY (`id`),
 KEY `imp_email` (`contact_email`)
) ENGINE=MyISAM AUTO_INCREMENT=293 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci

2nd Table:

CREATE TABLE `xlefz_mailer_Contacts` (
 `contact_id` int(10) unsigned NOT NULL AUTO_INCREMENT COMMENT 'Primary Key',
 `contact_created` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00' COMMENT 'Timestamp when record was created',
 `contact_modified` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00' COMMENT 'Timestamp when record was last modified',
 `manual_update` tinyint(1) NOT NULL DEFAULT '0' COMMENT 'If 1, has been manually updated',
 `manual_updateDT` timestamp NULL DEFAULT NULL COMMENT 'Timestamp of last manual update',
 `contact_import_id` int(11) DEFAULT NULL,
 `contact_company` varchar(100) COLLATE utf8_unicode_ci NOT NULL DEFAULT '' COMMENT 'Company Name',
 `contact_title` varchar(20) CHARACTER SET utf8 DEFAULT NULL COMMENT 'Mr., Ms., Dr., etc.',
 `contact_fname` varchar(50) CHARACTER SET utf8 NOT NULL DEFAULT '' COMMENT 'First or given name',
 `contact_lname` varchar(50) COLLATE utf8_unicode_ci NOT NULL DEFAULT '' COMMENT 'Last or Family name',
 `contact_email` varchar(75) COLLATE utf8_unicode_ci DEFAULT NULL COMMENT 'Email Address',
 `contact_email_name` varchar(50) COLLATE utf8_unicode_ci DEFAULT NULL COMMENT 'The mart of an email address before the @',
 `contact_email_domain` varchar(50) COLLATE utf8_unicode_ci DEFAULT NULL COMMENT 'The mart of an email address after the @',
 `contact_addr1` varchar(50) COLLATE utf8_unicode_ci DEFAULT '' COMMENT '1st line of mailing address',
 `contact_addr2` varchar(50) COLLATE utf8_unicode_ci DEFAULT '' COMMENT 'Apartment, suite, etc.',
 `contact_city` varchar(50) COLLATE utf8_unicode_ci DEFAULT '' COMMENT 'City',
 `contact_state` varchar(20) COLLATE utf8_unicode_ci DEFAULT '' COMMENT 'State or provence',
 `contact_zip` varchar(15) COLLATE utf8_unicode_ci DEFAULT '' COMMENT 'Postal code',
 `contact_country` varchar(25) CHARACTER SET utf8 DEFAULT NULL COMMENT 'Country Name or Code',
 `contact_phone` varchar(20) COLLATE utf8_unicode_ci DEFAULT '' COMMENT 'Voice Phone',
 `contact_fax` varchar(20) COLLATE utf8_unicode_ci DEFAULT '' COMMENT 'Fax Phone',
 `contact_oktoemail` enum('Yes','No') COLLATE utf8_unicode_ci DEFAULT 'Yes' COMMENT 'Yes if OK to send Email',
 `contact_oktofax` enum('Yes','No') COLLATE utf8_unicode_ci DEFAULT 'Yes' COMMENT 'Yes of Ok to send Fax',
 `contact_oktomail` enum('Yes','No') COLLATE utf8_unicode_ci DEFAULT 'Yes' COMMENT 'Yes of OK to send mail',
 `contact_notes` varchar(1000) COLLATE utf8_unicode_ci DEFAULT NULL COMMENT 'Notes regarding contact.',
 `contact_status` enum('Ok','Unsub','Bounced','BouncedUpdated','isAOL') COLLATE utf8_unicode_ci DEFAULT 'Ok' COMMENT 'Ok - can contact via email, Unsub if unsubscribe request has been received. Bounced = Mail sent to email address bounced. BouncedUpdated - indicates that the contact has been copied and updated with a new email address',
 PRIMARY KEY (`contact_id`),
 KEY `idx_contact_email` (`contact_email`),
 KEY `idx_contact_city` (`contact_city`),
 KEY `idx_contact_country` (`contact_country`),
 KEY `idx_contact_oktoemail` (`contact_oktoemail`),
 KEY `idx_contact_oktofax` (`contact_oktofax`),
 KEY `idx_contact_oktomail` (`contact_oktomail`),
 KEY `contact_import_id` (`contact_import_id`),
 KEY `idx_contact_oktoemailstatus` (`contact_oktoemail`,`contact_status`),
 KEY `contact_email_name` (`contact_email_name`,`contact_email_domain`),
 KEY `idx_mailer_contacts_manualupdate` (`manual_update`),
 KEY `idx_mailer_contact_manualupdateDT` (`manual_updateDT`)
) ENGINE=MyISAM AUTO_INCREMENT=128829 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
Sloan Thrasher
  • 4,953
  • 3
  • 22
  • 40
  • Any question about query optimization should include the result of `SHOW CREATE TABLE ` for each table in the query, so we can see the definition of your columns and indexes. – Bill Karwin Jun 03 '18 at 15:01
  • Replace `like` with `=` (assuming you do not acually want to compare using wildcards). – Solarflare Jun 03 '18 at 15:06
  • I need to ignore case on the values. I tried using lower() on each and =, but that took even longer to execute. – Sloan Thrasher Jun 03 '18 at 15:08
  • @BillKarwin, I've added the CREATE TABLE statements to the question. – Sloan Thrasher Jun 03 '18 at 15:11
  • @SloanThrasher use COLLATE to do a case sensitive comparison. See https://stackoverflow.com/a/5629129/3361634 – Tim B. Jun 03 '18 at 15:13
  • @TimB. I need to do a case insensitive comparison, not case-sensitive. – Sloan Thrasher Jun 03 '18 at 15:16
  • Can you check performance if where conditions are ignored, I mean how much time does the normal join take?? Coz, you have used `varchar(75)` which helps database indexing for search operations, that's a good thing ! – miiiii Jun 03 '18 at 15:20
  • @MadMan I changed the query removing the ```left``` from the join, and removing the where clause so it returned all rows where the email address was found. That query took over 10 seconds – Sloan Thrasher Jun 03 '18 at 15:27
  • @SloanThrasher Is changing your table structure (including adding a new table) acceptable in a solution for you? – Willem Renzema Jun 03 '18 at 15:34
  • @WillemRenzema I can change the structure of the 1st table ```csv_import_temp```, and possibly the 2nd, if the change is adding an index or similar. What table would you add and why? The 1st table is created for the import, and then dropped. – Sloan Thrasher Jun 03 '18 at 15:40
  • 1
    @SloanThrasher The reason I ask is because the best option, in my experience (which is more than most but less than some, especially the some that frequent here) is that since you are joining on a column that is 228 bytes wide you are never going to see good performance for this number of records, and so you should separate the emails out into a separate table, with an INT for a primary key. That will drop the index from 228 bytes to 4 bytes, which is about a 98% improvement. Smaller index size means more data can be pulled in at once, requiring less disk reads and so speeding up performance. – Willem Renzema Jun 03 '18 at 15:44
  • @WillemRenzema, so a VARCHAR(75) column takes 228 bytes? – Sloan Thrasher Jun 03 '18 at 15:50
  • @SloanThrasher Yes, you can see that in the `key_len` column of your `EXPLAIN` plan. You can compute this yourself, as `utf8` is 3 bytes per character, (unlike the real UTF8 column type `utf8mb4`), plus a little extra as a prefix (I'm unclear on why the extra is 3 and not 1, but that is irrelevant to this question.) 75*3 + 3 = 225+3 = 228. – Willem Renzema Jun 03 '18 at 16:36
  • 1
    @WillemRenzema - The query is not lightly to speed up much by switching from `VARCHAR` to `INT`. This is a common misconception. – Rick James Jun 04 '18 at 00:48
  • 1
    @WillemRenzema - Since it is `VAR`, it is _up to_ that many bytes. The empty string would occupy only 1 or 2 bytes (depending whether 1 or 2 bytes is allocated for the length). You might expect "emails" to average about 40 bytes -- almost always ascii (1 byte per char) and not many chars. – Rick James Jun 04 '18 at 00:51
  • 1
    Why 228? -- `EXPLAIN` is sloppy. It assumes a 2-byte length, even when it when it is actually 1. And it throws in 1 for `NULLability`. Hence, the extra 3 you see. – Rick James Jun 04 '18 at 00:54

1 Answers1

4

To clear up some misconceptions in the question:

  1. In an indexed nested-loops join, an index will only be used for the second table of the join. Hence, specifying FORCE INDEX FOR JOIN on the first table, will have no effect.
  2. When EXPLAIN says that type is "index", it does not mean that the index is used for join look-ups. "index" indicates a full index scan. The index is used in this case because it contains all columns needed for this table. In other words, it is a covering index. ("Using index" in the Extra column of EXPLAIN really means "using index only")
  3. When a join condition is based on LIKE, an index can not be used. The columns may contain wild-cards that would not be possible to find matches for with an index look-up.
  4. Comparisons using = is not necessarily case-sensitive. That will depend on the collation in use. In your case, both the involved columns are defined with a case-insensitive collation (utf8_unicode_ci). Hence, comparison will be case-insensitive.
  5. The use of MyISAM ;-)

To sum up, you should use equality for the join condition, or you could rewrite the query to the following, IMHO, simpler query:

SELECT * 
FROM csv_import_temp
WHERE contact_email NOT IN (
    SELECT contact_email FROM xlefz_mailer_Contacts);
Øystein Grøvlen
  • 1,266
  • 6
  • 8