I have a table with 2 columns email
and id
. I need to find emails that are closely related. For example:
john.smith12@example.com
and
john.smith12@some.subdomains.example.com
These should be considered the same because the username (john.smith12
) and the most top level domain (example.com
) are the same. They are currently 2 different rows in my table. I've written the below expression which should do that comparison but it takes hours to execute (possibly/probably because of regex). Is there a better way to write this:
select c1.email, c2.email
from table as c1
join table as c2
on (
c1.leadid <> c2.leadid
and
c1.email regexp replace(replace(c2.email, '.', '[.]'), '@', '@[^@]*'))
The explain of this query comes back as:
id, select_type, table, type, possible_keys, key, key_len, ref, rows, Extra
1, SIMPLE, c1, ALL, NULL, NULL, NULL, NULL, 577532, NULL
1, SIMPLE, c2, ALL, NULL, NULL, NULL, NULL, 577532, Using where; Using join buffer (Block Nested Loop)
The create table is:
CREATE TABLE `table` (
`ID` int(11) NOT NULL AUTO_INCREMENT,
`Email` varchar(100) DEFAULT NULL,
KEY `Table_Email` (`Email`),
KEY `Email` (`Email`)
) ENGINE=InnoDB AUTO_INCREMENT=667020 DEFAULT CHARSET=latin1
I guess the indices aren't being used because of the regexp.
The regex comes out as:
john[.]smith12@[^@]*example[.]com
which should match both addresses.
Update:
I've modified the on
to be:
on (c1.email <> '' and c2.email <> '' and c1.leadid <> c2.leadid and substr(c1. email, 1, (locate('@', c1.email) -1)) = substr(c2. email, 1, (locate('@', c2.email) -1))
and
substr(c1.email, locate('@', c1.email) + 1) like concat('%', substr(c2.email, locate('@', c2.email) + 1)))
and the explain
with this approach is at least using the indices.
id, select_type, table, type, possible_keys, key, key_len, ref, rows, Extra
1, SIMPLE, c1, range, table_Email,Email, table_Email, 103, NULL, 288873, Using where; Using index
1, SIMPLE, c2, range, table_Email,Email, table_Email, 103, NULL, 288873, Using where; Using index; Using join buffer (Block Nested Loop)
So far this has executed for 5 minutes, will update if there is a vast improvement.
Update 2:
I've split the email so the username is a column and domain is a column. I've stored the domain in reverse order so the index of it can be used with a trailing wildcard.
CREATE TABLE `table` (
`ID` int(11) NOT NULL AUTO_INCREMENT,
`Email` varchar(100) DEFAULT NULL,
`domain` varchar(100) CHARACTER SET utf8 DEFAULT NULL,
`username` varchar(500) CHARACTER SET utf8 DEFAULT NULL,
KEY `Table_Email` (`Email`),
KEY `Email` (`Email`),
KEY `domain` (`domain`)
) ENGINE=InnoDB AUTO_INCREMENT=667020 DEFAULT CHARSET=latin1
Query to populate new columns:
update table
set username = trim(SUBSTRING_INDEX(trim(email), '@', 1)),
domain = reverse(trim(SUBSTRING_INDEX(SUBSTRING_INDEX(trim(email), '@', -1), '.', -3)));
New query:
select c1.email, c2.email, c2.domain, c1.domain, c1.username, c2.username, c1.leadid, c2.leadid
from table as c1
join table as c2
on (c1.email is not null and c2.email is not null and c1.leadid <> c2.leadid
and c1.username = c2.username and c1.domain like concat(c2.domain, '%'))
New Explain Results:
1, SIMPLE, c1, ALL, table_Email,Email, NULL, NULL, NULL, 649173, Using where
1, SIMPLE, c2, ALL, table_Email,Email, NULL, NULL, NULL, 649173, Using where; Using join buffer (Block Nested Loop)
From that explain it looks like the domain
index is not being used. I also tried to force the usage with USE
but that also didn't work, that resulted in no indices being used:
select c1.email, c2.email, c2.domain, c1.domain, c1.username, c2.username, c1.leadid, c2.leadid
from table as c1
USE INDEX (domain)
join table as c2
USE INDEX (domain)
on (c1.email is not null and c2.email is not null and c1.leadid <> c2.leadid
and c1.username = c2.username and c1.domain like concat(c2.domain, '%'))
Explain with use
:
1, SIMPLE, c1, ALL, NULL, NULL, NULL, NULL, 649173, Using where
1, SIMPLE, c2, ALL, NULL, NULL, NULL, NULL, 649173, Using where; Using join buffer (Block Nested Loop)