MySQL Matching "Same" Emails

Question

I have a table with 2 columns email and id. I need to find emails that are closely related. For example:

john.smith12@example.com

and

john.smith12@some.subdomains.example.com

These should be considered the same because the username (john.smith12) and the most top level domain (example.com) are the same. They are currently 2 different rows in my table. ~~I've written the below expression which should do that comparison but it takes hours to execute (possibly/probably because of regex). Is there a better way to write this:~~

  select c1.email, c2.email 
  from table as c1
  join table as c2
   on (
             c1.leadid <> c2.leadid 
        and 
             c1.email regexp replace(replace(c2.email, '.', '[.]'), '@', '@[^@]*'))

The explain of this query comes back as:

id, select_type, table, type, possible_keys, key, key_len, ref,  rows,   Extra
1,  SIMPLE,      c1,    ALL,   NULL,         NULL,  NULL,  NULL, 577532, NULL
1,  SIMPLE,      c2,    ALL,   NULL,         NULL,  NULL,  NULL, 577532, Using where; Using join buffer (Block Nested Loop)

The create table is:

CREATE TABLE `table` (
 `ID` int(11) NOT NULL AUTO_INCREMENT,
 `Email` varchar(100) DEFAULT NULL,
 KEY `Table_Email` (`Email`),
 KEY `Email` (`Email`)
) ENGINE=InnoDB AUTO_INCREMENT=667020 DEFAULT CHARSET=latin1

I guess the indices aren't being used because of the regexp.

The regex comes out as:

john[.]smith12@[^@]*example[.]com

which should match both addresses.

Update:

I've modified the on to be:

on (c1.email <> '' and c2.email <> '' and c1.leadid <> c2.leadid
    and substr(c1.email, 1, locate('@', c1.email) - 1) = substr(c2.email, 1, locate('@', c2.email) - 1)
    and substr(c1.email, locate('@', c1.email) + 1) like concat('%', substr(c2.email, locate('@', c2.email) + 1)))

and the explain with this approach is at least using the indices.

id, select_type, table, type, possible_keys, key, key_len, ref, rows, Extra
1, SIMPLE, c1, range, table_Email,Email, table_Email, 103, NULL, 288873, Using where; Using index
1, SIMPLE, c2, range, table_Email,Email, table_Email, 103, NULL, 288873, Using where; Using index; Using join buffer (Block Nested Loop)

So far this has executed for 5 minutes, will update if there is a vast improvement.

Update 2:

I've split the email so the username is a column and domain is a column. I've stored the domain in reverse order so the index of it can be used with a trailing wildcard.

CREATE TABLE `table` (
     `ID` int(11) NOT NULL AUTO_INCREMENT,
     `Email` varchar(100) DEFAULT NULL,
     `domain` varchar(100) CHARACTER SET utf8 DEFAULT NULL,
     `username` varchar(500) CHARACTER SET utf8 DEFAULT NULL,
     KEY `Table_Email` (`Email`),
     KEY `Email` (`Email`),
     KEY `domain` (`domain`)
    ) ENGINE=InnoDB AUTO_INCREMENT=667020 DEFAULT CHARSET=latin1

Query to populate new columns:

update table
set username = trim(SUBSTRING_INDEX(trim(email), '@', 1)), 
domain = reverse(trim(SUBSTRING_INDEX(SUBSTRING_INDEX(trim(email), '@', -1), '.', -3)));

New query:

select c1.email, c2.email, c2.domain, c1.domain, c1.username, c2.username, c1.leadid, c2.leadid
from table as c1
join table as c2
on (c1.email is not null and c2.email is not null and c1.leadid <> c2.leadid
    and c1.username = c2.username and c1.domain like concat(c2.domain, '%'))

New Explain Results:

1, SIMPLE, c1, ALL, table_Email,Email, NULL, NULL, NULL, 649173, Using where
1, SIMPLE, c2, ALL, table_Email,Email, NULL, NULL, NULL, 649173, Using where; Using join buffer (Block Nested Loop)

From that explain it looks like the domain index is not being used. I also tried to force its usage with USE INDEX, but that didn't work either; it resulted in no indices being used:

select c1.email, c2.email, c2.domain, c1.domain, c1.username, c2.username, c1.leadid, c2.leadid
from table as c1
USE INDEX (domain)
join table as c2
USE INDEX (domain)
on (c1.email is not null and c2.email is not null and c1.leadid <> c2.leadid
    and c1.username = c2.username and c1.domain like concat(c2.domain, '%'))

Explain with use:

1, SIMPLE, c1, ALL, NULL, NULL, NULL, NULL, 649173, Using where
1, SIMPLE, c2, ALL, NULL, NULL, NULL, NULL, 649173, Using where; Using join buffer (Block Nested Loop)

Related: https://stackoverflow.com/questions/12318083/mysql-optimization-for-regexp — Barmar, Jul 25 '18 at 19:32
`%` at the beginning of a `LIKE` pattern prevents it from using an index. You want the pattern to be `john.smith@%` — Barmar, Jul 25 '18 at 19:57
What makes you think those emails are "considered the same"? They are not. — , Jul 25 '18 at 20:00
String indexes are only useful when you're matching the beginning of the string. — Barmar, Jul 25 '18 at 20:01
@duskwuff For our purposes they are. If you have the same username and are at the same domain you are the same person. The domains we have vary structure wise and append departments before the top level. — user3783243, Jul 25 '18 at 20:01
Maybe you could use a generated column that holds a canonical version of the email. — Barmar, Jul 25 '18 at 20:02
Yes, something like `WHERE c1.email LIKE CONCAT(SUBSTR(c2.email, 1, LOCATE('@', c2.email)), '%') AND ...` — Barmar, Jul 25 '18 at 20:07

score 2 · Accepted Answer · answered Aug 08 '18 at 05:33

You told us that the table has 700K rows.

This is not much, but you are joining it to itself, so in the worst case the engine would have to process 700K * 700K = 490 000 000 000 = 490B rows.

An index can definitely help here.

The best index depends on the data distribution.

What does the following query return?

SELECT COUNT(DISTINCT username) 
FROM table

If the result is high, close to 700K or even around 100K, then there are a lot of different usernames and you'd better focus on them rather than on the domain. If the result is low, say 100, then indexing username is unlikely to be useful.
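
To look at the distribution in more detail than a single distinct count, a query along these lines might help. It is only a sketch: it assumes the `username` column from the question's Update 2 and the placeholder table name `table`, which is backticked here because TABLE is a reserved word.

```sql
-- Usernames that occur on more than one row. After the equality match on
-- username, these are the only rows the self-join still has to compare.
SELECT username, COUNT(*) AS row_count
FROM `table`
GROUP BY username
HAVING COUNT(*) > 1
ORDER BY row_count DESC
LIMIT 20;
```

If the largest `row_count` values are small, each probe into a username index leaves only a handful of rows for the `LIKE` comparison on domain.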

I hope that there are a lot of different usernames, so, I'd create an index on username, since the query joins on that column using simple equality comparison and this join would greatly benefit from this index.

Another option to consider is a composite index on (username, domain) or even covering index on (username, domain, leadid, email). The order of columns in the index definition is important.

I'd delete all other indexes, so that optimiser can't make another choice, unless there are other queries that may need them.

Most likely it won't hurt to define a primary key on the table as well.
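
The index suggestions above could be written roughly as follows. This is a sketch under assumptions, not tested against the real table: the index names are made up, `table` is the question's placeholder name, and the primary key statement only applies if `ID` is not already the primary key. Because `username` is `varchar(500)` in utf8, some InnoDB row formats may require a prefix length on that column.

```sql
-- Option 1: single-column index for the equality join on username.
ALTER TABLE `table` ADD INDEX idx_username (username);

-- Option 2 (alternative): composite index, with the equality column first
-- and the column used in the prefix LIKE second.
-- ALTER TABLE `table` ADD INDEX idx_username_domain (username, domain);

-- Drop the domain index if no other query needs it.
ALTER TABLE `table` DROP INDEX `domain`;

-- Only if the table has no primary key yet.
ALTER TABLE `table` ADD PRIMARY KEY (ID);
```

Running `EXPLAIN` on the self-join afterwards should show whether the optimiser picks the new index for `c2`.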

There is one more not so important thing to consider. Does your data really have NULLs? If not, define the columns as NOT NULL. Also, in many cases it is better to have empty strings, rather than NULLs, unless you have very specific requirements and you have to distinguish between NULL and ''.

The query would be slightly simpler:

select 
    c1.email, c2.email, 
    c1.domain, c2.domain, 
    c1.username, c2.username, 
    c1.leadid, c2.leadid
from 
    table as c1
    join table as c2
        on  c1.username = c2.username 
        and c1.domain like concat(c2.domain, '%')
        and c1.leadid <> c2.leadid

There are 560739 distinct records. Indexing the username and dropping the domain index seems to have done the trick. Currently, yes there are NULLs, I think converting to empty strings makes more sense though so I also will do that. Thanks. I'm going to run some more tests just to confirm this works and then will accept. — user3783243, Aug 08 '18 at 12:26
This works for me. Why does having the index on the domain hurt the query? I would have thought having an index on the wildcarded column would help it. — user3783243, Aug 08 '18 at 19:13
@user3783243, it is not that index on `domain` hurts the query; it doesn't help. Maybe MySQL is not smart enough to use index when joining with `LIKE`. It is much easier to use index when joining with `=`. Also, data distribution tells us that it is a good option. Since there are 560K distinct usernames out of 700K rows, it means that only few usernames have more than one domain, i.e. one more row to check after the `=` comparison found the matching username in the self-join. — Vladimir Baranov, Aug 08 '18 at 23:27

score 1 · Answer 2 · answered Jul 25 '18 at 21:13

No REGEXP_REPLACE needed, so it will work in all versions of MySQL/MariaDB:

UPDATE tbl
    SET email = CONCAT(SUBSTRING_INDEX(email, '@', 1),
                       '@',
                       SUBSTRING_INDEX(
                           SUBSTRING_INDEX(email, '@', -1),
                           '.',
                           -2));

Since no index is useful, you may as well not bother with a WHERE clause.
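
Before overwriting the column, the expression can be tried on literal values. The following is a small illustration using the two addresses from the question; the alias names `e`, `samples` and `canonical` are made up for this example.

```sql
-- Preview the canonical form without touching any table.
SELECT e AS email,
       CONCAT(SUBSTRING_INDEX(e, '@', 1), '@',
              SUBSTRING_INDEX(SUBSTRING_INDEX(e, '@', -1), '.', -2)) AS canonical
FROM (
    SELECT 'john.smith12@example.com' AS e
    UNION ALL
    SELECT 'john.smith12@some.subdomains.example.com'
) AS samples;
-- Both rows yield canonical = 'john.smith12@example.com'.
```

Note that keeping only the last two labels would collapse a domain such as `example.co.uk` to `co.uk`, so this is only suitable if the data contains no such suffixes.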

Rick James


This makes the table more organized but the `select` still takes a long time and the index doesn't appear to be used. I killed the latest select (update 2) after +66,000 seconds of execution time. – user3783243 Aug 07 '18 at 14:32

Anthony BONNIER · Answer 3 · 2018-08-08T10:56:12.470

If you are searching for related data, you should have a look at data mining tools or Elasticsearch, for instance, which are built for this kind of matching.

I have another possible "database-only" solution, but I don't know whether it would work or whether it would be the best solution. If I had to do this, I would try to build a table of "word references", filled by splitting all emails on every non-alphanumeric character.

In your example, this table would be filled with: john, smith12, some, subdomains, example and com, each word with a unique id. Then another table, a union table, would link each email to its own words. Indexes would be needed on both tables.

To search for closely related emails, you would split the source email with a regex and loop over each sub-word (as the CONNECT BY queries in the example below do), then look each word up in the word references table, and then use the union table to find the emails that contain it.

On top of this query, you could add a select that groups the matches by email, counts the number of words each found email shares with the source, and keeps only the best-matching emails (excluding the source one, of course).

Sorry for this "not-sure answer", but it was too long for a comment. I'm going to try to make an example.

Here is an example with some data (written for Oracle; the same idea should carry over to MySQL, although CONNECT BY, VARCHAR2 and sequences are Oracle-specific and would need to be replaced):

---------------------------------------------
-- Table containing emails and people info
CREATE TABLE PEOPLE (
     ID NUMBER(11) PRIMARY KEY NOT NULL,
     EMAIL varchar2(100) DEFAULT NULL,
     USERNAME varchar2(500) DEFAULT NULL
);

-- Table containing word references
CREATE TABLE WORD_REF (
     ID number(11) NOT NULL PRIMARY KEY,
     WORD varchar2(20) DEFAULT NULL
);

-- Table containing the ids of both previous tables
CREATE TABLE UNION_TABLE (
     EMAIL_ID number(11) NOT NULL,
     WORD_ID number(11) NOT NULL,
     CONSTRAINT EMAIL_FK FOREIGN KEY (EMAIL_ID) REFERENCES PEOPLE (ID),
     CONSTRAINT WORD_FK FOREIGN KEY (WORD_ID) REFERENCES WORD_REF (ID)
);

-- Here is my oracle sequence to simulate the auto increment
CREATE SEQUENCE MY_SEQ
  MINVALUE 1
  MAXVALUE 999999
  START WITH 1
  INCREMENT BY 1
  CACHE 20;

---------------------------------------------
-- Some data in the people table
INSERT INTO PEOPLE (ID, EMAIL, USERNAME) VALUES (MY_SEQ.NEXTVAL, 'john.smith12@example.com', 'jsmith12');
INSERT INTO PEOPLE (ID, EMAIL, USERNAME) VALUES (MY_SEQ.NEXTVAL, 'john.smith12@some.subdomains.example.com', 'admin');
INSERT INTO PEOPLE (ID, EMAIL, USERNAME) VALUES (MY_SEQ.NEXTVAL, 'john.doe@another.domain.eu', 'jdo');
INSERT INTO PEOPLE (ID, EMAIL, USERNAME) VALUES (MY_SEQ.NEXTVAL, 'nathan.smith@example.domain.com', 'nsmith');
INSERT INTO PEOPLE (ID, EMAIL, USERNAME) VALUES (MY_SEQ.NEXTVAL, 'david.cayne@some.domain.st', 'davidcayne');
COMMIT;

-- Word reference data from the people data
INSERT INTO WORD_REF (ID, WORD) 
  (select MY_SEQ.NEXTVAL, WORD FROM
   (select distinct REGEXP_SUBSTR(EMAIL, '\w+',1,LEVEL) WORD
    from PEOPLE
    CONNECT BY REGEXP_SUBSTR(EMAIL, '\w+',1,LEVEL) IS NOT NULL
  ));
COMMIT;

-- Union table filling
INSERT INTO UNION_TABLE (EMAIL_ID, WORD_ID)
select words.ID EMAIL_ID, word_ref.ID WORD_ID
FROM 
(select distinct ID, REGEXP_SUBSTR(EMAIL, '\w+',1,LEVEL) WORD
 from PEOPLE
 CONNECT BY REGEXP_SUBSTR(EMAIL, '\w+',1,LEVEL) IS NOT NULL) words
left join WORD_REF on word_ref.word = words.WORD;
COMMIT;    

---------------------------------------------
-- Finally, the query which ranks the emails that match the source email 'john.smith12@example.com'
SELECT COUNT(1) email_match
      ,email
FROM   (SELECT word_ref.id
              ,words.word
              ,uni.email_id
              ,ppl.email
        FROM   (SELECT DISTINCT regexp_substr('john.smith12@example.com'
                                             ,'\w+'
                                             ,1
                                             ,LEVEL) word
                FROM   dual
                CONNECT BY regexp_substr('john.smith12@example.com'
                                        ,'\w+'
                                        ,1
                                        ,LEVEL) IS NOT NULL) words
        LEFT   JOIN word_ref
        ON     word_ref.word = words.word
        LEFT   JOIN union_table uni
        ON     uni.word_id = word_ref.id
        LEFT   JOIN people ppl
        ON     ppl.id = uni.email_id)
WHERE  email <> 'john.smith12@example.com'
GROUP  BY email
ORDER  BY email_match DESC;

The query results:

    4    john.smith12@some.subdomains.example.com
    2    nathan.smith@example.domain.com
    1    john.doe@another.domain.eu

Thorsten Kettner · Answer 4 · 2018-08-08T11:55:58.227

You get the name (i.e. the part before '@') with

substring_index(email, '@', 1)

You get the domain with

substring_index(replace(email, '@', '.'), '.', -2)

(because if we substitute the '@' with a dot, then it's always the part after the second-to-last dot).

Hence you find duplicates with

select *
from users
where exists
(
  select *
  from users other
  where other.id <> users.id
    and substring_index(other.email, '@', 1) = 
        substring_index(users.email, '@', 1)
    and substring_index(replace(other.email, '@', '.'), '.', -2) =
        substring_index(replace(users.email, '@', '.'), '.', -2)
);

If this is too slow, then you may want to create a computed column on the two combined and index it:

alter table users add main_email varchar(100) as 
  (concat(substring_index(email, '@', 1), '@', substring_index(replace(email, '@', '.'), '.', -2)));

create index idx on users(main_email);

select *
from users
where exists
(
  select *
  from users other
  where other.id <> users.id
    and other.main_email = users.main_email
);

Of course you can just as well have the two separated and index them:

alter table users add email_name varchar(100) as (substring_index(email, '@', 1));
alter table users add email_domain varchar(100) as (substring_index(replace(email, '@', '.'), '.', -2));

create index idx on users(email_name, email_domain);

select *
from users
where exists
(
  select *
  from users other
  where other.id <> users.id
    and other.email_name = users.email_name
    and other.email_domain = users.email_domain
);

And of course, if you allow for both upper and lower case in the email address column, you will also want to apply LOWER on it in above expressions (lower(email)).
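
As a rough illustration of that last point, the combined column could be lower-cased when it is generated. This is a sketch assuming MySQL 5.7 or later (generated columns) and the hypothetical `users` table used above; the column name `main_email_lc`, its length and the index name are made up. With a case-insensitive collation the comparison may already ignore case, in which case `LOWER` is redundant.

```sql
-- Stored generated column holding the lower-cased canonical address.
ALTER TABLE users
    ADD main_email_lc VARCHAR(100) AS (
        LOWER(CONCAT(SUBSTRING_INDEX(email, '@', 1), '@',
                     SUBSTRING_INDEX(REPLACE(email, '@', '.'), '.', -2)))
    ) STORED;

CREATE INDEX idx_main_email_lc ON users (main_email_lc);

-- Rows that share a canonical address with at least one other row.
SELECT *
FROM users
WHERE EXISTS
(
  SELECT *
  FROM users other
  WHERE other.id <> users.id
    AND other.main_email_lc = users.main_email_lc
);
```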
