1

I am trying to come up with some regular expression to check whether an e-mail exists in a database. What is more specific here is that I want to find e-mails that are the same, but might be written differently. E.g. john.doe@example.com is the same as johndode+123@example.com and j.o.h.n.d.o.e@example.com.

Using another script I remove all dots and text after + so that john.doe+123@example.com becomes johndoe@example.com. How can I use this "stripped" address to match its aliases in a MySQL database?

I have little experience with RegEx and I haven't really used it alongside SQL before. Anyway, I could only come up with the following code:

RegEx (matches all +<any text> and . before the @ symbol

(\+.*(?=@)|\.(?=.*@))

SQL

SELECT email FROM users WHERE email REGEXP '(\+.*(?=@)|\.(?=.*@))'
//or
SELECT * from users WHERE email REGEXP_LIKE('johndoe@example.com', '(\+.*(?=@)|\.(?=.*@))')

I get the following error for both:

#1064 - You have an error in your SQL syntax; check the manual that corresponds to your MariaDB server version for the right syntax to use near REGEXP_LIKE('johndoe@example.com', '(\+.*(?=@)|\.(?=.*@))')

What I am trying to do is select email from users where email = 'johndoe@example.com' but disregard these characters (\+.*(?=@)|\.(?=.*@)
(the code is executed as a prepared statement)

Any tip or advice would be highly appreciated!

user10398534
  • 145
  • 14
  • 2
    The dots in the email address *do matter*. `john.doe@example.com` and `johndoe@example.com` *could* be two different addresses. It depends on the mail provider - Gmail does make them the same but that's not necessary. So, it's not possible to conclude *generically* for just any email address that the dot is superfluous. It might very well differentiate two different people `Joe Smith` and `Jo E. Smith` are two distinct names but one could have the `joe.smith@example.com` email, the other `jo.e.smith@example.com` without clashes. – VLAZ Apr 18 '20 at 10:51
  • @VLAZ That is true. If I understand how to use this sort of RegEx in SQL, I think I'm going to make a separate check if the email is `@gmail.com` (in fact I already have this part of the e-mail separated in another variable) and only then disregard dots as well. – user10398534 Apr 18 '20 at 10:56
  • instead of building a complex regex remove the dots and numbers before the @ and compare that. but emailadresses you can get easy and use unicode so that it looks equal but couldn't be detected at least not with even mire detection alghorithms – nbk Apr 18 '20 at 11:02
  • @nbk If the user wants to be contacted at `john.doe+folder@example.com` I want to e-mail them at this address as they have typed it. Nevertheless, some e-mail providers will store e-mails sent to `+` in separate folders and many users use that to organise e-mails. The whole point of my question is to prevent people from using the same e-mail to register twice in my platform as there is a free trial for new accounts and many people are abusing it. – user10398534 Apr 18 '20 at 11:17
  • 1
    you can't as long there is no pattern to it. nbk0123 nbk1234 nbk1234456677 and nbk12233435456657 could be all the same persons, but how do you know it for sure , as i help different people and we seach for an emailadress at a provider everything is taken, that use the same name & everything else, it is quote ahrd to find a good email address that is not be taken. Only comapny habe a policy how to name especially when there are 20 john A. Smith in the company. so it is futile trying to identify single persons. – nbk Apr 18 '20 at 11:25
  • https://database.guide/how-the-regex_replace-function-works-in-mysql/ - maybe you can use this to remove the characters you want to ignore than do the match – Serban Gorcea Apr 18 '20 at 11:28
  • 1
    @nbk I don't think the intention is to be able to guess if two unrelated addresses are related, it's to detect variations of the same base address: `vlaz+1@example.com` and `vlaz+2@example.com` *are* (incredibly likely) the same base `vlaz@example.com` address but use the `+` to separate incoming mail. You're right that `vlaz1@example.com` and `vlaz2@example.com` might or might not belong to the same person and we cannot tell. Yet that's not the core of the query here. – VLAZ Apr 18 '20 at 11:36
  • I understand your idea, but you have to find a field pattern and use it to check. if vlaz comes only with numbers behind his name easy, if he uses birdsnames in 152 languages no chance. We had problems with spammers,and every spam campaign had its own set of rules, in time you can find most.of them. – nbk Apr 18 '20 at 11:50
  • With respect, what you're trying to do is a bad idea. Email standards say the interpretation of the local part of an address (the `mickey.mouse` in `mickey.mouse@disney.com`) is *entirely* up to the receiving mail server. Sure, many mail servers treat `mickey.mouse+steamboat_willie` as `mickey.mouse`. But nobody has promised all servers do that. I've tried what you're doing a couple of times, and always had to back it out to meet user needs. – O. Jones Apr 18 '20 at 13:21

1 Answers1

1

In MySQL, the simplest method might be breaking the problem into the username and domain:

where replace(substring_index(substring_index(email, '@', 1), '+', 1), '.', '') = substring_index('johndoe@example.com', '@', 1) and
      substring_index(email, '@', -1) = substring_index('johndoe@example.com', '@', -1) and
      email like '%@%' and
      email not like '%@%@%'
Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • This is exactly what I was looking for, but I must admit I do not quite understand what the function of the last two lines is. `email like '%@%' and email not like '%@%@%'` – user10398534 Apr 18 '20 at 13:15
  • 1
    Email addresses MUST contain one `@` and MUST NOT contain more than one. – O. Jones Apr 18 '20 at 13:17
  • @O. I have already verified that the e-mail is valid using various other methods. I am going to go ahead and skip the last 2 lines. – user10398534 Apr 18 '20 at 14:34
  • [Email addresses MUST contain one @ and MUST NOT contain more than one.] This doesn't check if there are 3 or more @'s. Maybe its better to do a check count @ < 2 – Ezani Aug 09 '21 at 08:51
  • @Ezani . . . Actually, it does. That is what the `not like` does. – Gordon Linoff Aug 09 '21 at 12:14