3

I'm wondering if anyone has a good regex to match email addresses, plus the common ways to obfuscate them, e.g. "joe [at] foo [dot] com". I'm not looking for a super regex that's completely RFC compliant. For example, the following is mostly good enough:

^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,6}$

I just need to tweak it for the most common ways to obfuscate email addresses. Yes, I know some people will outsmart it and find a way to obfuscate their email addresses that the regex won't match, but I'm not worried about those situations.
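For reference, here is a minimal Python sketch of how the baseline pattern behaves. The names `PLAIN_EMAIL` and `is_plain_email` are made up for illustration, and `re.IGNORECASE` is an assumption on my part: the character classes list only uppercase letters, so the pattern needs a case-insensitive flag to accept lowercase input.

```python
import re

# Baseline pattern from the question. The classes are uppercase-only,
# so re.IGNORECASE is assumed here to accept lowercase addresses.
PLAIN_EMAIL = re.compile(r"^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,6}$", re.IGNORECASE)


def is_plain_email(candidate: str) -> bool:
    """True if the whole string looks like a plain, unobfuscated address."""
    return PLAIN_EMAIL.match(candidate) is not None


if __name__ == "__main__":
    for sample in ["joe@foo.com", "joe [at] foo [dot] com"]:
        print(sample, "->", is_plain_email(sample))
```

The obfuscated form is rejected, which is the gap the question is about.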

Edit: Please read the whole question. I'm not asking about validating email addresses. I know there are thousands of posts on the web about that. I'm specifically looking for a way to detect obfuscated email addresses.

mellowsoon

4 Answers

4

How about something along the lines of this:

 ^[A-Z0-9\._%+-]+(@|\s*\[\s*at\s*\]\s*)[A-Z0-9\.-]+(\.|\s*\[\s*dot\s*\]\s*)[a-z]{2,6}$

Here's an example of it at work: http://regexr.com?2uh92

In short, it uses groups of alternatives at the @ and . delimiters, with the obfuscated forms wrapped in square brackets. You could easily use (\[|\() in place of the opening bracket (and (\]|\)) in place of the closing one) to accept parentheses as well, which would match something like hi_there (at) gmail (dot) com.
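A rough Python sketch of that parentheses variation, in case it helps. This is an illustration rather than the exact pattern from the regexr link: the names `OBFUSCATED_EMAIL` and `matches_obfuscated` are hypothetical, the bracket alternation is written as a character class, and `re.IGNORECASE` is assumed because the original mixes `[A-Z]` and `[a-z]`.

```python
import re

# Either "[" or "(" may open a delimiter and either "]" or ")" may close it.
OBFUSCATED_EMAIL = re.compile(
    r"^[A-Z0-9._%+-]+"                     # local part
    r"(?:@|\s*[\[(]\s*at\s*[\])]\s*)"      # "@", "[at]" or "(at)"
    r"[A-Z0-9.-]+"                         # domain name
    r"(?:\.|\s*[\[(]\s*dot\s*[\])]\s*)"    # ".", "[dot]" or "(dot)"
    r"[A-Z]{2,6}$",                        # top-level domain
    re.IGNORECASE,
)


def matches_obfuscated(candidate: str) -> bool:
    """True if the whole string is a plain or bracket-obfuscated address."""
    return OBFUSCATED_EMAIL.match(candidate) is not None


if __name__ == "__main__":
    for sample in ["joe@foo.com", "hi_there (at) gmail (dot) com", "joe at foo dot com"]:
        print(sample, "->", matches_obfuscated(sample))
```

Note that this does not require the brackets to pair up, so a mismatched form like "(at]" is accepted too, and the bare "at"/"dot" form without brackets is not matched.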

Nightfirecat
  • That's very close to what I'm using now. My concern is the possibility that a regex like this would be *too* greedy. – mellowsoon Aug 23 '11 at 18:05
  • You can apply the non-greedy modifier to any of the spaces (I simply thought it might be a bit more flexible that way), but otherwise, I don't know that it would be. – Nightfirecat Aug 23 '11 at 18:11
2

I took the original regex from @Nightfirecat and improved it a bit, since it couldn't match addresses such as these:

user @ domain.com

contact {@} guardian [dot] co [dot] uk

hello [[[@]]] jazzit (dot) hr

Here's the improved version of the regex:

[A-Z0-9\._%+-]+(\s*@\s*|\s*[\[|\{|\(]+\s*(at|@)\s*[\)|\}\]]+\s*)([A-Z0-9\.-]+(\.|\s*[\[|\{|\(]+\s*(dot|\.)\s*[\)|\}|\]]+\s*))+[a-z]{2,6}

Demo (or here - a non flash one)
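Here is a hedged Python sketch of the same idea, checked against the three examples above. It is not a character-for-character copy of the regex: inside a character class the `|` is a literal pipe rather than an alternation, so the sketch drops the pipes from the bracket classes. The names `OPEN`, `CLOSE`, `BRACKETED_EMAIL` and `find_emails` are invented for illustration, and `re.IGNORECASE` is assumed.

```python
import re

OPEN = r"[\[{(]+"    # one or more opening brackets: [ { (
CLOSE = r"[\]})]+"   # one or more closing brackets: ] } )

BRACKETED_EMAIL = re.compile(
    r"[A-Z0-9._%+-]+"                                                # local part
    rf"(?:\s*@\s*|\s*{OPEN}\s*(?:at|@)\s*{CLOSE}\s*)"                # "@" or bracketed at/@
    rf"(?:[A-Z0-9.-]+(?:\.|\s*{OPEN}\s*(?:dot|\.)\s*{CLOSE}\s*))+"   # labels plus dot forms
    r"[A-Z]{2,6}",                                                   # top-level domain
    re.IGNORECASE,
)


def find_emails(text: str) -> list[str]:
    """Return every substring of text that the pattern matches."""
    return [m.group(0) for m in BRACKETED_EMAIL.finditer(text)]


if __name__ == "__main__":
    for sample in [
        "user @ domain.com",
        "contact {@} guardian [dot] co [dot] uk",
        "hello [[[@]]] jazzit (dot) hr",
    ]:
        print(find_emails(sample))
```

As with the original, the nested quantifiers in the domain part can backtrack heavily on long inputs that almost match, so it may be worth limiting the length of the text you feed it.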

trincot
madjoe
  • Thanks! Modified it to match a few more cases `\b[A-Z0-9\._%+-]+([\[|\{|\(|\s*]*(at|@)[\s|\)|\}\]]*\s*)+[A-Z0-9\.-]+([A-Z0-9\.-]+(\.|\s*[\[|\{|\(]*\s*(dot|\.)\s*[\)|\}|\]]*\s*))[a-z]{2,6}\b` https://regexr.com/3oduk – Patric Apr 24 '18 at 07:24
1

This is based on Nightfirecat's answer. The following regex will match email addresses and common obfuscations in text:

[A-Z0-9\._%+-]+(?:\s*@\s*|\s*\[*\s*at\s*\]*\s*)+[A-Z0-9\.-]+(?:\s*\.\s*|\s*\[*\s*dot\s*\]*\s*)[a-z]{2,6}

This will find matches when any of the following are in strings of text:

obfuscated_emails = [
  "moo@doo.com",
  "m_oo@doo.co.uk",
  "moo @@ doo.com",
  "moo @ doo . com",
  "moo @ doo.com",
  "moo@doo . com",
  "moo@doo . co . uk",
  "moo@doo. co. uk",
  "m_oo @ doo.com",
  "moo [at] doo.com",
  "moo [at] doo . com",
  "moo [at] doo [dot] com",
  "m_oo [at] doo [dot] co [dot] uk",
  "moo at doo.com",
  "moo at doo . co . uk",
  "m_oo at doo . com",
  "moo at doo dot com"
]

If you only want to match a string that consists entirely of an email address, rather than find addresses inside longer text, just add "^" at the start and "$" at the end (or use \A and \z in Rails).
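To make the two modes concrete, here is a small Python sketch (my own illustration, not part of the Rails code this came from). `OBFUSCATED`, `contains_email` and `is_only_email` are hypothetical names, and `re.IGNORECASE` is assumed. `search` finds the pattern anywhere in the text, while `fullmatch` behaves roughly like wrapping the pattern in \A and \z.

```python
import re

# The pattern from this answer, unchanged apart from being split over two lines.
OBFUSCATED = re.compile(
    r"[A-Z0-9\._%+-]+(?:\s*@\s*|\s*\[*\s*at\s*\]*\s*)+"
    r"[A-Z0-9\.-]+(?:\s*\.\s*|\s*\[*\s*dot\s*\]*\s*)[a-z]{2,6}",
    re.IGNORECASE,
)


def contains_email(text: str) -> bool:
    """True if the pattern matches anywhere inside text."""
    return OBFUSCATED.search(text) is not None


def is_only_email(text: str) -> bool:
    """True only if the entire string matches the pattern."""
    return OBFUSCATED.fullmatch(text) is not None


if __name__ == "__main__":
    samples = [
        "moo@doo.com",
        "moo [at] doo [dot] com",
        "moo at doo dot com",
        "Contact moo@doo.com today",
    ]
    for sample in samples:
        print(sample, "->", contains_email(sample), is_only_email(sample))
```

Because the brackets are optional and "at" is matched without word boundaries, the unanchored form can also fire on ordinary prose that happens to contain "at" followed by a period, so treat a hit as a hint rather than proof.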

I am using this to make sure that users do not put email addresses in text where it does not belong (or warn them when they do). They are prompted to enter it elsewhere.

raveren
avjaarsveld
  • Your regex seems to be the best one, but it also catches some false positives, example: "word wordWithAtintheend. anotherword" – Tiago Brito Oct 25 '17 at 09:43
  • it is very nice, but it fails on this case: `Email him directly at firstname.lastname@companyname.com` – deweydb Apr 24 '23 at 13:53
-2

As this answer explains, the correct pattern for detecting a valid mail address per the RFC 5322 specification is:

#!/usr/bin/env perl
use v5.10;

$rfc5322 = qr{

   (?(DEFINE)

     (?<address>         (?&mailbox) | (?&group))
     (?<mailbox>         (?&name_addr) | (?&addr_spec))
     (?<name_addr>       (?&display_name)? (?&angle_addr))
     (?<angle_addr>      (?&CFWS)? < (?&addr_spec) > (?&CFWS)?)
     (?<group>           (?&display_name) : (?:(?&mailbox_list) | (?&CFWS))? ; (?&CFWS)?)
     (?<display_name>    (?&phrase))
     (?<mailbox_list>    (?&mailbox) (?: , (?&mailbox))*)

     (?<addr_spec>       (?&local_part) \@ (?&domain))
     (?<local_part>      (?&dot_atom) | (?&quoted_string))
     (?<domain>          (?&dot_atom) | (?&domain_literal))
     (?<domain_literal>  (?&CFWS)? \[ (?: (?&FWS)? (?&dcontent))* (?&FWS)?
                                   \] (?&CFWS)?)
     (?<dcontent>        (?&dtext) | (?&quoted_pair))
     (?<dtext>           (?&NO_WS_CTL) | [\x21-\x5a\x5e-\x7e])

     (?<atext>           (?&ALPHA) | (?&DIGIT) | [!#\$%&'*+\-/=?^_`{|}~])
     (?<atom>            (?&CFWS)? (?&atext)+ (?&CFWS)?)
     (?<dot_atom>        (?&CFWS)? (?&dot_atom_text) (?&CFWS)?)
     (?<dot_atom_text>   (?&atext)+ (?: \. (?&atext)+)*)

     (?<text>            [\x01-\x09\x0b\x0c\x0e-\x7f])
     (?<quoted_pair>     \\ (?&text))

     (?<qtext>           (?&NO_WS_CTL) | [\x21\x23-\x5b\x5d-\x7e])
     (?<qcontent>        (?&qtext) | (?&quoted_pair))
     (?<quoted_string>   (?&CFWS)? (?&DQUOTE) (?:(?&FWS)? (?&qcontent))*
                          (?&FWS)? (?&DQUOTE) (?&CFWS)?)

     (?<word>            (?&atom) | (?&quoted_string))
     (?<phrase>          (?&word)+)

     # Folding white space
     (?<FWS>             (?: (?&WSP)* (?&CRLF))? (?&WSP)+)
     (?<ctext>           (?&NO_WS_CTL) | [\x21-\x27\x2a-\x5b\x5d-\x7e])
     (?<ccontent>        (?&ctext) | (?&quoted_pair) | (?&comment))
     (?<comment>         \( (?: (?&FWS)? (?&ccontent))* (?&FWS)? \) )
     (?<CFWS>            (?: (?&FWS)? (?&comment))*
                         (?: (?:(?&FWS)? (?&comment)) | (?&FWS)))

     # No whitespace control
     (?<NO_WS_CTL>       [\x01-\x08\x0b\x0c\x0e-\x1f\x7f])

     (?<ALPHA>           [A-Za-z])
     (?<DIGIT>           [0-9])
     (?<CRLF>            \x0d \x0a)
     (?<DQUOTE>          ")
     (?<WSP>             [\x20\x09])
   )

   (?&address)

}x;

The Sticky Bit

Note that the (?&comment) production is fully recursive, per the RFC 5322 specification. If you are using a toy regex engine that cannot handle recursion in patterns, then you will not be able to write a regex that correctly matches RFC 5322 mail addresses per the specification.
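To illustrate why recursion matters, here is a simplified Python sketch. Python's built-in `re` module has no recursive patterns, so nested comments have to be handled with a hand-written scanner instead. `skip_comment` is a hypothetical helper, not part of any library, and it only tracks parenthesis depth and backslash escapes; it does not validate the characters allowed inside a comment.

```python
def skip_comment(text: str, start: int = 0) -> int:
    """Return the index just past the balanced comment that begins at
    text[start], or -1 if no balanced comment starts there."""
    if start >= len(text) or text[start] != "(":
        return -1
    depth = 0
    i = start
    while i < len(text):
        ch = text[i]
        if ch == "\\":
            i += 2          # quoted-pair: skip the backslash and the escaped character
            continue
        if ch == "(":
            depth += 1      # entering a (possibly nested) comment
        elif ch == ")":
            depth -= 1      # leaving one nesting level
            if depth == 0:
                return i + 1
        i += 1
    return -1               # ran out of input before the comment closed


if __name__ == "__main__":
    sample = "(a (nested) comment) rest"
    end = skip_comment(sample)
    print(repr(sample[:end]), repr(sample[end:]))
```

The depth counter plays the role that the recursive (?&comment) call plays in the Perl pattern: a non-recursive regex has no way to remember how many levels deep it is.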

Community
tchrist