1

Validating e-mail IDs according to RFC 5322 and the following:

https://en.wikipedia.org/wiki/Email_address

Below is sample Java code that uses a regular expression to validate e-mail IDs.

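import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public void checkValid() {
    List<String> emails = new ArrayList<>();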
    //Valid Email Ids
    emails.add("simple@example.com");
    emails.add("very.common@example.com");                   
    emails.add("disposable.style.email.with+symbol@example.com");
    emails.add("other.email-with-hyphen@example.com");
    emails.add("fully-qualified-domain@example.com");
    emails.add("user.name+tag+sorting@example.com");
    emails.add("fully-qualified-domain@example.com");
    emails.add("x@example.com");
    emails.add("carlosd'intino@arnet.com.ar");
    emails.add("example-indeed@strange-example.com");
    emails.add("admin@mailserver1");
    emails.add("example@s.example");
    emails.add("\" \"@example.org");
    emails.add("\"john..doe\"@example.org");

    //Invalid email Ids
    emails.add("Abc.example.com");
    emails.add("A@b@c@example.com");
    emails.add("a\"b(c)d,e:f;g<h>i[j\\k]l@example.com");
    emails.add("just\"not\"right@example.com");
    emails.add("this is\"not\\allowed@example.com");
    emails.add("this\\ still\"not\\allowed@example.com");
    emails.add("1234567890123456789012345678901234567890123456789012345678901234+x@example.com");
    emails.add("john..doe@example.com");
    emails.add("john.doe@example..com");

    String regex = "^[a-zA-Z0-9_!#$%&'*+/=? \\\"`{|}~^.-]+@[a-zA-Z0-9.-]+$";

    Pattern pattern = Pattern.compile(regex);
    int i=0;
    for(String email : emails){
        Matcher matcher = pattern.matcher(email);
        System.out.println(++i +"."+email +" : "+ matcher.matches());
    }
}

Actual Output:

   1.simple@example.com : true
   2.very.common@example.com : true
   3.disposable.style.email.with+symbol@example.com : true
   4.other.email-with-hyphen@example.com : true
   5.fully-qualified-domain@example.com : true
   6.user.name+tag+sorting@example.com : true
   7.fully-qualified-domain@example.com : true
   8.x@example.com : true
   9.carlosd'intino@arnet.com.ar : true
   10.example-indeed@strange-example.com : true
   11.admin@mailserver1 : true
   12.example@s.example : true
   13." "@example.org : true
   14."john..doe"@example.org : true
   15.Abc.example.com : false
   16.A@b@c@example.com : false
   17.a"b(c)d,e:f;g<h>i[j\k]l@example.com : false
   18.just"not"right@example.com : true
   19.this is"not\allowed@example.com : false
   20.this\ still"not\allowed@example.com : false
   21.1234567890123456789012345678901234567890123456789012345678901234+x@example.com : true
   22.john..doe@example.com : true
   23.john.doe@example..com : true

Expected Output:

1.simple@example.com : true
2.very.common@example.com : true
3.disposable.style.email.with+symbol@example.com : true
4.other.email-with-hyphen@example.com : true
5.fully-qualified-domain@example.com : true
6.user.name+tag+sorting@example.com : true
7.fully-qualified-domain@example.com : true
8.x@example.com : true
9.carlosd'intino@arnet.com.ar : true
10.example-indeed@strange-example.com : true
11.admin@mailserver1 : true
12.example@s.example : true
13." "@example.org : true
14."john..doe"@example.org : true
15.Abc.example.com : false
16.A@b@c@example.com : false
17.a"b(c)d,e:f;g<h>i[j\k]l@example.com : false
18.just"not"right@example.com : false
19.this is"not\allowed@example.com : false
20.this\ still"not\allowed@example.com : false
21.1234567890123456789012345678901234567890123456789012345678901234+x@example.com : false
22.john..doe@example.com : false
23.john.doe@example..com : false

How can I change my regular expression so that it rejects the following e-mail IDs?

1234567890123456789012345678901234567890123456789012345678901234+x@example.com
john..doe@example.com
john.doe@example..com 
just"not"right@example.com

Below are the criteria for regular expression:

Local-part

The local-part of the email address may use any of these ASCII characters:

  1. uppercase and lowercase Latin letters A to Z and a to z;
  2. digits 0 to 9;
  3. special characters !#$%&'*+-/=?^_`{|}~
  4. dot ., provided that it is not the first or last character unless quoted, and provided also that it does not appear consecutively unless quoted (e.g. John..Doe@example.com is not allowed but "John..Doe"@example.com is allowed);
  5. space and "(),:;<>@[\] characters are allowed with restrictions (they are only allowed inside a quoted string, as described in the paragraph below, and in addition, a backslash or double-quote must be preceded by a backslash); comments are allowed with parentheses at either end of the local-part; e.g. john.smith(comment)@example.com and (comment)john.smith@example.com are both equivalent to john.smith@example.com.
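
Rules 1 to 4 for an unquoted local-part can also be checked without one large regular expression. The following is only a minimal sketch, not a full RFC 5322 parser: the class and method names are hypothetical, quoted strings and comments (rule 5) are not handled, and the 64-character cap is an assumption taken from case 21 of the expected output.

```java
import java.util.regex.Pattern;

public class LocalPartCheck {
    // Rules 1-3: letters, digits and the listed special characters (the dot is handled separately).
    private static final Pattern ATOM = Pattern.compile("[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]+");

    // Hypothetical helper: validates an unquoted local-part only.
    static boolean isValidUnquotedLocalPart(String local) {
        // Assumed limit: at most 64 characters (see case 21 in the expected output).
        if (local.isEmpty() || local.length() > 64) {
            return false;
        }
        // Rule 4: a limit of -1 keeps empty pieces, so a leading, trailing or doubled dot
        // produces an empty atom, which ATOM rejects.
        for (String atom : local.split("\\.", -1)) {
            if (!ATOM.matcher(atom).matches()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String[] samples = {
            "very.common", "user.name+tag+sorting", "john..doe", ".john", "john.", "just\"not\"right"
        };
        for (String sample : samples) {
            System.out.println(sample + " : " + isValidUnquotedLocalPart(sample));
        }
    }
}
```

Splitting on the dot keeps the consecutive-dot rule out of the character class, which is where the regular expression in the question loses track of it.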

Domain

  1. uppercase and lowercase Latin letters A to Z and a to z;
  2. digits 0 to 9, provided that top-level domain names are not all-numeric;
  3. hyphen -, provided that it is not the first or last character. Comments are allowed in the domain as well as in the local-part; for example, john.smith@(comment)example.com and john.smith@example.com(comment) are equivalent to john.smith@example.com.
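
The domain rules above can be sketched in the same way. This is an illustrative sketch under stated assumptions: the class and method names are hypothetical, the hyphen rule is applied to each dot-separated label, the 63-character label limit is the usual DNS limit rather than something listed above, and comments and IP literals are not handled.

```java
import java.util.regex.Pattern;

public class DomainCheck {
    // One label: letters, digits and hyphens, with no hyphen at either end (rules 1-3).
    private static final Pattern LABEL =
        Pattern.compile("[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?");
    private static final Pattern ALL_DIGITS = Pattern.compile("[0-9]+");

    // Hypothetical helper: validates a plain host name such as example.com or mailserver1.
    static boolean isValidDomain(String domain) {
        if (domain.isEmpty()) {
            return false;
        }
        // A limit of -1 keeps empty pieces, so "example..com" yields an empty label.
        String[] labels = domain.split("\\.", -1);
        for (String label : labels) {
            // Assumed limit: a label is at most 63 characters long.
            if (label.length() > 63 || !LABEL.matcher(label).matches()) {
                return false;
            }
        }
        // Rule 2: the top-level domain must not be all-numeric.
        String topLevel = labels[labels.length - 1];
        return !ALL_DIGITS.matcher(topLevel).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"example.com", "mailserver1", "example..com", "-example.com", "example.123"};
        for (String sample : samples) {
            System.out.println(sample + " : " + isValidDomain(sample));
        }
    }
}
```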
Mihir
  • https://stackoverflow.com/questions/13992403/regex-validation-of-email-addresses-according-to-rfc5321-rfc5322 – assylias Nov 14 '18 at 11:41
  • @assylias The regex given in the above link is specific to another language. I have also tried it, but it is not working. – Mihir Nov 14 '18 at 11:43
  • And more importantly: https://stackoverflow.com/a/1903368/829571 (read the introductory comment) – assylias Nov 14 '18 at 11:43
  • Even better: https://stackoverflow.com/questions/624581/what-is-the-best-java-email-address-validation-method – Sofo Gial Nov 14 '18 at 11:43
  • @assylias I understand that it is recommended to use the `MailAddress` class, but I need a regex to validate the email ID. The regex shared in that post is specific to Perl and is also not working for me. – Mihir Nov 14 '18 at 11:50
  • @SofoGial Using ESAPI validator but the regular expression used by ESAPI is not according to RFC5322 and https://en.wikipedia.org/wiki/Email_address#cite_note-rfc3696-8 – Mihir Nov 14 '18 at 11:52
  • What are the criteria here? You should write them in your post, instead of providing external links. – vrintle Nov 17 '18 at 08:07
  • @Mihir: Those four emails that you want to invalidate, can you give the logic why they should not be considered valid? This will help me design a precise regex that works for you. – Pushpesh Kumar Rajwanshi Nov 17 '18 at 17:20
  • @PushpeshKumarRajwanshi Added the criteria to the post – Mihir Nov 19 '18 at 08:41
  • @rv7 Added the criteria to the post. – Mihir Nov 19 '18 at 08:42
  • There is a confusion in your 11th example. It doesn't have any `.something` extension, however it is true. So, do you want `.something` to be optional? – vrintle Nov 19 '18 at 09:05
  • @rv7 Yes `.something` is optional. – Mihir Nov 19 '18 at 09:33
  • **⚠️** Warning! [Email Address Internationalization (EAI)](https://en.wikipedia.org/wiki/Email_address#Internationalization) is coming. Expect this to get far more complicated (though you can roughly just expand validators to accept any non-ASCII character anywhere). – Adam Katz Nov 19 '18 at 21:18
  • how the cases 18 and 21 are false? – vrintle Nov 20 '18 at 03:20
  • @rv7 18 is false because quoted strings must be dot separated or space separated example `" "@example.org ` and `"john..doe"@example.org` . 21 is false because local part is longer than 64 characters. – Mihir Nov 20 '18 at 07:00

3 Answers

4

It's not the question you asked, but why re-invent the wheel?

Apache commons has a class that covers this already.

org.apache.commons.validator.routines.EmailValidator.getInstance().isValid(email)

This way you aren't responsible for keeping up to date with changing email format standards.

Jakg
  • Apache EmailValidator provides email address validation according to RFC 822 standards. The question here is regarding RFC 5322. – Luixv May 18 '20 at 21:38
  • [JMail](https://github.com/RohanNagar/jmail) is a new library that is compliant with RFC 5322 and is faster and more correct than Apache Commons. Plus, it is customizable so you can treat addresses with domain literals (like user@localhost) as invalid. – Rohan Jun 01 '21 at 23:22
3

A regular expression is the most difficult and error-prone way to validate email addresses. If you are using an implementation of javax.mail to send the emails, then the simplest way to determine whether an address will work is to use the provided parser: whether or not the address is compliant, if the library cannot use it, compliance doesn't matter.

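import javax.mail.internet.InternetAddress;

public static boolean validateEmail(String address) {
    try {
        // if this fails, the mail library can't send emails to this address
        InternetAddress ia = new InternetAddress(address, true);
        // expect a single mailbox: reject group addresses and route addresses starting with '@'
        return !ia.isGroup() && ia.getAddress().charAt(0) != '@';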
    }
    catch (Throwable t) {
        return false;
    }
}

Invoking it with false allows emails without a @domain part when strict parsing. And since the checkAddress function invoked internally is private and we can't just call checkAddress(addr,false,true) since we don't want routing information (a feature practically designed for fraud through server bouncing), we have to check the first letter of the validated address.

You may notice that this validation method is actually compliant with RFC 2822 rather than RFC 5322. The reason is that unless you are implementing your own SMTP sender library, you are using one that depends on this one, and if an address is 5322-valid but 2822-invalid, your 5322 validation is rendered useless. If you are implementing your own 5322 SMTP library, you should learn from the existing ones and write a parser function rather than a regular expression.

coladict
  • isGroup() returns false for all the positive test cases – ch1ll Aug 23 '22 at 08:08
  • @ch1ll I may have missed an exclamation point there in some edit. I roughly remember that it should not be a group. I might edit, but this is the basic idea. – coladict Aug 23 '22 at 10:55
3

You could validate against RFC 5322 like this
(reference regex, modified):

"(?im)^(?=.{1,64}@)(?:(\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"@)|((?:[0-9a-z](?:\\.(?!\\.)|[-!#\\$%&'\\*\\+/=\\?\\^`\\{\\}\\|~\\w])*)?[0-9a-z]@))(?=.{1,255}$)(?:(\\[(?:\\d{1,3}\\.){3}\\d{1,3}\\])|((?:(?=.{1,63}\\.)[0-9a-z][-\\w]*[0-9a-z]*\\.)+[a-z0-9][\\-a-z0-9]{0,22}[a-z0-9])|((?=.{1,63}$)[0-9a-z][-\\w]*))$"  

https://regex101.com/r/ObS3QZ/1

 # (?im)^(?=.{1,64}@)(?:("[^"\\]*(?:\\.[^"\\]*)*"@)|((?:[0-9a-z](?:\.(?!\.)|[-!#\$%&'\*\+/=\?\^`\{\}\|~\w])*)?[0-9a-z]@))(?=.{1,255}$)(?:(\[(?:\d{1,3}\.){3}\d{1,3}\])|((?:(?=.{1,63}\.)[0-9a-z][-\w]*[0-9a-z]*\.)+[a-z0-9][\-a-z0-9]{0,22}[a-z0-9])|((?=.{1,63}$)[0-9a-z][-\w]*))$

 # Note - remove all comments '(comments)' before running this regex
 # Find  \([^)]*\)  replace with nothing

 (?im)                                     # Case insensitive, multiline
 ^                                         # BOS

                                           # Local part
 (?= .{1,64} @ )                           # 64 max chars
 (?:
      (                                         # (1 start), Quoted
           " [^"\\]* 
           (?: \\ . [^"\\]* )*
           "
           @
      )                                         # (1 end)
   |                                          # or, 
      (                                         # (2 start), Non-quoted
           (?:
                [0-9a-z] 
                (?:
                     \.
                     (?! \. )
                  |                                          # or, 
                     [-!#\$%&'\*\+/=\?\^`\{\}\|~\w] 
                )*
           )?
           [0-9a-z] 
           @
      )                                         # (2 end)
 )
                                           # Domain part
 (?= .{1,255} $ )                          # 255 max chars
 (?:
      (                                         # (3 start), IP
           \[
           (?: \d{1,3} \. ){3}
           \d{1,3} \]
      )                                         # (3 end)
   |                                          # or,   
      (                                         # (4 start), Others
           (?:                                       # Labels (63 max chars each)
                (?= .{1,63} \. )
                [0-9a-z] [-\w]* [0-9a-z]* 
                \.
           )+
           [a-z0-9] [\-a-z0-9]{0,22} [a-z0-9] 
      )                                         # (4 end)
   |                                          # or,
      (                                         # (5 start), Localdomain
           (?= .{1,63} $ )
           [0-9a-z] [-\w]* 
      )                                         # (5 end)
 )
 $                                         # EOS
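
To try the pattern from Java, the two steps described above (strip comments, then match) can be sketched as follows. The class name, method name and the split into two string constants are assumptions made for readability; the pattern text itself is copied unchanged from the Java string above. Note that the comment-stripping step is naive and would also remove parentheses inside a quoted local-part.

```java
import java.util.regex.Pattern;

public class Rfc5322RegexDemo {
    // The regex from this answer, copied unchanged and split in two for readability.
    private static final String LOCAL_PART =
        "(?im)^(?=.{1,64}@)(?:(\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"@)"
        + "|((?:[0-9a-z](?:\\.(?!\\.)|[-!#\\$%&'\\*\\+/=\\?\\^`\\{\\}\\|~\\w])*)?[0-9a-z]@))";
    private static final String DOMAIN =
        "(?=.{1,255}$)(?:(\\[(?:\\d{1,3}\\.){3}\\d{1,3}\\])"
        + "|((?:(?=.{1,63}\\.)[0-9a-z][-\\w]*[0-9a-z]*\\.)+[a-z0-9][\\-a-z0-9]{0,22}[a-z0-9])"
        + "|((?=.{1,63}$)[0-9a-z][-\\w]*))$";
    private static final Pattern EMAIL = Pattern.compile(LOCAL_PART + DOMAIN);

    // Find \([^)]*\) and replace with nothing, as the note above suggests.
    private static final Pattern COMMENT = Pattern.compile("\\([^)]*\\)");

    // Hypothetical helper: strips comments first, then requires a full match.
    static boolean isValid(String address) {
        String withoutComments = COMMENT.matcher(address).replaceAll("");
        return EMAIL.matcher(withoutComments).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "simple@example.com",
            "admin@mailserver1",
            "\"john..doe\"@example.org",
            "john.smith(comment)@example.com",
            "john..doe@example.com",
            "john.doe@example..com",
            "just\"not\"right@example.com"
        };
        for (String sample : samples) {
            System.out.println(sample + " : " + isValid(sample));
        }
    }
}
```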

How can I make sudhansu_@gmail.com a valid email ID? – Mihir Feb 7 at 9:34

I think the spec wants the local part either to be enclosed in quotes
or to start and end with [0-9a-z].

But to get around the latter and make sudhansu_@gmail.com valid, just
replace group 2 with this:

      (                             # (2 start), Non-quoted
           [0-9a-z] 
           (?:
                \.
                (?! \. )
             |                              # or, 
                [-!#\$%&'\*\+/=\?\^`\{\}\|~\w] 
           )*
           @

      )                             # (2 end)

New regex

"(?im)^(?=.{1,64}@)(?:(\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"@)|([0-9a-z](?:\\.(?!\\.)|[-!#\\$%&'\\*\\+/=\\?\\^`\\{\\}\\|~\\w])*@))(?=.{1,255}$)(?:(\\[(?:\\d{1,3}\\.){3}\\d{1,3}\\])|((?:(?=.{1,63}\\.)[0-9a-z][-\\w]*[0-9a-z]*\\.)+[a-z0-9][\\-a-z0-9]{0,22}[a-z0-9])|((?=.{1,63}$)[0-9a-z][-\\w]*))$"

New demo

https://regex101.com/r/ObS3QZ/5
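
A quick Java check of the new regex is sketched below. The class and method names are hypothetical and the pattern is split into two constants only for readability; the pattern text is copied unchanged from the string above. One side effect is worth noting: because the closing [0-9a-z] is gone from group 2, a local part that ends in a dot, such as john.@example.com, now matches as well.

```java
import java.util.regex.Pattern;

public class RelaxedLocalPartDemo {
    // The "New regex" from this answer, copied unchanged and split in two for readability.
    private static final String LOCAL_PART =
        "(?im)^(?=.{1,64}@)(?:(\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"@)"
        + "|([0-9a-z](?:\\.(?!\\.)|[-!#\\$%&'\\*\\+/=\\?\\^`\\{\\}\\|~\\w])*@))";
    private static final String DOMAIN =
        "(?=.{1,255}$)(?:(\\[(?:\\d{1,3}\\.){3}\\d{1,3}\\])"
        + "|((?:(?=.{1,63}\\.)[0-9a-z][-\\w]*[0-9a-z]*\\.)+[a-z0-9][\\-a-z0-9]{0,22}[a-z0-9])"
        + "|((?=.{1,63}$)[0-9a-z][-\\w]*))$";
    private static final Pattern EMAIL = Pattern.compile(LOCAL_PART + DOMAIN);

    // Hypothetical helper: true when the whole address matches the new regex.
    static boolean isValid(String address) {
        return EMAIL.matcher(address).matches();
    }

    public static void main(String[] args) {
        // The address from the follow-up comment, a trailing-dot case, and a doubled-dot case.
        String[] samples = {"sudhansu_@gmail.com", "john.@example.com", "john..doe@example.com"};
        for (String sample : samples) {
            System.out.println(sample + " : " + isValid(sample));
        }
    }
}
```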