
I am using the string "\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b" as a regex to detect email addresses.

What is the best way to escape it for use in Java source code?

I've tried many variations, e.g.

\\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,4}\\b
\\\\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\\\\.[A-Z]{2,4}\\\\b

I am using the regex in the @Match annotation, so I don't think I can use StringEscapeUtils. The code is written in Java using the Play framework, but I imagine this is just a question of escaping Java strings.

    public static void signup(
            @Match(value = "\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b",
                   message = "Hey there, we need a real email address so we can send you an invite. Thanks :)") String email) {

        if (validation.hasErrors()) {
            params.flash(); // add http parameters to the flash scope
            validation.keep(); // keep the errors for the next request
            index();
        } else {
            Email mail = new Email();
            String[] to = {"myemail@me.com", "myemail@gmail.com"};
            mail.sendMessage(to, "beta signup", email);
            thanks();
        }
    }
Ankur
  • Since May 2010, email addresses can contain non-Latin characters like Greek, Arabic, Chinese, etc. I'd reconsider your regex attempt. http://stackoverflow.com/questions/201323/how-to-use-a-regular-expression-to-validate-an-email-addresses/1931322#1931322 – BalusC May 08 '12 at 04:32
  • If you're just wanting to play around, you could try reading it in from an external source; then Java should do all the escaping for you. – dann.dev May 08 '12 at 04:38
  • @dann.dev I don't understand. The email variable gets sent from a view to a controller. When you say external source ... how would I do that and do the escaping? Sorry it's just not clicking in my brain. – Ankur May 08 '12 at 05:17
  • You can configure Eclipse to automatically escape stuff you paste in strings. – Joey May 08 '12 at 07:16

3 Answers


Try this:

This regular expression implements the RFC 2822 syntax for email addresses and is meant to be used with case-insensitive matching. It could be useful for general purposes.

\b(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])\b

Explanation:

<!--
\b(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])\b

Options: case insensitive; ^ and $ match at line breaks

Assert position at a word boundary «\b»
Match the regular expression below «(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")»
   Match either the regular expression below (attempting the next alternative only if this one fails) «[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*»
      Match a single character present in the list below «[a-z0-9!#$%&'*+/=?^_`{|}~-]+»
         Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
         A character in the range between “a” and “z” «a-z»
         A character in the range between “0” and “9” «0-9»
         One of the characters “!#$%&'*+/=?^_`{|}” «!#$%&'*+/=?^_`{|}»
         The character “~” «~»
         The character “-” «-»
      Match the regular expression below «(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*»
         Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
         Match the character “.” literally «\.»
         Match a single character present in the list below «[a-z0-9!#$%&'*+/=?^_`{|}~-]+»
            Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
            A character in the range between “a” and “z” «a-z»
            A character in the range between “0” and “9” «0-9»
            One of the characters “!#$%&'*+/=?^_`{|}” «!#$%&'*+/=?^_`{|}»
            The character “~” «~»
            The character “-” «-»
   Or match regular expression number 2 below (the entire group fails if this one fails to match) «"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*"»
      Match the character “"” literally «"»
      Match the regular expression below «(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*»
         Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
         Match either the regular expression below (attempting the next alternative only if this one fails) «[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]»
            Match a single character present in the list below «[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]»
               A character in the range between ASCII character 0x01 (1 decimal) and ASCII character 0x08 (8 decimal) «\x01-\x08»
               ASCII character 0x0b (11 decimal) «\x0b»
               ASCII character 0x0c (12 decimal) «\x0c»
               A character in the range between ASCII character 0x0e (14 decimal) and ASCII character 0x1f (31 decimal) «\x0e-\x1f»
               ASCII character 0x21 (33 decimal) «\x21»
               A character in the range between ASCII character 0x23 (35 decimal) and ASCII character 0x5b (91 decimal) «\x23-\x5b»
               A character in the range between ASCII character 0x5d (93 decimal) and ASCII character 0x7f (127 decimal) «\x5d-\x7f»
         Or match regular expression number 2 below (the entire group fails if this one fails to match) «\\[\x01-\x09\x0b\x0c\x0e-\x7f]»
            Match the character “\” literally «\\»
            Match a single character present in the list below «[\x01-\x09\x0b\x0c\x0e-\x7f]»
               A character in the range between ASCII character 0x01 (1 decimal) and ASCII character 0x09 (9 decimal) «\x01-\x09»
               ASCII character 0x0b (11 decimal) «\x0b»
               ASCII character 0x0c (12 decimal) «\x0c»
               A character in the range between ASCII character 0x0e (14 decimal) and ASCII character 0x7f (127 decimal) «\x0e-\x7f»
      Match the character “"” literally «"»
Match the character “@” literally «@»
Match the regular expression below «(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])»
   Match either the regular expression below (attempting the next alternative only if this one fails) «(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?»
      Match the regular expression below «(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+»
         Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
         Match a single character present in the list below «[a-z0-9]»
            A character in the range between “a” and “z” «a-z»
            A character in the range between “0” and “9” «0-9»
         Match the regular expression below «(?:[a-z0-9-]*[a-z0-9])?»
            Between zero and one times, as many times as possible, giving back as needed (greedy) «?»
            Match a single character present in the list below «[a-z0-9-]*»
               Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
               A character in the range between “a” and “z” «a-z»
               A character in the range between “0” and “9” «0-9»
               The character “-” «-»
            Match a single character present in the list below «[a-z0-9]»
               A character in the range between “a” and “z” «a-z»
               A character in the range between “0” and “9” «0-9»
         Match the character “.” literally «\.»
      Match a single character present in the list below «[a-z0-9]»
         A character in the range between “a” and “z” «a-z»
         A character in the range between “0” and “9” «0-9»
      Match the regular expression below «(?:[a-z0-9-]*[a-z0-9])?»
         Between zero and one times, as many times as possible, giving back as needed (greedy) «?»
         Match a single character present in the list below «[a-z0-9-]*»
            Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
            A character in the range between “a” and “z” «a-z»
            A character in the range between “0” and “9” «0-9»
            The character “-” «-»
         Match a single character present in the list below «[a-z0-9]»
            A character in the range between “a” and “z” «a-z»
            A character in the range between “0” and “9” «0-9»
   Or match regular expression number 2 below (the entire group fails if this one fails to match) «\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\]»
      Match the character “[” literally «\[»
      Match the regular expression below «(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}»
         Exactly 3 times «{3}»
         Match the regular expression below «(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)»
            Match either the regular expression below (attempting the next alternative only if this one fails) «25[0-5]»
               Match the characters “25” literally «25»
               Match a single character in the range between “0” and “5” «[0-5]»
            Or match regular expression number 2 below (attempting the next alternative only if this one fails) «2[0-4][0-9]»
               Match the character “2” literally «2»
               Match a single character in the range between “0” and “4” «[0-4]»
               Match a single character in the range between “0” and “9” «[0-9]»
            Or match regular expression number 3 below (the entire group fails if this one fails to match) «[01]?[0-9][0-9]?»
               Match a single character present in the list “01” «[01]?»
                  Between zero and one times, as many times as possible, giving back as needed (greedy) «?»
               Match a single character in the range between “0” and “9” «[0-9]»
               Match a single character in the range between “0” and “9” «[0-9]?»
                  Between zero and one times, as many times as possible, giving back as needed (greedy) «?»
         Match the character “.” literally «\.»
      Match the regular expression below «(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)»
         Match either the regular expression below (attempting the next alternative only if this one fails) «25[0-5]»
            Match the characters “25” literally «25»
            Match a single character in the range between “0” and “5” «[0-5]»
         Or match regular expression number 2 below (attempting the next alternative only if this one fails) «2[0-4][0-9]»
            Match the character “2” literally «2»
            Match a single character in the range between “0” and “4” «[0-4]»
            Match a single character in the range between “0” and “9” «[0-9]»
         Or match regular expression number 3 below (attempting the next alternative only if this one fails) «[01]?[0-9][0-9]?»
            Match a single character present in the list “01” «[01]?»
               Between zero and one times, as many times as possible, giving back as needed (greedy) «?»
            Match a single character in the range between “0” and “9” «[0-9]»
            Match a single character in the range between “0” and “9” «[0-9]?»
               Between zero and one times, as many times as possible, giving back as needed (greedy) «?»
         Or match regular expression number 4 below (the entire group fails if this one fails to match) «[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+»
            Match a single character present in the list below «[a-z0-9-]*»
               Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
               A character in the range between “a” and “z” «a-z»
               A character in the range between “0” and “9” «0-9»
               The character “-” «-»
            Match a single character present in the list below «[a-z0-9]»
               A character in the range between “a” and “z” «a-z»
               A character in the range between “0” and “9” «0-9»
            Match the character “:” literally «:»
            Match the regular expression below «(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+»
               Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
               Match either the regular expression below (attempting the next alternative only if this one fails) «[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]»
                  Match a single character present in the list below «[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]»
                     A character in the range between ASCII character 0x01 (1 decimal) and ASCII character 0x08 (8 decimal) «\x01-\x08»
                     ASCII character 0x0b (11 decimal) «\x0b»
                     ASCII character 0x0c (12 decimal) «\x0c»
                     A character in the range between ASCII character 0x0e (14 decimal) and ASCII character 0x1f (31 decimal) «\x0e-\x1f»
                     A character in the range between ASCII character 0x21 (33 decimal) and ASCII character 0x5a (90 decimal) «\x21-\x5a»
                     A character in the range between ASCII character 0x53 (83 decimal) and ASCII character 0x7f (127 decimal) «\x53-\x7f»
               Or match regular expression number 2 below (the entire group fails if this one fails to match) «\\[\x01-\x09\x0b\x0c\x0e-\x7f]»
                  Match the character “\” literally «\\»
                  Match a single character present in the list below «[\x01-\x09\x0b\x0c\x0e-\x7f]»
                     A character in the range between ASCII character 0x01 (1 decimal) and ASCII character 0x09 (9 decimal) «\x01-\x09»
                     ASCII character 0x0b (11 decimal) «\x0b»
                     ASCII character 0x0c (12 decimal) «\x0c»
                     A character in the range between ASCII character 0x0e (14 decimal) and ASCII character 0x7f (127 decimal) «\x0e-\x7f»
      Match the character “]” literally «\]»
Assert position at a word boundary «\b»
-->
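Pasting a regex like the one above into Java source by hand is error-prone, because every backslash and every double quote has to be escaped. The following is only a rough sketch of a helper that does that mechanically; the class name `LiteralEscaper` and the method `toJavaLiteral` are made up for illustration, and it assumes the raw regex contains no line breaks or other characters that need special treatment in a string literal.

```java
public class LiteralEscaper {

    // Hypothetical helper: turns raw regex text into the source of a Java string literal.
    // Backslashes are doubled first; doing the quotes first would double the
    // backslashes that were just added in front of them.
    static String toJavaLiteral(String rawRegex) {
        String escaped = rawRegex
                .replace("\\", "\\\\")   // one backslash becomes two
                .replace("\"", "\\\"");  // a double quote gets a leading backslash
        return "\"" + escaped + "\"";
    }

    public static void main(String[] args) {
        // In practice the raw text would come from a file or the command line,
        // where no Java escaping applies. The fallback below is just a small sample.
        String raw = args.length > 0 ? args[0] : "\\b[a-z]+\\.[a-z]{2,4}\\b";
        System.out.println(toJavaLiteral(raw));
    }
}
```

The printed text can then be copied into the source file as the string literal, for example as an annotation value.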
Cylian
  • Thanks for the help, but I guess the issue I need to work out is the escaping issue. I will probably come up against this for all kinds of different regexes. – Ankur May 08 '12 at 05:20
  • Then just use this ``\\b[A-Z0-9._%+\\-]+@[A-Z0-9.\\-]+\\.[A-Z]{2,4}\\b`` (your regex, escaped). – Cylian May 08 '12 at 05:27

You can find RFC 2822 here:

http://www.ietf.org/rfc/rfc2822.txt

[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+(?:[A-Z]{2}|com|org|net|edu|gov|mil|biz|info|mobi|name|aero|asia|jobs|museum)\b
Bhavik Ambani

I will not get into the "this is the correct regex for email" thing, just one remark: Your regex will not accept all valid email addresses. See the link BalusC gave you in the comment.

Regarding the escaping: Java needs double escaping because it first treats the regex as a string literal and processes all escape sequences when the string is created. So just escape every backslash with a second backslash; a single backslash then remains in the string that is handed to the regex engine.

\\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,4}\\b

The dash at the end of a character class does not need to be escaped.
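To make this concrete, here is a minimal sketch. In a Java string literal `\b` is the backspace character and `\.` is not a valid escape at all, so the unescaped version from the question does not even compile. The class name `EscapeDemo`, the constant names, and the use of `Pattern.CASE_INSENSITIVE` are my assumptions for the demo; whether Play's `@Match` applies such a flag is not shown here, and an inline `(?i)` at the start of the pattern would be the alternative.

```java
import java.util.regex.Pattern;

public class EscapeDemo {

    // The regex from the question with every backslash doubled for the string literal.
    static final String EMAIL_REGEX = "\\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,4}\\b";

    // Assumption: case-insensitive matching, because the character classes
    // list only upper-case letters.
    static final Pattern EMAIL = Pattern.compile(EMAIL_REGEX, Pattern.CASE_INSENSITIVE);

    // True only if the whole input matches the pattern.
    static boolean looksLikeEmail(String input) {
        return EMAIL.matcher(input).matches();
    }

    public static void main(String[] args) {
        // Prints the pattern as the regex engine sees it, with single backslashes.
        System.out.println(EMAIL_REGEX);
        System.out.println(looksLikeEmail("someone@example.com"));
        System.out.println(looksLikeEmail("not-an-email"));
    }
}
```

The same doubled-backslash string is what would go into the annotation's `value`.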

stema