8

I have not found a regexp that does this. I need to validate the "Message-ID:" value of an email. It is similar to an email-address validation regexp but much simpler, without most of the edge cases that email addresses allow. From RFC 2822:

msg-id          =       [CFWS] "<" id-left "@" id-right ">" [CFWS] 
id-left         =       dot-atom-text / no-fold-quote / obs-id-left
id-right        =       dot-atom-text / no-fold-literal / obs-id-right
no-fold-quote   =       DQUOTE *(qtext / quoted-pair) DQUOTE
no-fold-literal =       "[" *(dtext / quoted-pair) "]"

Let's say the outer <> are optional. The definition of dot-atom-text and the other missing rules can be found in RFC 2822.

I am not proficient in regex, and I would prefer to use an already tested one, if one exists.

Persimmonium

4 Answers

8

If anyone's interested, one of our senior architects worked through the many layers of RFC 2822 and came up with the following regex which includes quoting on the left and right sides. The spec says that new implementations should not use the obsolete characters, so this regex does not allow them:

((([a-zA-Z0-9!#$%&'*+/=?^_`{|}~-]+(\.[a-zA-Z0-9!#$%&'*+/=?^_`{|}~-]+)*)|("(([\x01-\x08\x0B\x0C\x0E-\x1F\x7F]|[\x21\x23-\x5B\x5D-\x7E])|(\\[\x01-\x09\x0B\x0C\x0E-\x7F]))*"))@(([a-zA-Z0-9!#$%&'*+/=?^_`{|}~-]+(\.[a-zA-Z0-9!#$%&'*+/=?^_`{|}~-]+)*)|(\[(([\x01-\x08\x0B\x0C\x0E-\x1F\x7F]|[\x21-\x5A\x5E-\x7E])|(\\[\x01-\x09\x0B\x0C\x0E-\x7F]))*\])))
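
A regex of this size is hard to audit by eye. As a rough sketch (not the original author's code), the same structure can be assembled in Java from named pieces that mirror the RFC 2822 grammar rules quoted in the question. The class name `MsgIdPattern` and the constant names are made up for illustration, and the optional outer `<>` follows the question's relaxation rather than the RFC. Like the regex above, it leaves out the obsolete forms and CFWS.

```java
import java.util.regex.Pattern;

// Hypothetical helper: builds a msg-id pattern from pieces named after the RFC 2822 rules.
final class MsgIdPattern {
    // atext (RFC 2822, section 3.2.4)
    static final String ATEXT = "[a-zA-Z0-9!#$%&'*+/=?^_`{|}~-]";
    // dot-atom-text = 1*atext *("." 1*atext)
    static final String DOT_ATOM_TEXT = ATEXT + "+(?:\\." + ATEXT + "+)*";
    // NO-WS-CTL character ranges, reused inside the qtext and dtext classes
    static final String NO_WS_CTL = "\\x01-\\x08\\x0B\\x0C\\x0E-\\x1F\\x7F";
    static final String QTEXT = "[" + NO_WS_CTL + "\\x21\\x23-\\x5B\\x5D-\\x7E]";
    static final String DTEXT = "[" + NO_WS_CTL + "\\x21-\\x5A\\x5E-\\x7E]";
    // quoted-pair = "\" text
    static final String QUOTED_PAIR = "\\\\[\\x01-\\x09\\x0B\\x0C\\x0E-\\x7F]";
    static final String NO_FOLD_QUOTE = "\"(?:" + QTEXT + "|" + QUOTED_PAIR + ")*\"";
    static final String NO_FOLD_LITERAL = "\\[(?:" + DTEXT + "|" + QUOTED_PAIR + ")*\\]";
    static final String ID_LEFT = "(?:" + DOT_ATOM_TEXT + "|" + NO_FOLD_QUOTE + ")";
    static final String ID_RIGHT = "(?:" + DOT_ATOM_TEXT + "|" + NO_FOLD_LITERAL + ")";
    static final String CORE = ID_LEFT + "@" + ID_RIGHT;
    // Assumption from the question: the outer <> are optional, but they must be balanced.
    static final Pattern MSG_ID = Pattern.compile("(?:<" + CORE + ">|" + CORE + ")");

    static boolean matches(String s) {
        return MSG_ID.matcher(s).matches();
    }
}
```

Calling `MsgIdPattern.matches("<abc.def@example.com>")` should accept the ID, while something like `a..b@example.com` should be rejected because dot-atom-text does not allow empty components.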
Nathan
  • And here is that as a Python regex string, with non-capturing groups: `r'((?:(?:[a-zA-Z0-9!#$%&\'*+/=?^_\`{|}~-]+(?:\.[a-zA-Z0-9!#$%&\'*+/=?^_\`{|}~-]+)*)|(?:"(?:(?:[\x01-\x08\x0B\x0C\x0E-\x1F\x7F]|[\x21\x23-\x5B\x5D-\x7E])|(?:\\[\x01-\x09\x0B\x0C\x0E-\x7F]))*"))@(?:(?:[a-zA-Z0-9!#$%&\'*+/=?^_\`{|}~-]+(?:\.[a-zA-Z0-9!#$%&\'*+/=?^_\`{|}~-]+)*)|(?:\[(?:(?:[\x01-\x08\x0B\x0C\x0E-\x1F\x7F]|[\x21-\x5A\x5E-\x7E])|(?:\\[\x01-\x09\x0B\x0C\x0E-\x7F]))*\])))'` – Alastair Irvine Aug 10 '18 at 18:16
3

As I could not find one, I ended up implementing it myself. It is not a proper validation per RFC 2822, but it is a good enough approximation for now:

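import java.util.regex.Pattern;
import org.apache.commons.lang3.StringUtils; // assumed: Apache Commons Lang 3 (the original showed no imports)

// Class wrapper added so the snippet compiles; the class name is illustrative.
public class MessageIdValidator {

    // dot-atom-text "@" dot-atom-text. Note that this matches lowercase letters only:
    // lowercase the input first, or compile with Pattern.CASE_INSENSITIVE, if needed.
    static final String VALIDMIDPATTERN = "[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*";
    private static final Pattern patvalidmid = Pattern.compile(VALIDMIDPATTERN);

    public static boolean isMessageIdValid(String midt) {
        // A null input would otherwise reach trim() below and throw.
        if (midt == null)
            return false;
        String mid = midt;
        // At most one "<" and one ">" are allowed.
        if (StringUtils.countMatches(mid, "<") > 1)
            return false;
        if (StringUtils.countMatches(mid, ">") > 1)
            return false;
        // Extract the ID from between "<" and ">" when brackets are present.
        if (StringUtils.containsAny(mid, "<>")) {
            mid = StringUtils.substringBetween(mid, "<", ">");
            if (StringUtils.isBlank(mid)) {
                return false;
            }
        }
        // Reject empty dot-atom components early.
        if (StringUtils.contains(mid, "..")) {
            return false;
        }
        mid = mid.trim();
        // Now validate against the pattern.
        return patvalidmid.matcher(mid).matches();
    }
}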
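
The method above relies on Apache Commons Lang's StringUtils. For readers who want to avoid that dependency, here is a minimal sketch that folds the bracket handling into the regex itself and uses only java.util.regex. It is not the code that ran in production: the class name `SimpleMessageIdCheck` is made up, it is stricter than the original (text outside the brackets is rejected), and it assumes case-insensitive matching is wanted. Like the original, it covers only the dot-atom-text form.

```java
import java.util.regex.Pattern;

// Hypothetical dependency-free variant; dot-atom-text on both sides of "@" only.
final class SimpleMessageIdCheck {
    private static final String ATOM = "[a-z0-9!#$%&'*+/=?^_`{|}~-]+";
    private static final String DOT_ATOM = ATOM + "(?:\\." + ATOM + ")*";
    private static final String CORE = DOT_ATOM + "@" + DOT_ATOM;

    // The brackets are optional but must be balanced: both present or both absent.
    // CASE_INSENSITIVE is an assumption so that uppercase letters are accepted too.
    private static final Pattern MID =
            Pattern.compile("(?:<" + CORE + ">|" + CORE + ")", Pattern.CASE_INSENSITIVE);

    static boolean isValid(String s) {
        // Surrounding whitespace is ignored, as in the trim() call of the original.
        return s != null && MID.matcher(s.trim()).matches();
    }
}
```

Because the dot-atom pattern requires at least one character between dots, no separate check for ".." is needed.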
Persimmonium
  • Does this regex only cover the `dot-atom-text` part of `id-left` and `id-right`? I think `obs-id-left` would allow virtually any ASCII character. – Nathan Jul 12 '13 at 02:25
  • @Nathan: so long ago...don't remember anymore. Probably you are right, but I can tell you this impl. has been used to process many millions of message-ids in a commercial product successfully. So it is probably safe to use. – Persimmonium Jul 12 '13 at 07:02
  • Thanks for the confirmation on your production usage - that helps confirm that the quoting and obsolete parts of id-left and id-right aren't actually used in practice. – Nathan Jul 14 '13 at 23:04
0

It is not possible to perfectly match an RFC2822 Message-ID using standard regular expressions because the CFWS rule allows nesting of comments, which regexes can't cope with. e.g.

<foo@bar.com> (comment (another comment))
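
One way around this limitation is to strip the comments with a small hand-written scanner before handing the remainder to a regex. The following is a minimal sketch under a stated assumption: the message ID itself contains no parentheses, which holds for the dot-atom-text form but not necessarily for quoted IDs. The class and method names are made up for illustration.

```java
// Hypothetical pre-processing step: removes RFC 2822 comments, which may nest.
final class CfwsStripper {
    /** Returns the input without comments, trimmed, or null if the parentheses are unbalanced. */
    static String stripComments(String s) {
        StringBuilder out = new StringBuilder();
        int depth = 0; // current comment nesting level
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (depth > 0) {
                if (c == '\\') {
                    i++;          // quoted-pair inside a comment: skip the escaped character
                } else if (c == '(') {
                    depth++;      // nested comment opens
                } else if (c == ')') {
                    depth--;      // comment closes
                }
                continue;         // everything inside a comment is dropped
            }
            if (c == '(') {
                depth = 1;
            } else if (c == ')') {
                return null;      // closing parenthesis without a matching opener
            } else {
                out.append(c);
            }
        }
        return depth == 0 ? out.toString().trim() : null;
    }
}
```

For the example above, `CfwsStripper.stripComments("<foo@bar.com> (comment (another comment))")` yields `<foo@bar.com>`, which a regex can then validate.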
KingPong
-3

Try something like: ^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}$

cordellcp3
  • this is not correct, it is not exactly like an email address, the part after @ does not need to be 'something.something' – Persimmonium Oct 19 '10 at 14:49