3

I've got a webhook posting to a form on my web application and I need to parse out the email header addresses.

Here is the source text:

Thread-Topic: test subject
Thread-Index: AcwE4mK6Jj19Hgi0SV6yYKvj2/HJbw==
From: "Lastname, Firstname" <firstname_lastname@domain.com>
To: <testto@domain.com>, testto1@domain.com, testto2@domain.com
Cc: <testcc@domain.com>, test3@domain.com
X-OriginalArrivalTime: 27 Apr 2011 13:52:46.0235 (UTC) FILETIME=[635226B0:01CC04E2]

I'm looking to pull out the following:

<testto@domain.com>, testto1@domain.com, testto2@domain.com

I've been struggling with Regex all day without any luck.

Ahmad Mageed
Kevin Jensen

5 Answers

6

Contrary to some of the posts here, I have to agree with mmutz: you cannot reliably parse email addresses with a regex. See RFC 2822, section 3.4.1:

https://www.rfc-editor.org/rfc/rfc2822#section-3.4.1

3.4.1. Addr-spec specification

An addr-spec is a specific Internet identifier that contains a locally interpreted string followed by the at-sign character ("@", ASCII value 64) followed by an Internet domain.

The idea of "locally interpreted" means that only the receiving server is expected to be able to interpret the part before the at-sign.

If I were going to try and solve this I would find the "To" line contents, break it apart and attempt to parse each segment with System.Net.Mail.MailAddress.

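    using System;
    using System.Net.Mail;
    using System.Text.RegularExpressions;

    class Program
    {
        static void Main()
        {
            string input = @"Thread-Topic: test subject
Thread-Index: AcwE4mK6Jj19Hgi0SV6yYKvj2/HJbw==
From: ""Lastname, Firstname"" <firstname_lastname@domain.com>
To: <testto@domain.com>, ""Yes, this is valid""@[emails are hard to parse!], testto1@domain.com, testto2@domain.com
Cc: <testcc@domain.com>, test3@domain.com
X-OriginalArrivalTime: 27 Apr 2011 13:52:46.0235 (UTC) FILETIME=[635226B0:01CC04E2]";

            // (?im-: ... ) switches on ignore-case and multi-line for the group,
            // so ^ and $ match at the start and end of the To line.
            Regex toline = new Regex(@"(?im-:^To\s*:\s*(?<to>.*)$)");
            string to = toline.Match(input).Groups["to"].Value;

            int from = 0;   // where the next comma search starts
            int pos = 0;    // start of the candidate address being accumulated
            int found;
            string test;

            while (from < to.Length)
            {
                // Next comma, or the end of the line when none is left (IndexOf returns -1).
                found = (found = to.IndexOf(',', from)) >= 0 ? found : to.Length;
                from = found + 1;
                test = to.Substring(pos, found - pos);

                // Skip empty segments (e.g. ",,"); MailAddress rejects an empty
                // string with ArgumentException rather than FormatException.
                if (test.Trim().Length == 0)
                {
                    pos = found + 1;
                    continue;
                }

                try
                {
                    MailAddress addy = new MailAddress(test.Trim());
                    Console.WriteLine(addy.Address);
                    pos = found + 1;
                }
                catch (FormatException)
                {
                    // Not a valid address yet (the comma may sit inside a quoted string):
                    // leave pos alone so the next pass retries with a longer segment.
                }
            }
        }
    }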

Output from the above program:

testto@domain.com
"Yes, this is valid"@[emails are hard to parse!]
testto1@domain.com
testto2@domain.com
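
A shorter variant of the same idea, as a minimal sketch only: System.Net.Mail.MailAddressCollection.Add(string) accepts a comma-separated list, so the framework can do the splitting itself. The names ToHeaderSketch and ParseAddresses are made up for illustration, and the sketch assumes the "To" value has already been pulled out of the header. Unlike the loop above, a single malformed entry makes Add throw, so nothing is returned for that line.

```csharp
using System;
using System.Collections.Generic;
using System.Net.Mail;

static class ToHeaderSketch
{
    // Hypothetical helper (not part of the answer above): turns an already
    // extracted "To" value into a list of bare addresses.
    public static List<string> ParseAddresses(string toValue)
    {
        var result = new List<string>();
        if (string.IsNullOrWhiteSpace(toValue))
            return result;

        var collection = new MailAddressCollection();
        try
        {
            // Add(string) takes one or more addresses separated by commas.
            collection.Add(toValue.Trim());
        }
        catch (FormatException)
        {
            // Assumption: callers prefer an empty list over an exception.
            return result;
        }

        foreach (MailAddress address in collection)
            result.Add(address.Address);
        return result;
    }
}
```

Calling ToHeaderSketch.ParseAddresses(to) with the captured "To" value would replace the while loop; whether all-or-nothing failure is acceptable depends on how messy the incoming mail is.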
csharptest.net
  • this looks very promising...doing some unit testing right now. – Kevin Jensen Apr 27 '11 at 17:58
  • @Blindy Yea, very "right-ISH" I agree. Without a library it's hopefully 'good-enough'. – csharptest.net Apr 27 '11 at 19:32
  • Yep I think 'good enough' is the right term. I'm going to log every request, and mark any messages that don't parse so I can re-evaluate after some volume. – Kevin Jensen Apr 27 '11 at 20:07
  • @csharptest.net Been using this code since 2017 without problems but all of a sudden my IDE started complaining about the regex: `'Option character' expected`. The problem here is the `?im-:` part. All modes following the `-` sign are turned off but there are none in your expression. IMO the only thing making sense here is `?im` (ignore case, multi-line mode) since C# Regex default modes are case-sensitive and single-line. You could also do `new Regex(@"(^To\s*:\s*(?<to>.*)$)", RegexOptions.IgnoreCase | RegexOptions.Multiline)` – Michael Schnerring Jun 09 '20 at 08:33
2

The RFC 2822-compliant email regex is:

(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])

Just run it over your text and you'll get the email addresses.

Of course, there's always the option of not using regex where regex isn't the best option. But up to you!

Blindy
  • BTW, your 'RFC' regex for emails does not handle quoted-string properly; it fails to match: "Yes, this is valid"@domain.com – csharptest.net Apr 27 '11 at 17:18
  • "almost" RFC-compliant then I guess. Just goes to show, regex isn't the best tool for this :) – Blindy Apr 27 '11 at 17:33
0

You cannot use regular expressions to parse RFC2822 mails, because their grammar contains a recursive production (off the top of my head, it was for comments (a (nested) comment)) which makes the grammar non-regular. Regular expressions (as the name suggests) can only parse regular grammars.

See also RegEx match open tags except XHTML self-contained tags for more information.
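
To make the recursion concrete, here is a rough sketch of stripping RFC 2822 comments with a depth counter, which is exactly the unbounded bookkeeping a classical regular expression cannot do. CommentSketch and StripComments are names I made up; the sketch assumes a single unfolded header value and only knows about comments, quoted strings and backslash escapes.

```csharp
using System;
using System.Text;

static class CommentSketch
{
    // Hypothetical helper: removes (possibly nested) parenthesised comments,
    // leaving quoted strings untouched.
    public static string StripComments(string input)
    {
        var output = new StringBuilder();
        int depth = 0;          // current comment nesting level
        bool inQuotes = false;  // inside a "quoted string" (only tracked at depth 0)

        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];

            if (c == '\\' && i + 1 < input.Length)
            {
                // A backslash escapes the next character; keep the pair only
                // when we are outside a comment.
                if (depth == 0)
                    output.Append(c).Append(input[i + 1]);
                i++;
                continue;
            }

            if (depth == 0 && c == '"')
            {
                inQuotes = !inQuotes;
                output.Append(c);
                continue;
            }

            if (!inQuotes && c == '(') { depth++; continue; }
            if (!inQuotes && c == ')' && depth > 0) { depth--; continue; }

            if (depth == 0)
                output.Append(c);
        }

        return output.ToString();
    }
}
```

The counter can grow without limit, which is why the grammar is not regular; a fixed pattern can only cope with a nesting depth chosen in advance.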

Marc Mutz - mmutz
  • While you are right in an academic context, any PCRE (which C#'s implementation is part of) is more than a plain old regular expression parser, it's closer to a context free grammar parser, which can indeed parse recursive parentheses. This is a case of technology outgrowing the name of the construct. – Blindy Apr 27 '11 at 19:08
0

As Blindy suggests, sometimes you can just parse it out the old-fashioned way.

If you prefer to do that, here is a quick approach assuming the email header text is called 'header':

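const string toPrefix = "To: ";
// Start just past the "To: " label so the label itself is not included.
int start = header.IndexOf(toPrefix) + toPrefix.Length;
// Look for "Cc: " only after the To line.
int end = header.IndexOf("Cc: ", start);
// Trim drops the line break that precedes "Cc: ".
string x = header.Substring(start, end - start).Trim();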

I may be off by a byte on the subtraction but you can very easily test and modify this. Of course you will also have to be certain you always will have a Cc: row in your header or this won't work.
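
If a Cc: row isn't guaranteed, one way around it (a sketch under assumptions, not something I've run against real mail) is to read the To line by itself and append any folded continuation lines, which RFC 2822 marks with leading whitespace. HeaderLineSketch and GetHeaderValue are hypothetical names, and the code assumes 'header' holds the raw header block with CRLF or LF line endings.

```csharp
using System;
using System.Text;

static class HeaderLineSketch
{
    // Hypothetical helper: returns the unfolded value of the first header
    // called `name`, or null when the header block does not contain it.
    public static string GetHeaderValue(string header, string name)
    {
        string[] lines = header.Split(new[] { "\r\n", "\n" }, StringSplitOptions.None);
        string prefix = name + ":";

        for (int i = 0; i < lines.Length; i++)
        {
            // An empty line ends the header block; anything after it is body text.
            if (lines[i].Length == 0)
                break;

            if (!lines[i].StartsWith(prefix, StringComparison.OrdinalIgnoreCase))
                continue;

            var value = new StringBuilder(lines[i].Substring(prefix.Length));

            // Folded continuation lines start with a space or a tab.
            for (int j = i + 1; j < lines.Length; j++)
            {
                if (lines[j].Length == 0 || (lines[j][0] != ' ' && lines[j][0] != '\t'))
                    break;
                value.Append(lines[j]);
            }

            return value.ToString().Trim();
        }

        return null;
    }
}
```

With this, HeaderLineSketch.GetHeaderValue(header, "To") no longer depends on which header happens to follow the To line.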

nycdan
0

There's a breakdown of validating emails with regex here, which references a more practical implementation of RFC 2822 with:

[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?

It also looks like you only want the email addresses out of the "To" field, and you've got the <> to worry about as well, so something like the following would likely work:

^To: ((?:\<?[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\>?,?(?:\s*))*)

Again, as others have mentioned, you might not want to do this. But if you want a regex that will turn that input into <testto@domain.com>, testto1@domain.com, testto2@domain.com, that'll do it.
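
For what it's worth, here is roughly how that pattern could be wired up in C#; treat it as a sketch rather than tested production code. ToRegexSketch and ExtractToAddresses are names invented for the example. I'm assuming RegexOptions.IgnoreCase (the character classes only list lowercase letters) and RegexOptions.Multiline (so ^ matches at the start of the To line instead of the start of the whole text).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class ToRegexSketch
{
    // The "To" pattern from above, unchanged; only the options are added here.
    static readonly Regex ToLine = new Regex(
        @"^To: ((?:\<?[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\>?,?(?:\s*))*)",
        RegexOptions.IgnoreCase | RegexOptions.Multiline);

    // Hypothetical helper: returns the bare addresses found on the To line.
    public static List<string> ExtractToAddresses(string header)
    {
        var addresses = new List<string>();
        Match match = ToLine.Match(header);
        if (!match.Success)
            return addresses;

        // Group 1 holds the whole list. This pattern never matches quoted
        // local parts, so no comma can appear inside an address and a plain
        // split is enough.
        foreach (string part in match.Groups[1].Value.Split(','))
        {
            string address = part.Trim().Trim('<', '>');
            if (address.Length > 0)
                addresses.Add(address);
        }

        return addresses;
    }
}
```

The trailing \s* in the pattern also swallows the line break after the last address, which is why each piece is trimmed before use.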

Ian Pugsley