4

I am pulling many emails from an Exchange 2003 server and trying to determine which of them are bounce-backs, so I can remove the invalid addresses from our contacts.

What would be the most efficient method of searching the email bodies to find the email addresses in the bounce-backs?

Paul
  • Efficient in what sense? In terms of speed? Accuracy? – Dan Diplo Jun 23 '11 at 19:31
  • Speed more than accuracy. All customers are USA (limited alphabet). I've looked here http://stackoverflow.com/questions/1028553/how-to-get-email-address-from-a-long-string, but that is a PHP answer, and I am not sure about splitting a C# string on white space (probably slow). Is Regex the way to go? – Paul Jun 23 '11 at 19:31
  • Probably a regular expression. Do you have some example text? – agent-j Jun 23 '11 at 19:31
  • Sample text is all over the place. Some are sys admin messages, others are specific to the email receiver. So any arbitrary message from any host. If there is an email in the body (I don't care what it is) I am going to match that email back against the emails sent and assume it is bad. – Paul Jun 23 '11 at 19:34
  • Regular expressions aren't very good for email addresses. They're [very hard to get right](http://ex-parrot.com/~pdw/Mail-RFC822-Address.html). – Brad Mace Jun 24 '11 at 06:21

4 Answers

2

You might want to look at this page, which has several variants of regexes for matching email addresses and explains the trade-offs for selecting each. You should definitely read it before picking one here.

plinth
1

Just use a regex. Match it case-insensitively, since the character classes only list uppercase letters:

\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b
therealmitchconnors
0

This is the regex that we use in a lot of our applications for email validation:

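// Requires: using System.Text.RegularExpressions;
public static bool CheckEmail(string email)
{
    // Validates a single candidate address. The ^ and $ anchors mean the whole
    // input must be an address, so this will not find one inside a message body.
    Regex regex = new Regex(@"^([a-zA-Z0-9_\-\.\']+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})$", RegexOptions.IgnoreCase);
    Match match = regex.Match(email);
    return match.Success;
}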

The actual process for correctly identifying a bounced email, rather than an auto-reply or a genuine message, is a little more complicated, but this will at least give you the email address.

ChrisBint
  • I'm developing the rules as I go, and then evaluating the "unmanaged" emails to create more rules. Real replies have their email in the 'From' field, which I can match back to the addresses we sent to. But the Regex is very helpful. I haven't had to touch regular expressions in a very long time. – Paul Jun 23 '11 at 19:40
  • I would be wary of the assumption that 'real' replies have their email in the From address; in my experience this is not the case. We actually use a commercial program that scans each email and assigns it a category, along with executing a stored procedure to insert the results directly into our DB. Nice and easy, and it saves me the hassle of writing my own rules. – ChrisBint Jun 23 '11 at 19:47
0

I pulled a few of the answers here into something like this. It returns every email address found in the string (sometimes there are several, such as the mail host's address and the target address). I can then match each of those addresses against the outbound addresses we sent to, to verify. I used the article from @plinth to get a better understanding of the regular expression and modified the code from @Chris Bint.

However, I'm still wondering whether this is the fastest way to process 10,000+ emails. Are there any more efficient methods (while still using C#)? The live code won't recreate the Regex object every time within the loop.

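// Requires: using System.Text.RegularExpressions;
public static MatchCollection CheckEmail(string email)
{
  // IgnoreCase is needed because the character classes list only uppercase letters.
  // Constructed per call here for brevity; reuse one instance when looping.
  Regex regex = new Regex(@"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b", RegexOptions.IgnoreCase);
  MatchCollection matches = regex.Matches(email);

  return matches;
}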
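The matching step described above (comparing each extracted address against the addresses we sent to) might look roughly like the sketch below. The names `BounceMatcher`, `FindBouncedAddresses`, and `sentAddresses` are made up for illustration, and it assumes the outbound addresses fit in memory as a case-insensitive `HashSet<string>`, so each lookup is a hash probe instead of a scan of the whole list. Whether `RegexOptions.Compiled` actually helps would need measuring on real data.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class BounceMatcher
{
    // Same pattern as above, built once and reused for every message.
    // RegexOptions.Compiled is an assumption: it trades startup cost for
    // faster matching, which may or may not pay off for this workload.
    private static readonly Regex EmailRegex = new Regex(
        @"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Returns the addresses found in the body that are also in sentAddresses.
    // sentAddresses is assumed to use StringComparer.OrdinalIgnoreCase.
    public static HashSet<string> FindBouncedAddresses(string body, HashSet<string> sentAddresses)
    {
        var bounced = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        foreach (Match match in EmailRegex.Matches(body))
        {
            if (sentAddresses.Contains(match.Value))
            {
                bounced.Add(match.Value); // the set drops repeated addresses
            }
        }
        return bounced;
    }
}
```

The set of sent addresses would be built once, before the loop over the messages, and `FindBouncedAddresses` called once per body. Addresses that appear in a body but were never sent to (a postmaster address, for example) are simply ignored.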
Paul