I have a regex that detects any email address. I am trying to create a regex that looks only at the header of an email message, counts the email addresses it finds there, and ignores addresses from a specific domain (abc.com).

For example, if the header contains ten addresses like 1@test.com and an eleventh like 2@abc.com, the count should be ten.

Current regex:

^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$

user1181862
  • What language are you using? – David Starkey May 02 '13 at 20:14
  • Are you aware that it's not possible to create a regex that covers every valid e-mail address per the RFC? – Violet Giraffe May 02 '13 at 20:52
  • Don't use a regex for that. Use a full-blown email header parser, then filter for the domains you want (with a regex if need be), and count the result. – Bergi May 02 '13 at 21:53
  • I am using the Java regex engine. – user1181862 May 03 '13 at 00:31
  • This link, http://stackoverflow.com/questions/201323/using-a-regular-expression-to-validate-an-email-address, has a pretty good regex for finding email addresses that avoids lookbehinds, which are flaky in Java. – Ro Yo Mi May 03 '13 at 03:08

1 Answer

Consider the following PowerShell example of a general-purpose regex.

To find all email addresses:

  • <(.*?)> is handy if your server surrounds the email addresses with angle brackets.
  • (?<!Content-Type(.|\n){0,10000000})([a-zA-Z0-9.!#$%&''*+-/=?\^_``{|}~-]+@(?!abc.com)[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*) works if not every email address in your header has brackets around it. Note that this regex was copied from a community wiki answer on Stack Overflow question 201323 and modified here to reject @abc.com. There are probably edge cases it does not handle. The same page has a much more complex regex that looks like it would match every email address, but I don't have the time to modify that one to skip @abc.com.
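
Since the requester is on Java, here is a minimal sketch of the domain-exclusion part alone in that language. It is not the regex above: the local part is simplified, the dot in abc.com is escaped so it only matches a literal dot, and the class and method names (LookaheadSketch, findAddresses) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadSketch {
    // Simplified address pattern (an assumption, not RFC-complete).
    // (?!abc\.com) rejects a match when "abc.com" directly follows the at sign.
    private static final Pattern ADDRESS = Pattern.compile(
        "[a-zA-Z0-9._%+-]+@(?!abc\\.com)[a-zA-Z0-9-]+(?:\\.[a-zA-Z0-9-]+)*");

    // Returns every address in the text that the pattern accepts, in order.
    static List<String> findAddresses(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = ADDRESS.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String header = "From: 1@test.com\nTo: 2@abc.com\nCc: 3@test.com\n";
        // prints [1@test.com, 3@test.com]
        System.out.println(findAddresses(header));
    }
}
```

The size of the returned list is the count the question asks for; this sketch does nothing yet to keep addresses in the message body out of the result.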

Example

    $Matches = @()
    $String = 'Return-Path: <example_from@abc123.com>
X-SpamCatcher-Score: 1 [X]
Received: from [136.167.40.119] (HELO abc.com)
    by fe3.abc.com (CommuniGate Pro SMTP 4.1.8)
    with ESMTP-TLS id 61258719 for example_to@mail.abc.com;
Message-ID: <4129F3CA.2020509@abc.com>
Date: Wed, 21 Jan 2009 12:52:00 -0500 (EST)
From: Taylor Evans <Remember@To.Vote>
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.0.1)
X-Accept-Language: en-us, en
MIME-Version: 1.0
To: Jon Smith <example_to@mail.abc.com>
Subject: Business Development Meeting
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit
Content-Type: multipart/alternative;
boundary="------------060102080402030702040100"
This is a multi-part message in MIME format.
--------------060102080402030702040100
Content-Type: text/plain; charset=ISO-8859-15; format=flowed
Content-Transfer-Encoding: 7bit
Hello,
this is an HTML mail, it has *bold*, /italic /and _underlined_ text.
And then we have a table here:
Cell(1,1)
Cell(2,1)
Cell(1,2) Cell(2,2)
And we put a picture here:
Image Alt Text
That''s it.
--------------060102080402030702040100
Content-Type: multipart/related;
boundary="------------030904080004010009060206"
--------------030904080004010009060206
Content-Type: text/html; charset=ISO-8859-15
Content-Transfer-Encoding: 7bit
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta http-equiv="content-type" content="text/html;
charset=ISO-8859-15">
</head>
<body bgcolor="#ffffff" text="#000000">
Hello,<br>
<br>
this is an HTML mail, it has <b>bold</b>, <i>italic </i>and <u>underlined</u>
text.<br>
And then we have a table here:<br>
<table border="1" cellpadding="2" cellspacing="2" height="62"
width="401">
<tbody>
<tr>
<td valign="top">Cell(1,1)<br>
</td>
<td valign="top">Cell(2,1)</td>
</tr>
<tr>
<td valign="top">Cell(1,2)</td>
<td valign="top">Cell(2,2)</td>
</tr>
</tbody>
</table>
<br>
And we put a picture here:<br>
<br>
<img alt="Image Alt Text"
src="cid:part1.FFFFFFFF.5555555@example.com" height="79"
width="98"><br>
<br>
That''s it. email me at test@email.com<br>
Subject: <br>
</body>
</html>'

# Write-Host start with
# Write-Host $String
Write-Host
Write-Host found
[array]$Found = ([regex]'(?<!Content-Type(.|\n){0,10000000})([a-zA-Z0-9.!#$%&''*+-/=?\^_`{|}~-]+@(?!abc.com)[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*)').Matches($String)

$Found | ForEach-Object {
    # The lookbehind is zero-width, so each match is exactly the address.
    # Groups[1] would be the (.|\n) group inside the lookbehind, not the address.
    Write-Host "key at $($_.Index) = '$($_.Value)'"
} # next match
Write-Host "found $($Found.Count) matching addresses"

Yields

found
key at 14 = 'example_from@abc123.com'
key at 200 = 'example_to@mail.abc.com'
key at 331 = 'Remember@To.Vote'
key at 485 = 'example_to@mail.abc.com'
found 4 matching addresses
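
Note that example_to@mail.abc.com is still listed, because the lookahead only rejects abc.com when it comes directly after the at sign. If addresses at subdomains of abc.com should be skipped as well (an assumption; the question does not say), a Java sketch along these lines might do it. The address pattern is simplified, and the names SubdomainFilterSketch and countOutsideAbc are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SubdomainFilterSketch {
    // The lookahead rejects abc.com and any subdomain of it. The inner
    // (?!...) makes sure the domain really ends at "abc.com", so that
    // abc.com.au or abc.community are not rejected by accident.
    private static final Pattern ADDRESS = Pattern.compile(
        "[a-z0-9._%+-]+@"
        + "(?!(?:[a-z0-9-]+\\.)*abc\\.com(?![a-z0-9-]|\\.[a-z0-9-]))"
        + "[a-z0-9-]+(?:\\.[a-z0-9-]+)*",
        Pattern.CASE_INSENSITIVE);

    // Counts the addresses in the text that are not at abc.com or a subdomain.
    static int countOutsideAbc(String text) {
        int count = 0;
        Matcher m = ADDRESS.matcher(text);
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String header = "for example_to@mail.abc.com;\n"
            + "From: Taylor Evans <Remember@To.Vote>\n"
            + "Return-Path: <example_from@abc123.com>\n";
        // prints 2 (Remember@To.Vote and example_from@abc123.com)
        System.out.println(countOutsideAbc(header));
    }
}
```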

Summary

  • (?<!Content-Type(.|\n){0,10000000}) rejects any address that has Content-Type somewhere in the 10,000,000 characters before it. This keeps addresses in the body of the message from matching. Because the requester is using Java, and Java doesn't support the use of * inside a lookbehind, I'm using {0,10000000} instead (see also Regex look behind without obvious maximum length in Java). Be aware that this may introduce edge cases that are not captured as expected.
  • ([a-zA-Z0-9.!#$%&''*+-/=?\^_``{|}~-]+@(?!abc.com)[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*)
    • ( starts the group that is returned.
    • [a-zA-Z0-9.!#$%&''*+-/=?\^_``{|}~-]+ matches one or more allowed characters. The doubled single quote escapes the single quote character for PowerShell, and the doubled backtick escapes the backtick for Stack Overflow.
    • @ matches the at sign.
    • (?!abc.com) rejects the match if abc.com comes directly after the at sign. A subdomain such as mail.abc.com is not rejected.
    • [a-zA-Z0-9-]+ matches the first part of the domain, up to the first dot or the end of the string.
    • (?:\.[a-zA-Z0-9-]+)*) matches any further domain parts, each preceded by a dot, and closes the returned group.
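
As the comments on both the question and this answer suggest, an alternative to the large lookbehind is to cut the message at the end of the header first and only then search for addresses. The Java sketch below assumes the message follows the usual layout where the header ends at the first empty line; the sample string in this answer has no such line, so it would need a different split rule. The address pattern is simplified, and the names HeaderCountSketch, headerOf and countHeaderAddresses are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HeaderCountSketch {
    // First empty line, with either LF or CRLF line endings.
    private static final Pattern BLANK_LINE = Pattern.compile("\\r?\\n\\r?\\n");

    // Simplified address pattern (an assumption, not RFC-complete) that
    // rejects a domain starting with abc.com right after the at sign.
    private static final Pattern ADDRESS = Pattern.compile(
        "[a-zA-Z0-9._%+-]+@(?!abc\\.com)[a-zA-Z0-9-]+(?:\\.[a-zA-Z0-9-]+)*");

    // Returns everything before the first empty line, or the whole
    // message if there is no empty line.
    static String headerOf(String message) {
        Matcher m = BLANK_LINE.matcher(message);
        return m.find() ? message.substring(0, m.start()) : message;
    }

    // Counts accepted addresses in the header part only.
    static int countHeaderAddresses(String message) {
        Matcher m = ADDRESS.matcher(headerOf(message));
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String message = "From: a@test.com\r\n"
            + "To: b@abc.com, c@test.com\r\n"
            + "\r\n"
            + "The body mentions d@test.com\r\n";
        // prints 2 (a@test.com and c@test.com)
        System.out.println(countHeaderAddresses(message));
    }
}
```

With the body removed up front, the address regex needs no lookbehind at all, which sidesteps Java's restriction on lookbehind length.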
Ro Yo Mi
  • I would like to try to do this in a simple, single .string.find line. If I could just isolate the header for email addresses, that could help. – user1181862 May 03 '13 at 03:14
  • Updated the answer to allow for a single line regex and no logic – Ro Yo Mi May 03 '13 at 03:36
  • Addresses are not guaranteed to be in wedges; `To: billg@example.com` is perfectly valid. – tripleee May 03 '13 at 05:54
  • After testing, it seems the mail system doesn't add the <> around email addresses, just "TO" and "CC" – user1181862 May 03 '13 at 12:57
  • Alrighty then. [RFC 2822 section 3.4](http://tools.ietf.org/html/rfc2822#section-3.4) does say that addresses are enclosed in angle brackets, but it goes on to say that, in an alternate simple form, an address can appear alone without brackets. So I've updated this answer with a regex that pulls the most common forms of email addresses, and pointed the requester to a page with an amazing regex that will probably match every case. – Ro Yo Mi May 03 '13 at 15:08
  • Thanks so much for your help. It's perfect at detecting email addresses, but the issue now is isolating just the header and not detecting emails in the body. – user1181862 May 04 '13 at 17:15
  • Java doesn't support lookbehinds of unbounded length, so it would be better to first separate the email into two parts (header and content) and then search the header portion. I updated the regex here to include `(?<!Content-Type(.|\n){0,10000000})`, which prevents email addresses from being found after a `Content-Type` string. Of course, this only works when fewer than 10 million characters separate the email address from the `Content-Type` string. There are edge cases that will fail this test and may return email addresses even after the `Content-Type` string. – Ro Yo Mi May 04 '13 at 20:42