Consider the following PowerShell example of a universal regex.
To find all email addresses:
<(.*?)>
is handy if your server surrounds the email addresses with angle brackets.
(?<!Content-Type(.|\n){0,10000000})([a-zA-Z0-9.!#$%&''*+-/=?\^_``{|}~-]+@(?!abc.com)[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*)
is the one to use if not every email address in your header is wrapped in brackets. Note that this particular regex was copied from a community wiki answer on stackoverflow 201323 and modified here to exclude @abc.com. There are probably some edge cases this regex will not handle. The same page also has a really complex regex that looks like it would match every email address, but I don't have the time to modify that one to skip @abc.com.
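As a rough illustration of the bracket-only pattern, the sketch below runs `<(.*?)>` over a shortened header. The `$Header` and `$Bracketed` names and the three sample lines are assumptions made for this example. Note that the pattern returns anything between angle brackets, so a Message-ID comes back as well.

```powershell
# Hypothetical three-line header, used only for illustration.
$Header = @(
    'Return-Path: <example_from@abc123.com>'
    'Message-ID: <4129F3CA.2020509@abc.com>'
    'To: Jon Smith <example_to@mail.abc.com>'
) -join "`n"

# Group 1 is the text between each pair of angle brackets.
$Bracketed = [regex]::Matches($Header, '<(.*?)>') |
    ForEach-Object { $_.Groups[1].Value }

$Bracketed
# example_from@abc123.com
# 4129F3CA.2020509@abc.com
# example_to@mail.abc.com
```

Because nothing in this pattern checks for an @ sign or a domain, it is only suitable when every bracketed value in the header is known to be an address.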
Example
$Matches = @()
$String = 'Return-Path: <example_from@abc123.com>
X-SpamCatcher-Score: 1 [X]
Received: from [136.167.40.119] (HELO abc.com)
by fe3.abc.com (CommuniGate Pro SMTP 4.1.8)
with ESMTP-TLS id 61258719 for example_to@mail.abc.com;
Message-ID: <4129F3CA.2020509@abc.com>
Date: Wed, 21 Jan 2009 12:52:00 -0500 (EST)
From: Taylor Evans <Remember@To.Vote>
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.0.1)
X-Accept-Language: en-us, en
MIME-Version: 1.0
To: Jon Smith <example_to@mail.abc.com>
Subject: Business Development Meeting
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Transfer-Encoding: 7bit
Content-Type: multipart/alternative;
boundary="------------060102080402030702040100"
This is a multi-part message in MIME format.
--------------060102080402030702040100
Content-Type: text/plain; charset=ISO-8859-15; format=flowed
Content-Transfer-Encoding: 7bit
Hello,
this is an HTML mail, it has *bold*, /italic /and _underlined_ text.
And then we have a table here:
Cell(1,1)
Cell(2,1)
Cell(1,2) Cell(2,2)
And we put a picture here:
Image Alt Text
That''s it.
--------------060102080402030702040100
Content-Type: multipart/related;
boundary="------------030904080004010009060206"
--------------030904080004010009060206
Content-Type: text/html; charset=ISO-8859-15
Content-Transfer-Encoding: 7bit
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta http-equiv="content-type" content="text/html;
charset=ISO-8859-15">
</head>
<body bgcolor="#ffffff" text="#000000">
Hello,<br>
<br>
this is an HTML mail, it has <b>bold</b>, <i>italic </i>and <u>underlined</u>
text.<br>
And then we have a table here:<br>
<table border="1" cellpadding="2" cellspacing="2" height="62"
width="401">
<tbody>
<tr>
<td valign="top">Cell(1,1)<br>
</td>
<td valign="top">Cell(2,1)</td>
</tr>
<tr>
<td valign="top">Cell(1,2)</td>
<td valign="top">Cell(2,2)</td>
</tr>
</tbody>
</table>
<br>
And we put a picture here:<br>
<br>
<img alt="Image Alt Text"
src="cid:part1.FFFFFFFF.5555555@example.com" height="79"
width="98"><br>
<br>
That''s it. email me at test@email.com<br>
Subject: <br>
</body>
</html>'
# Write-Host start with
# write-host $String
Write-Host
Write-Host found
[array]$Found = ([regex]'(?<!Content-Type(.|\n){0,10000000})([a-zA-Z0-9.!#$%&''*+-/=?\^_`{|}~-]+@(?!abc.com)[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*)').matches($String)
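$Found | ForEach-Object {
    # The lookbehind is zero-width, so the whole match is the address itself.
    # Groups[1] is the (.|\n) group inside the lookbehind, not the address.
    Write-Host "key at $($_.Index) = '$($_.Value)'"
} # next match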
Write-Host "found $($Found.count) matching addresses"
Yields
found
key at 14 = 'example_from@abc123.com'
key at 200 = 'example_to@mail.abc.com'
key at 331 = 'Remember@To.Vote'
key at 485 = 'example_to@mail.abc.com'
found 4 matching addresses
Summary
(?<!Content-Type(.|\n){0,10000000})
prevents Content-Type from appearing within the 10,000,000 characters before the email address. This rejects any address that comes after the first Content-Type header, which in practice means addresses in the body of the message. Because the requester is using Java, and Java doesn't support the use of * inside a lookbehind, I'm using {0,10000000} instead (see also Regex look behind without obvious maximum length in Java). Be aware that this bound may introduce edge cases that are not captured as expected.
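If the lookbehind is a concern, a similar effect can be sketched without it: cut the text at the first Content-Type and search only what comes before. This is a different technique from the regex in this answer, not a drop-in replacement. The `$Message`, `$Head` and `$HeadAddresses` names, the sample text, and the simplified address pattern are all assumptions made to keep the example short.

```powershell
# Hypothetical message, shortened for illustration.
$Message = @(
    'From: Taylor Evans <Remember@To.Vote>'
    'To: Jon Smith <example_to@mail.abc.com>'
    'Content-Type: text/plain; charset=us-ascii'
    'That is it. email me at test@email.com'
) -join "`n"

# Keep only the text before the first Content-Type (the whole text if absent).
$Cut  = $Message.IndexOf('Content-Type', [System.StringComparison]::Ordinal)
$Head = if ($Cut -ge 0) { $Message.Substring(0, $Cut) } else { $Message }

# Simplified address pattern (an assumption, not the full pattern from this answer).
$Pattern = '[a-zA-Z0-9._-]+@(?!abc\.com)[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*'

$HeadAddresses = [regex]::Matches($Head, $Pattern) |
    ForEach-Object { $_.Value }

$HeadAddresses
# Remember@To.Vote
# example_to@mail.abc.com
```

The address in the last line is never seen by the regex, so no length bound is needed. The trade-off is that the cut happens in a separate step rather than inside a single pattern.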
<(.*?@(?!abc.com).*?)>
(
start the capture group that returns the address
[a-zA-Z0-9.!#$%&''*+-/=?\^_``{|}~-]+
match 1 or more allowed characters. The doubled single quote escapes the single quote character for PowerShell, and the doubled backtick escapes the backtick for stackoverflow. Note that +-/ inside the character class is a range from + to /, so it also admits a comma.
@
include the first at sign
(?!abc.com)
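reject the match if the domain begins with abc.com. The dot is unescaped, so it matches any character, and the check applies only immediately after the @ sign, which is why example_to@mail.abc.com is still returned.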
[a-zA-Z0-9-]+
match the first domain label: one or more letters, digits or hyphens, greedily, up to the first dot or the end of the address.
(?:\.[a-zA-Z0-9-]+)*)
match zero or more further labels, each a dot followed by one or more letters, digits or hyphens. The final ) closes the capture group.
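The behaviour of the `(?!abc.com)` piece can be checked in isolation. The sketch below compares the lookahead as written with a variant that escapes the dot. The `$Loose`, `$Strict` and `$Samples` names, the sample addresses, and the shortened local-part class are assumptions for illustration only.

```powershell
# Simplified patterns; only the lookahead differs between the two.
$Loose  = '[a-zA-Z0-9._-]+@(?!abc.com)[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*'
$Strict = '[a-zA-Z0-9._-]+@(?!abc\.com)[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*'

# Hypothetical sample addresses.
$Samples = 'user@abc.com', 'user@abcXcom.org', 'user@mail.abc.com'

# @() keeps the result an array even when only one sample matches.
$LooseHits  = @($Samples | Where-Object { $_ -match $Loose })
$StrictHits = @($Samples | Where-Object { $_ -match $Strict })

Write-Host "loose : $($LooseHits -join ', ')"
Write-Host "strict: $($StrictHits -join ', ')"
# loose : user@mail.abc.com
# strict: user@abcXcom.org, user@mail.abc.com
```

Both variants still accept `user@mail.abc.com`, because the lookahead only inspects the text directly after the @ sign. Whether subdomains of abc.com should also be excluded depends on the requirement.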