5

I have wrote a script to grab different fields in an HTML file and populate variables with the results. I'm having issues with the regular expression for grabbing the email. Here is some sample code:

$txt='<p class=FillText><a name="InternetMail_P3"></a>First.Last@company-name.com</p>'

$re='.*?'+'([\\w-+]+(?:\\.[\\w-+]+)*@(?:[\\w-]+\\.)+[a-zA-Z]{2,7})'

if ($txt -match $re)
{
    $email1=$matches[1]
    write-host "$email1"
}

I get the following error:

Bad argument to operator '-match': parsing ".*?([\\w-+]+(?:\\.[\\w-+]+)*@(?:[\\w-]+\\
.)+[a-zA-Z]{2,7})([\\w-+]+(?:\\.[\\w-+]+)*@(?:[\\w-]+\\.)+[a-zA-Z]{2,7})" - [x-y] range in reverse order..
At line:7 char:16
+ if ($txt -match <<<<  $re)
    + CategoryInfo          : InvalidOperation: (:) [], RuntimeException
    + FullyQualifiedErrorId : BadOperatorArgument

What am I missing here? Also, is there a better regex for email?

Thanks in advance.

chiwangc
  • 3,566
  • 16
  • 26
  • 32
gp80586
  • 169
  • 2
  • 4
  • 13

2 Answers2

11

Actually any regex that is suitable for .Net or C# will work for PowerShell. And you could find tons and tons samples at stackoverflow and inet. For example: How to Find or Validate an Email Address: The Official Standard: RFC 2822

$txt='<p class=FillText><a name="InternetMail_P3"></a>First.Last@company-name.com</p>'
$re="[a-z0-9!#\$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#\$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?"
[regex]::MAtch($txt, $re, "IgnoreCase ")

But there is also other part of this answer. Regex by nature is not very suitable to parse XML/HTML. You could find more details here: Using regular expressions to parse HTML: why not?

To provide real solution, I'm recomment first

  1. convert HTML → XHTML
  2. walk over XML tree
  3. work with individual nodes one by one, even using regex.
Community
  • 1
  • 1
Akim
  • 8,469
  • 2
  • 31
  • 49
  • Thanks. I ended up utilizing PoshHTTP (http://huddledmasses.org/get-web-another-round-of-wget-for-powershell/) to convert to XML in the script – gp80586 Jul 23 '12 at 15:44
2

When it comes to email validation I usually choose the short version of RFC 2822 being:

[a-z0-9!#$%&'*+/=?^_{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_{|}~-]+)*@(?:a-z0-9?.)+a-z0-9?

You can find more info about email validation here

Pierluc SS
  • 3,138
  • 7
  • 31
  • 44