I'm trying to extract URLs (there can be more than one) from the body of an email.

I'm trying to parse the URLs with this:

use strict;
use warnings;
use Net::IMAP::Simple;
use Email::Simple;
use IO::Socket::SSL;

# IMAP connection setup omitted here to save space

my $es = Email::Simple->new( join '', @{ $imap->get($i) } );
my $text = $es->body;
print $text;
my $matches = ($text =~/<a[^>]*href="([^"]*)"[^>]*>.*<\/a>/);
print $matches;

$text contains the following:

    --047d7b47229eb3d9f404e58fd90a
    Content-Type: text/plain; charset=ISO-8859-1

    Try1 <http://www.washingtonpost.com/>

    Try2 <http://www.thesun.co.uk/sol/homepage/>

    --047d7b47229eb3d9f404e58fd90a
    Content-Type: text/html; charset=ISO-8859-1

    <div dir="ltr"><a href="http://www.washingtonpost.com/">Try1</a><br><div><br></div><div><a href="http://www.thesun.co.uk/sol/homepage/">Try2</a><br></div></div>

    --047d7b47229eb3d9f404e58fd90a--

The program just prints 1 and nothing else.

Can anyone help?

Thanks in advance.

Andy Lester
snnifer
  • [*Don't* parse HTML with regexp](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). – Carsten Sep 04 '13 at 15:08
  • That module is useless if that's what it gives you for body. – ikegami Sep 04 '13 at 15:21
  • **Don't use regular expressions to parse HTML. Use a proper HTML parsing module.** You cannot reliably parse HTML with regular expressions, and you will face sorrow and frustration down the road. As soon as the HTML changes from your expectations, your code will be broken. See http://htmlparsing.com/php or [this SO thread](http://stackoverflow.com/questions/3577641/how-do-you-parse-and-process-html-xml-in-php) for examples of how to properly parse HTML with PHP modules that have already been written, tested and debugged. – Andy Lester Sep 04 '13 at 16:24

2 Answers

6

Email::Simple is not suitable for MIME messages. Use Courriel instead. Regexes are not suitable for HTML parsing. Use Web::Query instead.

use Courriel qw();
use Web::Query qw();

my $email = Courriel->parse( text => join …);
my $html = $email->html_body_part->content;  # html_body_part returns a part object; content is its decoded HTML
my @url = Web::Query->new_from_html($html)->find('a[href]')->attr('href');
__END__
http://www.washingtonpost.com/
http://www.thesun.co.uk/sol/homepage/
daxim
2

The advice that you've been given about using a different email handling module and not parsing HTML with regular expressions is all good and you should definitely heed it.

But no-one has yet explained why your code is giving you incorrect results.

It's because you are calling the match operator in scalar context. In scalar context it returns a boolean value indicating whether or not the match succeeded. Hence the 1 (true) that you are getting.
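A minimal sketch of the difference, using a made-up one-link string rather than the real email body (the variable names `$re`, `$matched` and `$url` are only for illustration):

```perl
use strict;
use warnings;

# Hypothetical sample input, not the real email body.
my $text = '<a href="http://www.washingtonpost.com/">Try1</a>';
my $re   = qr/<a[^>]*href="([^"]*)"[^>]*>.*<\/a>/;

# Scalar context: the match operator only reports success (1) or failure.
my $matched = ($text =~ $re);

# List context: the match operator returns the captured groups.
my ($url) = ($text =~ $re);

print "scalar: $matched\n";
print "list:   $url\n";
```

The first line printed shows 1, the second shows the captured URL. The only difference between the two assignments is the parentheses around the variable on the left-hand side.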

To get the captures from the regex match, you need to call the match operator in list context. This could be as simple as this:

my ($matches) = ($text =~/<a[^>]*href="([^"]*)"[^>]*>.*<\/a>/);

But you might consider using an array in case you ever want to add /g to the match operator and get multiple matches.

my @matches = ($text =~/<a[^>]*href="([^"]*)"[^>]*>.*<\/a>/g);
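One caveat, sketched here against the HTML line from the question: both links sit on the same line, so a greedy `.*` runs on to the last `</a>` and the first match swallows the second link. A non-greedy `.*?` stops at the nearest `</a>`. The names `@greedy` and `@lazy` are just illustrative, and this is still a regex workaround, not a substitute for a real HTML parser.

```perl
use strict;
use warnings;

# The HTML part from the question, all on one line.
my $html = '<div dir="ltr"><a href="http://www.washingtonpost.com/">Try1</a><br><div><br></div><div><a href="http://www.thesun.co.uk/sol/homepage/">Try2</a><br></div></div>';

# Greedy .* extends to the last </a> on the line, so /g finds only one match.
my @greedy = ($html =~ /<a[^>]*href="([^"]*)"[^>]*>.*<\/a>/g);

# Non-greedy .*? stops at the nearest </a>, so each link is matched separately.
my @lazy = ($html =~ /<a[^>]*href="([^"]*)"[^>]*>.*?<\/a>/g);

print scalar(@greedy), " greedy match(es): @greedy\n";
print scalar(@lazy), " non-greedy match(es): @lazy\n";
```

With this input the greedy version captures only the first URL, while the non-greedy version captures both.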
Dave Cross
  • I tried this, but when I print @matches I only get one of the URLs. Right now I don't know whether the other URL is in @matches or not... – snnifer Sep 04 '13 at 16:40
  • It works without the regexp, although the regexp only picks up one of the URLs – snnifer Sep 04 '13 at 16:51
  • Yep. That's the problem with parsing HTML with regexes. It's always harder than you think it is. Where you have `>.*<\/a>` you need to change it to `>.*?<\/a>`. Working out why is left as an exercise for the reader :-) – Dave Cross Sep 04 '13 at 17:52