Using Perl, how can I use a regex to take a string of arbitrary HTML that contains exactly one link with anchor text, like this:

  <a href="http://example.com" target="_blank">Whatever Example</a>

and keep ONLY that link, getting rid of everything else? It should work no matter which other attributes appear in the <a tag alongside href, such as title= or style=, and it should keep the anchor text ("Whatever Example") and the closing </a>.

Sinan Ünür
Richard Jones
    Can "whatever example" contain any HTML in it? If so, this is not a job for a regex. See here: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454. As long as the text of the link is guaranteed to be just text, it is a reasonable task for a regex. There are already a bazillion answers to this kind of question, though. You should look through them, make an attempt yourself, then ask a question if you run into a problem. –  May 15 '15 at 08:19

2 Answers

You can take advantage of a stream parser such as HTML::TokeParser::Simple:

#!/usr/bin/env perl

use strict;
use warnings;

use HTML::TokeParser::Simple;

my $html = <<EO_HTML;

Using Perl, how can I use a regex to take a string that has random HTML in it
with one HTML link with anchor, like this:

   <a href="http://example.com" target="_blank">Whatever <i>Interesting</i> Example</a>

       and it leave ONLY that and get rid of everything else? No matter what
   was inside the href attribute with the <a, like title=, or style=, or
   whatever. and it leave the anchor: "Whatever Example" and the </a>?
EO_HTML

my $parser = HTML::TokeParser::Simple->new(string => $html);

while (my $tag = $parser->get_tag('a')) {
    print $tag->as_is, $parser->get_text('/a'), "</a>\n";
}

Output:

$ ./whatever.pl
<a href="http://example.com" target="_blank">Whatever Interesting Example</a>
Sinan Ünür

If you need a simple regex solution, a naive approach might be:

my @anchors = $text =~ m@(<a\b[^>]*?>.*?</a>)@gsi;

However, as @dan1111 has mentioned, regular expressions are not the right tool for parsing HTML for various reasons.

If you need a reliable solution, look for an HTML parser module.
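
To make the naive approach concrete, here is a minimal sketch of that kind of pattern applied to a sample string. The input text is made up for illustration, and the pattern uses `\b` after `<a` so that tags that merely start with "a", such as `<abbr>`, are not matched. It still assumes that no attribute value contains a `>` character and that links are not nested.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical sample input; only the <a> element should survive.
my $text = <<'EO_HTML';
<p>Some <abbr title="x">random</abbr> HTML with
<a href="http://example.com" target="_blank">Whatever Example</a> inside.</p>
EO_HTML

# /g collects every match, /s lets . cross newlines, /i ignores tag case.
# \b keeps <abbr>, <article>, etc. from being treated as links.
my @anchors = $text =~ m@(<a\b[^>]*>.*?</a>)@gsi;

print "$_\n" for @anchors;
```

With this input, `@anchors` holds the single `<a ...>Whatever Example</a>` element, attributes included, and everything around it is dropped.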

Chris Smeele
    Regexes are a perfectly good tool for finding a single link in HTML, as long as the anchor text is known to not have any HTML in it. Because then you are just doing *matching*, not *parsing*. –  May 15 '15 at 14:14
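
As a rough illustration of the matching-only case described in this comment, the sketch below requires the anchor text to contain no `<` at all, so it matches a plain-text link and simply fails to match a link with nested markup instead of returning a mangled result. The input string and variable names are hypothetical, and the pattern assumes no attribute value contains `>`.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical input: arbitrary markup around exactly one plain-text link.
my $html = '<div><b>Intro</b> <a title="t" href="http://example.com">Whatever Example</a> <i>outro</i></div>';

# [^<]* for the anchor text means a link with nested tags does not match,
# which signals that a real HTML parser is needed for that input.
my ($link) = $html =~ m{(<a\b[^>]*>[^<]*</a>)}i;

print defined $link ? "$link\n" : "no plain-text link found\n";
```

If `$link` comes back undefined, the link text probably contains HTML, and a parser such as HTML::TokeParser::Simple is the safer choice.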