I would like to be able to get "Target" out of this block of HTML when it appears in a page:

<h3>
    <a href="http://link">              Target
    </a>            </h3>

I can count on the spacing being reliably there. What I can't count on is that "Target" will always be included in an anchor tag. Sometimes, it looks like this:

<h3>
                    Target
                </h3>

I can match the first version and extract "Target" pretty easily with this regex:

/<h3>\s+<a href=.*>\s+(.*)\s+<\/a>\s+<\/h3>/

But I'm struggling to write one that will match both. Any ideas?

John Chrysostom
  • Matching HTML tags with regex is a big no-no. See http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – nafas Dec 08 '15 at 13:47
  • An HTML parser would allow you to simply enumerate H3 looking for inner text of "Target" – Alex K. Dec 08 '15 at 13:48

3 Answers

Don't use regular expressions to parse HTML. It is more painful than it is worth in most cases. Use a library designed to parse HTML.

#!/usr/bin/perl

use v5.16;
use strict;
use warnings;
use HTML::TreeBuilder;

my $data = qq{<body><h3>
<a href="http://link">              Target
</a>            </h3></body>
};

my $otherdata = qq{<body><h3>
              Target
            </h3></body>
};

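# Parse the first snippet and print the text content of its <h3>.
my $t = HTML::TreeBuilder->new_from_content($data);
say $t->look_down(_tag => "h3")->as_text();

# Same lookup on the snippet without the anchor tag.
$t = HTML::TreeBuilder->new_from_content($otherdata);
say $t->look_down(_tag => "h3")->as_text();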
Quentin
  • I'm not parsing a whole HTML document, just grabbing text out of little snippets. In this case, it makes sense, even if it's usually a bad practice. – John Chrysostom Dec 08 '15 at 14:31
  • HTML parsers are still usually better at handling little snippets of HTML than regex. – Quentin Dec 08 '15 at 14:34

Just to put my two cents in, why not use an XPath query with a decent DOM library?

//html/body/h3/text()[contains(.,'Target')]

The actual query may vary depending upon your HTML structure.
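
For what it's worth, here is a rough sketch of how an XPath lookup could be run from Perl. It assumes the CPAN module HTML::TreeBuilder::XPath is installed. The helper name `h3_text` and the sample snippets are made up for illustration. It also uses a looser query (`//h3`) than the one above and reads the element's whole text, since in one of the two cases the text sits inside the anchor rather than directly under the h3.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;    # CPAN module, assumed to be installed

# Hypothetical helper: return the trimmed text of the first <h3>, or undef.
sub h3_text {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

    # findnodes returns the matching elements in list context.
    my ($h3) = $tree->findnodes('//h3');
    my $text = defined $h3 ? $h3->as_text : undef;

    $tree->delete;               # free the parse tree
    return undef unless defined $text;

    $text =~ s/^\s+|\s+$//g;     # strip leading and trailing whitespace
    return $text;
}

# Sample inputs mirroring the two shapes from the question.
my $with_link    = qq{<h3>\n    <a href="http://link">    Target\n    </a>    </h3>};
my $without_link = qq{<h3>\n        Target\n    </h3>};

print h3_text($with_link),    "\n";
print h3_text($without_link), "\n";
```

Taking the text of the whole h3 means the same call handles both the linked and the unlinked form, at the cost of also picking up any other text that might appear inside the heading.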

Jan

Try this one as a regex:

<h3>\s+(<a href=.*>)?\s+(.*)\s+(<\/a>)?\s+<\/h3>

It should match both your cases.

Even though this is not a recommended way to search html, if this is what you want to try, I won't stop you.
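
If you do go the regex route, a variant with non-capturing groups keeps the text in `$1` for both cases (in the pattern above it ends up in the second capture group). This is only a sketch against the two snippets from the question, not a general HTML matcher. The names `$h3_text` and `extract_h3_text` are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample inputs mirroring the two shapes from the question.
my $with_link = <<'HTML';
<h3>
    <a href="http://link">              Target
    </a>            </h3>
HTML

my $without_link = <<'HTML';
<h3>
                    Target
                </h3>
HTML

# /x lets the pattern carry whitespace and comments.
my $h3_text = qr{
    <h3> \s*
    (?: <a \s [^>]* > \s* )?    # optional opening anchor, not captured
    ( \S (?: .*? \S )? )        # the text itself, without surrounding spaces
    \s*
    (?: </a> \s* )?             # optional closing anchor
    </h3>
}x;

# Hypothetical helper: return the captured text, or undef if nothing matches.
sub extract_h3_text {
    my ($html) = @_;
    return $html =~ $h3_text ? $1 : undef;
}

print extract_h3_text($with_link),    "\n";   # prints "Target"
print extract_h3_text($without_link), "\n";   # prints "Target"
```

Using `[^>]*` instead of `.*` for the anchor's attributes stops the match at the end of the opening tag, and `\s*` instead of `\s+` means the pattern does not rely on a minimum amount of spacing.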

Victoria S.
  • Instead of just writing "Try this or that", it is always better to actually say *why* the OP should use it. – Jan Dec 08 '15 at 13:53
  • OP asks for help with his regex, I helped with his regex. What exactly should it explain other than "here's a regex you can try"? – Victoria S. Dec 08 '15 at 19:24