Regex to match HTML with or without link

Question

I would like to be able to get "Target" out of this block of HTML when it appears in a page:

<h3>
    <a href="http://link">              Target
    </a>            </h3>

I can count on the spacing being reliably there. What I can't count on is that "Target" will always be included in an anchor tag. Sometimes, it looks like this:

<h3>
                    Target
                </h3>

I can match the first version and extract "Target" pretty easily with this regex:

/<h3>\s+<a href=.*>\s+(.*)\s+<\/a>\s+<\/h3>/

But I'm struggling to write one that will match both. Any ideas?

Matching HTML tags with regex is a big no-no; see http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – nafas, Dec 08 '15 at 13:47
An HTML parser would let you simply enumerate the h3 elements and look for inner text of "Target" – Alex K., Dec 08 '15 at 13:48

score 9 · Answer 1 · answered Dec 08 '15 at 13:49

Don't use regular expressions to parse HTML. It is more painful than it is worth in most cases. Use a library designed to parse HTML.

#!/usr/bin/perl

use v5.16;
use strict;
use warnings;
use HTML::TreeBuilder;

my $data = qq{<body><h3>
<a href="http://link">              Target
</a>            </h3></body>
};

my $otherdata = qq{<body><h3>
              Target
            </h3></body>
};

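# Parse the snippet that wraps the text in a link. as_trimmed_text()
# returns the h3's text content with surrounding whitespace removed.
my $t = HTML::TreeBuilder->new_from_content($data);
say $t->look_down(_tag => "h3")->as_trimmed_text();

# The same lookup works unchanged when the link is absent.
$t = HTML::TreeBuilder->new_from_content($otherdata);
say $t->look_down(_tag => "h3")->as_trimmed_text();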
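
For readers outside Perl, the same idea can be sketched with Python's standard-library html.parser. This is only an illustration under assumptions: the class name H3Text and the helper h3_text are hypothetical, and the two snippets are the ones from the question, fed in without any surrounding body element.

```python
from html.parser import HTMLParser


class H3Text(HTMLParser):
    """Collect the text found inside each <h3>, including nested tags."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # greater than 0 while inside an <h3>
        self.parts = []     # text chunks of the h3 currently open
        self.results = []   # trimmed text of every finished h3

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "h3" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self.parts).strip())
                self.parts = []

    def handle_data(self, data):
        # Text inside <a> is still inside the h3, so it is collected too.
        if self.depth:
            self.parts.append(data)


def h3_text(snippet):
    """Return the trimmed text of every h3 in an HTML snippet."""
    parser = H3Text()
    parser.feed(snippet)
    parser.close()
    return parser.results


with_link = '<h3>\n    <a href="http://link">              Target\n    </a>            </h3>'
without_link = "<h3>\n                    Target\n                </h3>"

print(h3_text(with_link))
print(h3_text(without_link))
```

Because the parser tracks the h3 itself rather than what is nested in it, the code does not need to know whether the anchor is present.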

answered Dec 08 '15 at 13:49

Quentin


I'm not parsing a whole HTML document, just grabbing text out of little snippets. In this case, it makes sense, even if it's usually a bad practice. – John Chrysostom Dec 08 '15 at 14:31
HTML parsers are still usually better at handling little snippets of HTML than regex. – Quentin Dec 08 '15 at 14:34

score 0 · Answer 2 · answered Dec 08 '15 at 14:06

Just to put my two cents in: why not use an XPath query with a decent DOM library?

//html/body/h3//text()[contains(., 'Target')]

The actual query may vary depending on your HTML structure.
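
As a rough sketch of the tree-query approach, Python's standard-library xml.etree.ElementTree can locate the h3 elements with a path expression. Two assumptions to flag: ElementTree supports only a limited XPath subset (no text() or contains()), so the "contains" check is done in Python here, and it is an XML parser, so this only works when the snippet happens to be well-formed XML, as the two snippets from the question are. The helper name find_h3_text and the wrapper element root are hypothetical.

```python
import xml.etree.ElementTree as ET


def find_h3_text(snippet, needle):
    """Return the trimmed text of each h3 whose text contains needle."""
    # Wrap the snippet so the parser always sees a single root element.
    root = ET.fromstring(f"<root>{snippet}</root>")
    # itertext() walks all descendant text, so text inside <a> is included.
    texts = ("".join(h3.itertext()).strip() for h3 in root.iterfind(".//h3"))
    return [text for text in texts if needle in text]


with_link = '<h3>\n    <a href="http://link">              Target\n    </a>            </h3>'
without_link = "<h3>\n                    Target\n                </h3>"

print(find_h3_text(with_link, "Target"))
print(find_h3_text(without_link, "Target"))
```

A library with full XPath support could evaluate the query directly; for real-world HTML that is not well-formed, an HTML-aware parser is the safer choice.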

answered Dec 08 '15 at 14:06

Jan


score -1 · Accepted Answer · answered Dec 08 '15 at 13:48

Try this one as a regex:

<h3>\s+(<a href=.*>)?\s+(.*)\s+(<\/a>)?\s+<\/h3>

It should match both of your cases; "Target" ends up in the second capture group.

Even though regex is not the recommended way to search HTML, I won't stop you if that is what you want to try.
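
To check the pattern against both snippets from the question, here is a small sketch in Python's re module. ANSWER_PATTERN is the regex above with the "\/" escapes dropped, since Python has no pattern delimiters. TIGHTER_PATTERN is a hypothetical variant, not part of the original answer: it uses non-capturing groups so "Target" lands in group 1, and [^>]* so the anchor match cannot run past the end of the opening tag.

```python
import re

# The answer's pattern; "/" needs no escaping in Python.
ANSWER_PATTERN = re.compile(r'<h3>\s+(<a href=.*>)?\s+(.*)\s+(</a>)?\s+</h3>')

# Hypothetical tighter variant (an assumption, not from the answer):
# optional pieces are non-capturing and the text capture is non-greedy.
TIGHTER_PATTERN = re.compile(r'<h3>\s*(?:<a\b[^>]*>)?\s*(.*?)\s*(?:</a>)?\s*</h3>')

with_link = '<h3>\n    <a href="http://link">              Target\n    </a>            </h3>'
without_link = "<h3>\n                    Target\n                </h3>"

for snippet in (with_link, without_link):
    # The text is in group 2 of the answer's pattern, group 1 of the variant.
    print(ANSWER_PATTERN.search(snippet).group(2))
    print(TIGHTER_PATTERN.search(snippet).group(1))
```

Both patterns rely on "." not matching newlines, so they assume "Target" sits on a single line, as in the question.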

edited Dec 08 '15 at 13:55

answered Dec 08 '15 at 13:48

Victoria S.


Instead of just writing "Try this or that", it is always better to actually say *why* the OP should use it. – Jan Dec 08 '15 at 13:53
OP asks for help with his regex, I helped with his regex. What exactly should it explain other than "here's a regex you can try"? – Victoria S. Dec 08 '15 at 19:24
