Mixed results with perl regex, matching list of phrases in html code

Question

This post is a follow-up to another question, Perl Regex match lines that contain multiple words. I originally posted it in that thread, but it was deleted by the moderator for reasons unknown to me. It seemed logical to ask in the original thread, because my question concerns an attempt to use the solution given early on in that thread and a problem with it. The deletion came with a generic reference to the FAQ, which didn't seem to reveal any discrepancies, and the message, "If you have a question, please post your own question." Hence this post.

I am using LWP::Simple to get a web page and then trying to match lines that contain certain phrases. I copied the regex in answer #1 in the above-mentioned thread, and replaced/added words that I need to match, but I am getting mixed results with two similar but different web pages.

The regex I am using is:

/^(?=.*?\bYear\b)(?=.*?\bNew Moon\b)(?=.*?\bFirst Quarter\b)(?=.*?\bFull Moon\b)(?=.*?\bLast Quarter\b).*$/gim

For web site #1, which has bare lines containing these words, in a series of blocks surrounded by <pre>..</pre> tags, it matches all lines exactly equal to this one, as expected:

 Year        New Moon       First Quarter       Full Moon       Last Quarter

BUT for web site #2, which has nasty little tags surrounding the words:

<br><br><span class="prehead"> Year      New Moon       First Quarter       Full Moon       Last Quarter          &#916;T</span><br>

it matches EVERY line!

I'm sure the <span> tags are the "proper" way to do this but I am wondering how to get around those tags so I can have just one regex for both sites. Is there a simple way to do this or do I have to learn how to parse html (something I'd rather not have to do)?

I'm looking for a quick solution, not a robust one. This is probably a one-time-only deal. If these relatively static pages change, it will probably be minor and easy to fix. Please don't refer me to all the 'anti-regex-for-html' pages. I've seen 'em. And please don't make me use HTML::TreeBuilder. Oh please...

Obligatory cross-reference to [The HTML Regex Question](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags). Have you tried using an XML parser instead? — Roddy of the Frozen Peas, Mar 01 '13 at 19:37
Yah well, that's just the post I was referring to at the end of my OP. I found the rant/manifesto very amusing really, but not that helpful IMO. I did get a lot of laughs out of it though. And I did mention that I'd rather not have to try and use HTML::TreeBuilder. No no, please! — hmj6jmh, Mar 01 '13 at 19:42
This is a Perl question, not Python. I'm sure that works great, just not with Perl. Not easily, anyway. I tried to change the title so there is no ambiguity for people who don't read the first line of the OP, but I was not allowed to. :) — hmj6jmh, Mar 01 '13 at 19:49
@MikeM: That's exactly what you would expect but that is far from the case. See for yourself: [Phases of the Moon: 2001 to 2100](http://eclipse.gsfc.nasa.gov/phase/phases2001.html). Loop I'm using: `while(<FILE>) { next unless /^(?=.*?\bYear\b)(?=.*?\bNew Moon\b)(?=.*?\bFirst Quarter\b)(?=.*?\bFull Moon\b)(?=.*?\bLast Quarter\b).*$/gim; print; }` — hmj6jmh, Mar 01 '13 at 20:10
I tried your code on the HTML, and it matches the lines containing the words as expected. It does not match "EVERY line". Do you want to match the lines with your chosen words in span tags or not? — MikeM, Mar 01 '13 at 22:13

Jake Jackson · Answer 1 · 2013-03-01T21:37:20.233

If I am correct in my assumption, you would like to match only the specific sequence of words:

Year        New Moon       First Quarter       Full Moon       Last Quarter

with free spacing regardless of the tags at the ends.

We can use this to match any properly formatted opening or closing tag at either end:

<[^>]*?>

This matches any string between an opening "<" and the first closing ">". Because [^>] already excludes ">", the lazy "?" is harmless but not strictly needed.

Next we want to make sure we allow for spaces between those tags so we use the whitespace indicator "\s*" for zero or more whitespace at either end:

\s*<[^>]*?>\s*

Next we want to group that in a non-capturing (for efficiency) group and let it repeat zero or more times. This is what we will put at either end of the regex to make sure the tags are matched:

(?:\s*<[^>]*?>\s*)*

Then we fill in the desired text, using "\s*" between phrases so that nothing but whitespace is allowed between them. Note that "\s*" also accepts no whitespace at all; use "\s+" if at least one space should be required:

(?:\s*<[^>]*?>\s*)*\s*Year\s*New Moon\s*First Quarter\s*Full Moon\s*Last Quarter\s*(?:\s*<[^>]*?>\s*)*

Then finish off with the beginning-of-line and end-of-line anchors:

/^(?:\s*<[^>]*?>\s*)*\s*Year\s*New Moon\s*First Quarter\s*Full Moon\s*Last Quarter\s*(?:\s*<[^>]*?>\s*)*$/gim

This should match any lines containing an arbitrary number of tags at either end of the desired phrases, but not match if anything else comes in such as additional characters. It should also be pretty efficient because it doesn't use any look-arounds. Let me know if I misunderstood the question though.
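
As a quick check, here is a minimal sketch that runs the pattern above against the two sample lines quoted in the question. The variable names ($tags, $strict, $loose) are mine, and $loose is a hypothetical looser variant rather than part of the answer as originally written. It assumes the sample lines in the question are representative of the real pages.

```perl
use strict;
use warnings;

# Sample lines copied from the question.
my $bare = ' Year        New Moon       First Quarter       Full Moon       Last Quarter';
my $span = '<br><br><span class="prehead"> Year      New Moon       First Quarter       Full Moon       Last Quarter          &#916;T</span><br>';

# Zero or more tags, each optionally surrounded by whitespace.
my $tags = qr/(?:\s*<[^>]*?>\s*)*/;

# The pattern from this answer, laid out with /x ([ ] is a literal space).
my $strict = qr/^ $tags \s* Year \s* New[ ]Moon \s* First[ ]Quarter \s*
                  Full[ ]Moon \s* Last[ ]Quarter \s* $tags $/xi;

# Hypothetical variant: require whitespace between the phrases and tolerate
# trailing non-tag text (such as "&#916;T") before the closing tags.
my $loose = qr/^ $tags \s* Year \s+ New[ ]Moon \s+ First[ ]Quarter \s+
                 Full[ ]Moon \s+ Last[ ]Quarter [^<\n]* $tags $/xi;

for my $line ($bare, $span) {
    printf "strict=%s loose=%s\n",
        ($line =~ $strict ? 'yes' : 'no'),
        ($line =~ $loose  ? 'yes' : 'no');
}
```

The strict pattern accepts the bare line but rejects the span line, because "&#916;T" sits between "Last Quarter" and the closing tag and the pattern allows only whitespace and tags there. The looser variant accepts both sample lines.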

score 0 · Accepted Answer · answered Mar 02 '13 at 06:32

I finally got this working for both URLs using the original regex by splitting the retrieved HTML document into lines myself and looping over them directly:

for my $line (split qr/\R/, $doc)
{
    next unless $line =~ /^(?=.*?\bYear\b)(?=.*?\bNew Moon\b)(?=.*?\bFirst Quarter\b)(?=.*?\bFull Moon\b)(?=.*?\bLast Quarter\b).*$/gim; # original
    print "$line\n";
}

It really shouldn't be this difficult. ;-)
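
A plausible reason why this works, which I have not confirmed against the live pages: if one page ends its lines with a bare "\r" instead of "\n", reading it through a filehandle returns the whole document as a single record, that record matches the regex, and printing it looks like "every line" matched. The sketch below reproduces that effect on a made-up three-line string. The sample text and variable names are placeholders, and \R needs Perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a fetched page whose lines end in a bare "\r".
my $doc = join "\r",
    '<span class="prehead"> Year  New Moon  First Quarter  Full Moon  Last Quarter</span><br>',
    ' placeholder data row 1',
    ' placeholder data row 2';

# The original regex from the question (without the unneeded /g).
my $re = qr/^(?=.*?\bYear\b)(?=.*?\bNew Moon\b)(?=.*?\bFirst Quarter\b)(?=.*?\bFull Moon\b)(?=.*?\bLast Quarter\b).*$/im;

# Reading through an in-memory handle splits on $/ ("\n" by default),
# so a document without any "\n" arrives as one single record.
open my $fh, '<', \$doc or die "can't open in-memory handle: $!";
my @records = <$fh>;
close $fh;

# Splitting on \R treats "\n", "\r\n" and a bare "\r" alike.
my @lines = split /\R/, $doc;

my @hits_records = grep { $_ =~ $re } @records;
my @hits_lines   = grep { $_ =~ $re } @lines;

printf "records: %d, matching: %d\n", scalar @records, scalar @hits_records;
printf "lines:   %d, matching: %d\n", scalar @lines,   scalar @hits_lines;
```

With this sample, the filehandle yields one record that matches as a whole, while the split yields three lines of which only the header matches.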

score 0 · Answer 3 · answered Mar 02 '13 at 16:39

@Jake:

Hey thanks a lot for this. You are the person I am looking for. I tried it and it works with the first url but outputs nothing for the second one.

Using my original regex, I also tried stripping the html tags with HTML::TreeBuilder:

my $tree = HTML::TreeBuilder->new;
$tree->parse_file($doc);
my $non_html = $tree->as_text();
open FILE, "<", \$non_html or die "can't open $non_html: $!\n";

with no results for either url.

I tried HTML::Strip:

my $hs = HTML::Strip->new();
my $clean_text = $hs->parse($doc);
$hs->eof;
open FILE, "<", \$clean_text or die "can't open $clean_text: $!\n";

with same results as original--first url works as expected, second one outputs all (stripped) lines. Maybe there is a problem with my code here. I don't know.

Here is the essence of my script (this runs):

use strict;
use warnings;
use LWP::Simple;

my $url = 'http://eclipse.gsfc.nasa.gov/phase/phases2001.html';
#my $url = 'http://www.astropixels.com/ephemeris/moon/phases2001gmt.html';
my $doc = get $url;
die "Couldn't get $url" unless defined $doc;
open FILE, "<", \$doc or die "can't open $doc: $!\n";

while(my $line = <FILE>)
{
    #next unless $line =~ /^(?=.*?\bYear\b)(?=.*?\bNew Moon\b)(?=.*?\bFirst Quarter\b)(?=.*?\bFull Moon\b)(?=.*?\bLast Quarter\b).*$/gim; # original
    next unless $line =~ /^(?:\s*<[^>]*?>\s*)*\s*Year\s*New Moon\s*First Quarter\s*Full Moon\s*Last Quarter\s*(?:\s*<[^>]*?>\s*)*$/gim; # Jake's
    print "$line";
}
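
If the while loop over a filehandle is worth keeping, one option is to normalise the line endings before opening the in-memory handle. This is only a sketch under the assumption that bare "\r" line endings are what breaks the loop for one of the pages. header_lines is a hypothetical helper name, and $sample stands in for the result of `get $url` so that the code runs offline.

```perl
use strict;
use warnings;

# Hypothetical helper: return the header lines found in an HTML string.
sub header_lines {
    my ($doc) = @_;

    # Turn "\r\n" and bare "\r" into "\n" so that readline sees real lines.
    (my $text = $doc) =~ s/\r\n?/\n/g;

    open my $fh, '<', \$text or die "can't open in-memory handle: $!";
    my @found;
    while (my $line = <$fh>) {
        chomp $line;
        # Same lookahead idea as the original regex; /g and /m are not
        # needed when each line is tested once on its own.
        next unless $line =~ /^(?=.*\bYear\b)(?=.*\bNew Moon\b)(?=.*\bFirst Quarter\b)(?=.*\bFull Moon\b)(?=.*\bLast Quarter\b)/i;
        push @found, $line;
    }
    close $fh;
    return @found;
}

# Stand-in for the fetched page, using bare "\r" line endings.
my $sample = "<pre>\r Year  New Moon  First Quarter  Full Moon  Last Quarter\r placeholder data row\r</pre>\r";
print "$_\n" for header_lines($sample);
```

A lexical filehandle (`my $fh`) is used here instead of the bareword FILE so that the handle is closed and scoped predictably; that is a style choice, not a fix for the matching problem.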
