How can I extract information from an HTML file using Perl regular expressions?

Question

I have two files, one XML and one HTML, and I need to extract data from them based on certain patterns.

My XML file is well formatted, so I can use readline to read it line by line and search for data between tags.

if($line =~ /\<tag1\>$varvalue\<\/tag1\>/)

My HTML file, however, contains some of the worst markup I have seen. It looks like this:

<div class="theater">
    <h2>
    <a href="/showtimes/university-village-3" >**University Village 3**</a></h2>
    <div class="address">
        <i>**3323 South Hoover Street, Los Angeles CA 90007 | (213) 748-6321**</i>
    </div>
</div>

<div class="mtitle">
    <a href="/movie/dream-house-2011"  title="Dream House" onmouseover="mB(event, 771204354);"  >**Dream House**</a>
    <span>**(PG-13 , 1 hr. 31 min.)**</span>
</div>

<div class="times">

    **1:00 PM,**
</div>

Now, from this file I need to pick out the data shown in bold.

How can I use Perl regular expressions to extract this data from the file?

Take a look here: http://stackoverflow.com/questions/7612778/get-td-values-with-perl/7612978#7612978 — stivlo, Oct 16 '11 at 11:14
I don't see any `b` tags. Are the `**`-delimited chunks supposed to be shown in bold? — Greg Bacon, Oct 16 '11 at 11:38

score 6 · Answer 1 · edited May 23 '17 at 12:26

RegEx match open tags except XHTML self-contained tags

http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html

Using regular expressions to parse HTML: why not?

When you are done reading those come back :)

Edit: to actually solve your problem, take a look at this module:

http://perlmeme.org/tutorials/html_parser.html

A sample that parses an HTML file:

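#!/usr/local/bin/perl

use strict;
use warnings;
use HTML::TreeBuilder;

# Build a parse tree from the HTML file
my $tree = HTML::TreeBuilder->new;
$tree->parse_file('C:\Users\Stefanos\workspace\HTML_Parser_Test\test.html');

# Collect every <div> element in the document
my @divs = $tree->find('div');

# Free the tree's memory once you are done with @divs
$tree->delete;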

In this example I just used your tags as the main body of an .html file. The divs are stored in the @divs array. Since ** is not an element, I have no idea which text you want to find, so I can't help you further.

P.S. I had never used this module before and put this together in 5 minutes, so it is not hard to parse the HTML file and find whatever you want.
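To make "find whatever you want" a bit more concrete, here is a minimal sketch that pulls the theater name and address out of each `<div class="theater">`. It assumes HTML::TreeBuilder is installed and that the `**` markers in the question are only the asker's highlighting and are not present in the real file. The helper name `theaters_from_html` is made up for this illustration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: return one { theater, address } hash per
# <div class="theater"> found in the HTML string.
sub theaters_from_html {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my @records;

    # look_down in list context returns every matching element
    for my $div ($tree->look_down(_tag => 'div', class => 'theater')) {
        # look_down in scalar context returns the first match (or undef)
        my $link = $div->look_down(_tag => 'a');
        my $addr = $div->look_down(_tag => 'i');
        push @records, {
            theater => $link ? $link->as_text : undef,
            address => $addr ? $addr->as_text : undef,
        };
    }

    $tree->delete;    # only plain strings are kept, so the tree can go
    return @records;
}

# Sample input taken from the question, without the ** markers (assumption)
my $html = <<'HTML';
<div class="theater">
    <h2>
    <a href="/showtimes/university-village-3" >University Village 3</a></h2>
    <div class="address">
        <i>3323 South Hoover Street, Los Angeles CA 90007 | (213) 748-6321</i>
    </div>
</div>
HTML

for my $t (theaters_from_html($html)) {
    print "$t->{theater}: $t->{address}\n";
}
```

Because the parser tracks nesting, the inner `<div class="address">` does not confuse the lookup for the outer theater div.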

A regex to match a specific tag and store its contents in $1:

if ($subject =~ m!<tagname[^>]*>(.*?)</tagname>!s) {
    # Successful match
}

You will soon run into the limitations of this approach, though, once you have nested elements.

Replace tagname with the actual tag, e.g. in your case i, a, span, or div. For div, however, the match stops at the first closing </div>, so for the outer div you only get the contents up to the end of the inner div, which is not what you want.
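As a rough sketch of how that pattern could be applied to the sample from the question, the following uses only core Perl. The helper name `tag_contents` is invented for this example, the `\b` after the tag name is a small addition so that `i` does not also match tags such as `img`, and the `**` markers are assumed not to be part of the real file.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: return the contents of every <$tag ...>...</$tag>
# pair in $html. The match is non-greedy, so each one stops at the FIRST
# closing tag it meets.
sub tag_contents {
    my ($html, $tag) = @_;
    my @found;
    while ($html =~ m!<\Q$tag\E\b[^>]*>(.*?)</\Q$tag\E>!sg) {
        (my $text = $1) =~ s/\A\s+|\s+\z//g;    # trim both ends
        push @found, $text;
    }
    return @found;
}

# Sample input taken from the question, without the ** markers (assumption)
my $html = <<'HTML';
<div class="theater">
    <h2>
    <a href="/showtimes/university-village-3" >University Village 3</a></h2>
    <div class="address">
        <i>3323 South Hoover Street, Los Angeles CA 90007 | (213) 748-6321</i>
    </div>
</div>
<div class="mtitle">
    <a href="/movie/dream-house-2011"  title="Dream House" onmouseover="mB(event, 771204354);"  >Dream House</a>
    <span>(PG-13 , 1 hr. 31 min.)</span>
</div>
HTML

print "a:    $_\n" for tag_contents($html, 'a');
print "i:    $_\n" for tag_contents($html, 'i');
print "span: $_\n" for tag_contents($html, 'span');

# Nested elements break the pattern: the first <div> match ends at the
# inner </div>, so the outer "theater" div is cut short.
my ($first_div) = tag_contents($html, 'div');
print "first div match ends with: ", substr($first_div, -4), "\n";
```

This works for the flat a, i and span elements, but the last line shows the nesting problem: the match for the outer theater div ends right after `</i>`, at the closing tag of the inner address div.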

answered Oct 16 '11 at 11:09 · FailedDev · edited May 23 '17 at 12:26 · Community

The thing is i need to do it only by regex and it seems it is only possible by libraries and parsers? – typedefcoder2 Oct 16 '11 at 11:20
@typedef1 If you do it with regexes only your solution will only be able to address a very specific problem and could easily break. Why is it so bad to use a library? There most of the legwork has been done for you. – FailedDev Oct 16 '11 at 11:26
Requirements of my project... i have been smashing my head on diff things and combinations... though i read even Jon Skeet can not do it, i believe there must be something for me. @FailedDEv – typedefcoder2 Oct 16 '11 at 11:39
It's not whether or not Jon Skeet _can_ do it, he _wouldn't_. – Donal Fellows Oct 16 '11 at 12:00
@DonalFellows :) Anyways you have some clue / hint for me? – typedefcoder2 Oct 16 '11 at 12:07
@Failed - what do you mean by "requirements"? There are very very few cases where there's a requirement NOT to use a correct library. – DVK Oct 16 '11 at 12:19
@DVK The project requires implementation of regex only. The BOSS wants it in a tough way ... – typedefcoder2 Oct 16 '11 at 12:35
@type - did you talk to the boss and explain that you should never parse HTML - which is a non-regular grammar - with Regular Expressions? And that if you use a proper well tested library, you will Get It Right, whereas if you write your own half-assed parser, it will most likely contain bugs (it will, trust me), AND be extremely costly to debug like any complicated RegEx? – DVK Oct 16 '11 at 12:37
@FailedDev - meh. Sorry. Too early in the morning. – DVK Oct 16 '11 at 12:38
@DVK Well yeah, and they said they want us to have a proper understanding of regex .. that's why. Somehow this thing is solvable via regex i guess. So its morning now.. :( seems another sleepless night went figuring out these things – typedefcoder2 Oct 16 '11 at 12:45
@typedef1 If you INSIST solving this with regex you have to provide more input. As it stands now I have no idea what you consider bold? If it's only the stuff inside ** ** this is a joke. I guess you want to take the content out of specific tags.. – FailedDev Oct 16 '11 at 12:47
Yes indeed i do want to take the content out of specific tags only.... the one between ** ** @DVK – typedefcoder2 Oct 16 '11 at 13:36
@typedef1 If this indeed is homework, you should have mentioned that upfront, not in the umpteenth comment. – Sinan Ünür Oct 17 '11 at 01:41
@SinanÜnür OK i will take care of that in future.. but would that have been beneficial? I just wanted to know what difference it could have made.. Thanks – typedefcoder2 Oct 17 '11 at 05:53

score 0 · Answer 2 · edited May 23 '17 at 09:59

Parsing XML and HTML using regular expressions is a fool's errand. There are many simple to use Perl modules for parsing HTML. Here is something using HTML::TokeParser::Simple. I've omitted the code to associate movies and showtimes with theaters (because I have no intention of building an appropriate input file):

#!/usr/bin/env perl

use strict; use warnings;
use HTML::TokeParser::Simple;

my $parser = HTML::TokeParser::Simple->new(handle => \*DATA);

my @theaters;

while (my $div = $parser->get_tag('div')) {
    my $class = $div->get_attr('class');
    next unless defined($class) and $class eq 'theater';

    my %record;

    $record{theater} = $parser->get_text('/a');
    $record{address} = $parser->get_text('/i');

    # /g so that both leading and trailing whitespace are removed
    s{(?:^\s+)|(?:\s+\z)}{}g for values %record;

    push @theaters, \%record;
}

use YAML;
print Dump \@theaters;

__DATA__
<div class="theater">
    <h2>
    <a href="/showtimes/university-village-3" >**University Village 3**</a></h2>
    <div class="address">
        <i>**3323 South Hoover Street, Los Angeles CA 90007 | (213) 748-6321**</i>
    </div>
</div>

<div class="mtitle">
    <a href="/movie/dream-house-2011"  title="Dream House" onmouseover="mB(event, 771204354);"  >**Dream House**</a>
    <span>**(PG-13 , 1 hr. 31 min.)**</span>
</div>

<div class="times">

    **1:00 PM,**
</div>

<div class="theater">
    <h2>
    <a href="/showtimes/university-village-3" >**Some other theater*</a></h2>
    <div class="address">
        <i>**1234 South Hoover Street, St Paul, MN 99999 | (999) 748-6321**</i>
    </div>
</div>

Output:

[sinan@macardy]:~/tmp> ./tt.pl
---
- address: '**3323 South Hoover Street, Los Angeles CA 90007 | (213) 748-6321**'
  theater: '**University Village 3**'
- address: '**1234 South Hoover Street, St Paul, MN 99999 | (999) 748-6321**'
  theater: '**Some other theater*'
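For the part omitted above, one possible way to associate movies and showtimes with theaters is sketched below. It rests on an assumption the question does not confirm: that every `mtitle` and `times` div belongs to the most recent `theater` div before it in the file. The function name `showtimes_by_theater` and the hash keys are invented for this illustration, and the `**` markers are left out of the sample input.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Hypothetical helper. ASSUMPTION: mtitle/times divs belong to the
# closest preceding theater div in document order.
sub showtimes_by_theater {
    my ($html) = @_;
    my $parser = HTML::TokeParser::Simple->new(string => $html);
    my (@theaters, $movie);

    while (my $div = $parser->get_tag('div')) {
        my $class = $div->get_attr('class');
        $class = '' unless defined $class;

        if ($class eq 'theater') {
            $movie = undef;    # movies seen so far belong to the old theater
            push @theaters, {
                theater => $parser->get_trimmed_text('/a'),
                address => $parser->get_trimmed_text('/i'),
                movies  => [],
            };
        }
        elsif ($class eq 'mtitle' and @theaters) {
            $movie = {
                title => $parser->get_trimmed_text('/a'),
                info  => $parser->get_trimmed_text('/span'),
                times => '',
            };
            push @{ $theaters[-1]{movies} }, $movie;
        }
        elsif ($class eq 'times' and $movie) {
            $movie->{times} = $parser->get_trimmed_text('/div');
        }
    }
    return \@theaters;
}

my $html = <<'HTML';
<div class="theater">
    <h2>
    <a href="/showtimes/university-village-3" >University Village 3</a></h2>
    <div class="address">
        <i>3323 South Hoover Street, Los Angeles CA 90007 | (213) 748-6321</i>
    </div>
</div>
<div class="mtitle">
    <a href="/movie/dream-house-2011"  title="Dream House" onmouseover="mB(event, 771204354);"  >Dream House</a>
    <span>(PG-13 , 1 hr. 31 min.)</span>
</div>
<div class="times">

    1:00 PM,
</div>
HTML

for my $t (@{ showtimes_by_theater($html) }) {
    print "$t->{theater}\n";
    print "  $_->{title} $_->{info}: $_->{times}\n" for @{ $t->{movies} };
}
```

If the real page nests or orders these divs differently, the grouping logic would need to change accordingly.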

To match this code University Village 3 if($line =~ /\ – typedefcoder2 Oct 17 '11 at 12:36
