Parse specific text from html using Perl

Question

I have an HTML page containing particular text that I want to parse into a database using a Perl script.

I want to strip off everything I don't need. An example of the HTML is:

<div class="postbody">
        <h3><a href="foo">Re: John Smith <span class="posthilit">England</span></a></h3>
        <div class="content">Is C# better than Visual Basic?</div>
    </div>

Therefore I would want to import the following into the database:

Name: John Smith.
Lives in: England.
Commented: Is C# better than Visual Basic?

I have started to write a Perl script, but it needs to be changed to do what I want:

    use DBI;

    open (FILE, "list") || die "couldn't open the file!";

    open (F1, ">list.csv") || die "couldn't open the file!";

    print F1 "Name\|Lives In\|Commented\n";

    while ($line=<FILE>)

    {

    chop($line);
    $text = "";
    $add = 0;
    open (DATA, $line) || die "couldn't open the data!";
    while ($data=<DATA>)

    {
    if ($data =~ /ds\-div/)
    {
    $data =~ s/\,//g;
    $data =~ s/\"//g;
    $data =~ s/\'//g;
    $text = $text . $data;
    }

    }

    @p = split(/\\/, $line);
    print F1 $p[2];
    print F1 ",";
    print F1 $p[1];
    print F1 ",";
    print F1 $p[1];
    print F1 ",";  

    print F1 "\n";
    $a = $a + 1;

Any input would be greatly appreciated.

You should not parse HTML with regular expressions, especially this sort of free-form HTML. Additionally, because of embedded newlines, I strongly suggest that you not read in the data line by line. Perhaps slurp the entire file, depending on how big it is, or change the input record separator to "" (paragraph mode). — Seth Robertson, Jul 06 '11 at 14:59
possible duplicate of [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) — Chris J, Jul 06 '11 at 15:20
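
The slurping suggestion in the first comment could look something like the sketch below. The helper name `slurp_file` and the temporary demo file are assumptions made for illustration; neither comes from the question.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: read a whole file into one string so that markup
# spanning several lines stays together.
sub slurp_file {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    local $/;                 # undefined record separator: read everything at once
    my $content = <$fh>;
    close $fh;
    return $content;
}

# Demo input (an assumption): a temporary file whose markup spans two lines.
my ( $tmp_fh, $tmp_path ) = tempfile( UNLINK => 1 );
print {$tmp_fh} qq{<div class="content">Is C# better\nthan Visual Basic?</div>\n};
close $tmp_fh;

my $html = slurp_file($tmp_path);
print "Read ", length($html), " characters as a single string\n";
```

Because the content arrives as one string, an element that is split across lines is still seen whole by whatever parser is used next.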

score 6 · Accepted Answer · edited May 23 '17 at 11:54

Please do not use regular expressions to parse HTML as HTML is not a regular language. Regular expressions describe regular languages.

It is easy to parse HTML with HTML::TreeBuilder (and its family of modules):

#!/usr/bin/env perl

use warnings;
use strict;

use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new_from_content(
    do { local $/; <DATA> }
);

for ( $tree->look_down( 'class' => 'postbody' ) ) {
    my $location = $_->look_down( 'class' => 'posthilit' )->as_trimmed_text;
    my $comment  = $_->look_down( 'class' => 'content' )->as_trimmed_text;
    my $name     = $_->look_down( '_tag'  => 'h3' )->as_trimmed_text;
    $name =~ s/^Re:\s*//;                    # drop the "Re: " prefix
    $name =~ s/\s*\Q$location\E\s*$//;       # drop the trailing location; \Q...\E matches it literally

    print "Name: $name\nLives in: $location\nCommented: $comment\n";
}

__DATA__
<div class="postbody">
    <h3><a href="foo">Re: John Smith <span class="posthilit">England</span></a></h3>
    <div class="content">Is C# better than Visual Basic?</div>
</div>

Output

Name: John Smith
Lives in: England
Commented: Is C# better than Visual Basic?

However, if you require much more control, have a look at HTML::Parser, as ADW has already suggested in another answer.
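
For comparison, here is a rough sketch of what the event-driven HTML::Parser approach might look like for the same sample. The handler names (`on_start`, `on_end`, `on_text`), the label names, and the stack bookkeeping are assumptions made for this illustration rather than anything prescribed by the module.

```perl
use strict;
use warnings;
use HTML::Parser;

my $html = <<'HTML';
<div class="postbody">
    <h3><a href="foo">Re: John Smith <span class="posthilit">England</span></a></h3>
    <div class="content">Is C# better than Visual Basic?</div>
</div>
HTML

my %field;   # text collected per label: heading, location, comment
my @open;    # stack of [tag name, label] for the elements currently open

sub on_start {
    my ( $tag, $attr ) = @_;
    my $class = defined $attr->{class} ? $attr->{class} : '';
    my $label = $tag eq 'h3'          ? 'heading'
              : $class eq 'posthilit' ? 'location'
              : $class eq 'content'   ? 'comment'
              :                         '';
    push @open, [ $tag, $label ];
}

sub on_end {
    my ($tag) = @_;
    # Drop everything down to the most recent matching open tag, if there is one.
    for my $i ( reverse 0 .. $#open ) {
        if ( $open[$i][0] eq $tag ) {
            splice @open, $i;
            last;
        }
    }
}

sub on_text {
    my ($text) = @_;
    # Credit the text to the innermost open element that carries a label.
    my ($label) = grep { $_ ne '' } map { $_->[1] } reverse @open;
    $field{$label} .= $text if defined $label;
}

my $parser = HTML::Parser->new(
    api_version => 3,
    start_h     => [ \&on_start, 'tagname, attr' ],
    end_h       => [ \&on_end,   'tagname' ],
    text_h      => [ \&on_text,  'dtext' ],
);
$parser->parse($html);
$parser->eof;

( my $name = $field{heading} ) =~ s/^Re:\s*//;
$name =~ s/\s+$//;

print "Name: $name\nLives in: $field{location}\nCommented: $field{comment}\n";
```

The extra control comes at the cost of tracking element nesting by hand, which the tree-based version above gets for free.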

Out of interest, I will be adding a database connection so that I can import the results straight in, which I'm fairly sure I know how to do. Will that be possible by adapting this new script? — Ebikeneser, Jul 06 '11 at 15:47
If the HTML tag patterns are like the one you posted, it should be fine; otherwise a few tweaks *may* be needed. — Alan Haggai Alavi, Jul 06 '11 at 15:59

score 4 · Answer 2 · answered Jul 06 '11 at 15:28

Use an HTML parser, like HTML::TreeBuilder, to parse the HTML--don't do it yourself.

Also, don't use two-arg open with global handles, don't use chop--use chomp (read the perldoc to understand why). Find yourself a newer tutorial. You are using a ton of OLD OLD OLD Perl. And damnit, USE STRICT and USE WARNINGS. I know you've been told to do this. Do it. Leaving it out will do nothing but buy you pain.

Go. Read. Modern Perl. It is free.
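
As a minimal sketch of the idioms recommended above (strict and warnings, three-argument open, a lexical file handle, and chomp instead of chop), the outer loop of the question's script might be rewritten along these lines. The temporary file stands in for the question's "list" file and its contents are assumed for the demo.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Demo input (an assumption): a stand-in for the question's "list" file.
# The last line deliberately has no trailing newline.
my ( $tmp_fh, $list_path ) = tempfile( UNLINK => 1 );
print {$tmp_fh} "page1.html\npage2.html";
close $tmp_fh;

# Three-argument open with a lexical handle; $! says why a failure happened.
open my $list_fh, '<', $list_path
    or die "Cannot open $list_path: $!";

my @pages;
while ( my $line = <$list_fh> ) {
    chomp $line;    # removes a trailing newline only; chop removes whatever the last character is
    push @pages, $line;
}
close $list_fh;

print "$_\n" for @pages;
```

With chop, the final entry would lose its last character because that line has no newline to remove; chomp leaves it intact.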

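use strict;
use warnings;
use HTML::TreeBuilder;

# Path of the HTML file to parse, taken from the command line.
my $file_name = shift @ARGV or die "Usage: $0 file.html\n";

my $page = HTML::TreeBuilder->new_from_file( $file_name );
$page->elementify;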

my @posts;
for my $post ( $page->look_down( class => 'postbody' ) ) {

    my %post = (
        name    => get_name($post),
        loc     => get_loc($post),
        comment => get_comment($post),
    );

    push @posts, \%post;
}

# Persist @posts however you want to.

sub get_name {
    my $post = shift;
    my $name = $post->look_down( _tag => 'h3' );
    return unless defined $name;

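    # Narrow the search to the link inside the heading.
    $name = $name->look_down( _tag => 'a' );
    return unless defined $name;

    # The first child of the link is the text before the <span>.
    $name = ($name->content_list)[0];
    return unless defined $name;

    $name =~ s/^Re:\s*//;
    $name =~ s/\s*$//;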

    return $name;
}

sub get_loc {
    my $post = shift;
    my $loc = $post->look_down( _tag => 'span', class => 'posthilit' );

    return unless defined $loc;

    return $loc->as_text;
}

sub get_comment {
    my $post = shift;
    my $com = $post->look_down( _tag => 'div', class => 'content' );

    return unless defined $com;

    return $com->as_text;
}

Now you have a nice data structure with all your post data. You can write it to CSV or a database or whatever it is you really want to do. You seem to be trying to do both.
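
If CSV is the target, one way to persist a structure shaped like `@posts` is sketched below. The helper `csv_field` is hypothetical, and the hand-rolled quoting is an assumption chosen to keep the sketch free of non-core modules; a dedicated CSV module from CPAN would normally be the safer choice. The second row is invented purely to exercise the quoting.

```perl
use strict;
use warnings;

# Sample data in the same shape as @posts.
my @posts = (
    { name => 'John Smith',    loc => 'England',   comment => 'Is C# better than Visual Basic?' },
    { name => 'Jane "JD" Doe', loc => 'Wales, UK', comment => 'It depends.' },   # invented row
);

# Hypothetical helper: wrap a field in double quotes and double any embedded
# quotes, so commas and quotes inside the value cannot break the row.
sub csv_field {
    my ($value) = @_;
    $value = '' unless defined $value;
    $value =~ s/"/""/g;
    return qq{"$value"};
}

my @lines = ( join ',', map { csv_field($_) } 'Name', 'Lives In', 'Commented' );
for my $post (@posts) {
    push @lines, join ',', map { csv_field( $post->{$_} ) } qw(name loc comment);
}

print "$_\n" for @lines;
```

Quoting every field also removes the need to strip commas and quotes from the data, which the script in the question does with substitutions.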

Great code. Yes, I intend to add it to a database straight from the command; however, I feel this may present a few stumbling blocks. If you can advise on which direction to take, it would be appreciated. — Ebikeneser, Jul 06 '11 at 15:55
@Lambo, to work with a database in Perl, use the DBI module. To get the best results, figure out what database you want to use (MySQL, Postgres, SQLite, etc), and make sure you have the driver (DBD::DbName) installed. Install and configure the DB as needed. Then figure out what the connection string (read the DBD::Foo docs) needs to be. Once you know that, you can get a db handle in your code. Now, you can run whatever SQL you want. Pay extra attention to the info on how to use *placeholders* in your SQL. If you get stuck, ask a question like "How do I use DBI to add data to a database?" — daotoad, Jul 07 '11 at 16:40
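
To make the DBI advice in the comment above concrete, here is a minimal sketch using placeholders. It assumes the DBD::SQLite driver with an in-memory database, and the table and column names are hypothetical; for another database, the connection string and credentials would change.

```perl
use strict;
use warnings;
use DBI;

# Assumption: SQLite via DBD::SQLite, in memory, so nothing is written to disk.
my $dbh = DBI->connect( 'dbi:SQLite:dbname=:memory:', '', '',
    { RaiseError => 1, AutoCommit => 1 } );

# Hypothetical table and column names.
$dbh->do('CREATE TABLE posts (name TEXT, location TEXT, comment TEXT)');

# Sample data in the same shape as @posts from the answer.
my @posts = (
    { name => 'John Smith', loc => 'England', comment => 'Is C# better than Visual Basic?' },
);

# Each ? is a placeholder; the driver binds the values, so no manual quoting is needed.
my $sth = $dbh->prepare(
    'INSERT INTO posts (name, location, comment) VALUES (?, ?, ?)'
);
$sth->execute( @{$_}{qw(name loc comment)} ) for @posts;

my ($count)  = $dbh->selectrow_array('SELECT COUNT(*) FROM posts');
my ($stored) = $dbh->selectrow_array(
    'SELECT name FROM posts WHERE location = ?', undef, 'England'
);
print "Rows stored: $count (first name: $stored)\n";

$dbh->disconnect;
```

Placeholders keep quotes and commas in the scraped text from interfering with the SQL, which matters for free-form forum comments.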

score 1 · Answer 3 · edited Jul 06 '11 at 15:27

You'd be much better off using the HTML::Parser module from CPAN.

Answered by ADW, Jul 06 '11 at 14:59; edited by Alan Haggai Alavi, Jul 06 '11 at 15:27.