Extract text from HTML - Perl using HTML::TreeBuilder

Question

I'm trying to read .html files and extract the text inside their <p> tags. Logically, my code below should work: using HTML::TreeBuilder, I parse the HTML and then extract the text in <p> using find_by_attribute("p"). But my script produces empty directories. Did I leave anything out?

#!/usr/bin/perl

use strict;
use HTML::TreeBuilder 3;
use FileHandle;

my @task = ('ar','cn','en','id','vn');

foreach my $lang (@task) {
mkdir "./extract_$lang", 0777 unless -d "./extract_$lang";
opendir (my $dir, "./$lang/") or die "$!";
my @files = grep (/\.html/,readdir ($dir));
closedir ($dir);

foreach my $file (@files) {
    open (my $fh, '<', "./$lang/$file") or die "$!";
    my $root = HTML::TreeBuilder->new;
    $root->parse_file("./$lang/$file");
    my @all_p = $root->find_by_attribute("p");
    foreach my $p (@all_p) {
        my $ptag = HTML::TreeBuilder->new_from_content ($p->as_HTML);
        my $filewrite = substr($file, 0, -5); 
        open (my $outwrite, '>>', "extract_$lang/$filewrite.txt") or die $!;
        print $outwrite $ptag->as_text . "\n";  
        my $pcontents = $ptag->as_text;
        print $pcontents . "\n";
        close (outwrite);
    }
close (FH);
}
}

My .html files are plain HTML pages saved from .asp websites, e.g. http://www.singaporemedicine.com/vn/hcp/med_evac_mtas.asp

My .html files are saved in:

./ar/*
./cn/*
./en/*
./id/*
./vn/*

daxim · Accepted Answer · 2011-12-20T10:37:27.763

5

You are confusing element with attribute. The program can be written much more concisely:

#!/usr/bin/env perl
use strictures;
use File::Glob qw(bsd_glob);
use Path::Class qw(file);
use URI::file qw();
use Web::Query qw(wq);
use autodie qw(:all);

foreach my $lang (qw(ar cn en id vn)) {
    mkdir "./extract_$lang", 0777 unless -d "./extract_$lang";
    foreach my $file (bsd_glob "./$lang/*.html") {
        my $basename = file($file)->basename;
        $basename =~ s/[.]html$/.txt/;
        open my $out, '>>:encoding(UTF-8)', "./extract_$lang/$basename";
        $out->say($_) for wq(URI::file->new_abs($file))->find('p')->text;
        close $out;
    }
}

edited Dec 20 '11 at 10:37

answered Dec 19 '11 at 13:26

daxim


I'm getting a warning message `Wide character in print at extract.pl line 24.` Is there a limitation to the TreeBuilder? It still prints out even though perl gives the warning, right? – alvas Dec 19 '11 at 13:33
You must specify the text output encoding. Compare how I open the output file with how you do it. Learn about the topic of encoding in Perl at http://p3rl.org/UNI – daxim Dec 19 '11 at 13:51
I tried your code but got a compilation error at `use strictures`, and I'm getting errors at the other `use` statements too. Do I need to install a new perl to make them work? – alvas Dec 19 '11 at 14:18
Error: `Can't locate strictures.pm in @INC (@INC contains: /etc/perl /usr/local/lib/perl/5.12.4 /usr/local/share/perl/5.12.4 /usr/lib/perl5 /usr/share/perl5 /usr/lib/perl/5.12 /usr/share/perl/5.12 /usr/local/lib/site_perl .) at extract-daxim.pl line 3.` – alvas Dec 19 '11 at 15:23
Replace `use strictures;` with `use strict; use warnings;` Or install the `strictures` distribution from CPAN. – friedo Dec 19 '11 at 15:43
From the [Stack Overflow Perl FAQ](http://stackoverflow.com/questions/tagged/perl?sort=faq): [What's the easiest way to install a missing Perl module?](http://stackoverflow.com/q/65865) – daxim Dec 19 '11 at 15:58
ah now i get it! install CPAN on my machine. – alvas Dec 19 '11 at 16:23
@daxim, I'm still getting an error that says `Bareword found where operator expected at extract-daxim.pl line 14, near "s/[.]html$/.txt/r"`. Is it the tilde ~? – alvas Dec 20 '11 at 05:59
I don't understand this part of the code: `. file($file)->basename =~ s/[.]html$/.txt/r;`. `file` and `basename` have no `my` or `$`. Are they from Path::Class and File::Glob? – alvas Dec 20 '11 at 06:01
2er0, [the `r` modifier is available only in Perl 5.13.2 or better](http://p3rl.org/perl5132delta#Non-destructive-substitution). Either install an additional stable Perl 5.14.x ([perlbrew](http://perlbrew.pl) makes this easy), or use the revised version of the code above that works around the lack of `r`. — `file` is an exported function, `basename` is a method. Both are documented in [Path::Class::File](http://p3rl.org/Path::Class::File). Please read an intermediate level Perl teaching book to understand the underlying concepts. – daxim Dec 20 '11 at 10:44
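
The two points raised in these comments (the missing `r` modifier on older perls and the "Wide character in print" warning) can be illustrated with core Perl only. This is a minimal sketch, not the answerer's code: the helper names `txt_name` and `append_utf8` are hypothetical and introduced only for illustration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: turn "page.html" into "page.txt" without the /r
# modifier, so it also works on perls that lack non-destructive substitution.
sub txt_name {
    my ($name) = @_;
    (my $copy = $name) =~ s/[.]html$/.txt/;    # copy first, then substitute on the copy
    return $copy;
}

# Hypothetical helper: append lines to a file through an explicit UTF-8
# output layer, which is what avoids the "Wide character in print" warning.
sub append_utf8 {
    my ($path, @lines) = @_;
    open my $out, '>>:encoding(UTF-8)', $path or die "Cannot open $path: $!";
    print {$out} "$_\n" for @lines;
    close $out or die "Cannot close $path: $!";
    return;
}

print txt_name('med_evac_mtas.html'), "\n";    # prints "med_evac_mtas.txt"
```

In the extraction loop, the two helpers would be combined along the lines of `append_utf8("extract_$lang/" . txt_name($file), @texts)`.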

score 3 · Answer 2 · answered Dec 19 '11 at 13:05

Use find_by_tag_name to search for tag names, not find_by_attribute.

choroba


score 3 · Answer 3 · answered Dec 19 '11 at 13:05

You want find_by_tag_name, not find_by_attribute:

my @all_p = $root->find_by_tag_name("p");

From the docs:

$h->find_by_tag_name('tag', ...)

In list context, returns a list of elements at or under $h that have any of the specified tag names. In scalar context, returns the first (in pre-order traversal of the tree) such element found, or undef if none.
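
As a rough illustration of the list-context behaviour described above, the sketch below collects the text of every `p` element from an HTML string. It assumes HTML::TreeBuilder is installed from CPAN; the helper name `paragraph_texts` and the sample markup are made up for the example. Note that `as_text` can be called directly on each returned element, so re-parsing it with `new_from_content` as in the question should not be necessary.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: return the text of every <p> element in an HTML string.
sub paragraph_texts {
    my ($html) = @_;
    my $root = HTML::TreeBuilder->new_from_content($html);
    # find_by_tag_name matches element (tag) names; in list context it
    # returns every matching element at or under $root.
    my @texts = map { $_->as_text } $root->find_by_tag_name('p');
    $root->delete;    # release the tree once the text has been copied out
    return @texts;
}

# Made-up sample document with one <p> nested inside a <div>.
my $sample = '<html><body><p>First</p><div><p>Second</p></div></body></html>';
print "$_\n" for paragraph_texts($sample);
```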

Eugene Yarmash


Does it mean that if the <p> tag is embedded, I might need to rerun the loop again? e.g. `<p>...<p>...<\p>...<\p>` – alvas Dec 19 '11 at 13:13
@2er0 this method will return all `p` elements at once. You can use it on the resulting elements in turn to find nested `p`s. – Eugene Yarmash Dec 19 '11 at 13:29

score 1 · Answer 4 · answered Dec 19 '11 at 15:34

You might want to take a look at Mojo::DOM which lets you use CSS selectors.
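
A minimal sketch of what that could look like, assuming the Mojolicious distribution (which provides Mojo::DOM) is installed from CPAN. The helper name `paragraph_texts_css` and the sample markup are hypothetical; `all_text` is used here so that text inside child elements of each `p` is included.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical helper: return the text of every element matched by the
# CSS selector "p" in an HTML string.
sub paragraph_texts_css {
    my ($html) = @_;
    # find() returns a Mojo::Collection; map('all_text') calls all_text on
    # each element and each() (without a callback) returns a plain list.
    return Mojo::DOM->new($html)->find('p')->map('all_text')->each;
}

# Made-up sample document with one <p> nested inside a <div>.
my $sample = '<html><body><p>First</p><div><p>Second</p></div></body></html>';
print "$_\n" for paragraph_texts_css($sample);
```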

Alexander Hartmaier

