0

I'm trying to read .html files and extract the text inside their <p> tags. As far as I can tell, the code below should work: I parse the HTML with HTML::TreeBuilder, then extract the text in <p> using find_by_attribute("p"). But my script produces empty directories. Did I leave anything out?

#!/usr/bin/perl

use strict;
use HTML::TreeBuilder 3;
use FileHandle;

my @task = ('ar','cn','en','id','vn');

foreach my $lang (@task) {
mkdir "./extract_$lang", 0777 unless -d "./extract_$lang";
opendir (my $dir, "./$lang/") or die "$!";
my @files = grep (/\.html/,readdir ($dir));
closedir ($dir);

foreach my $file (@files) {
    open (my $fh, '<', "./$lang/$file") or die "$!";
    my $root = HTML::TreeBuilder->new;
    $root->parse_file("./$lang/$file");
    my @all_p = $root->find_by_attribute("p");
    foreach my $p (@all_p) {
        my $ptag = HTML::TreeBuilder->new_from_content ($p->as_HTML);
        my $filewrite = substr($file, 0, -5); 
        open (my $outwrite, '>>', "extract_$lang/$filewrite.txt") or die $!;
        print $outwrite $ptag->as_text . "\n";  
        my $pcontents = $ptag->as_text;
        print $pcontents . "\n";
        close (outwrite);
    }
close (FH);
}
}

My .html files are plain HTML pages saved from .asp websites, e.g. http://www.singaporemedicine.com/vn/hcp/med_evac_mtas.asp

My .html files are saved in:

./ar/*
./cn/*
./en/*
./id/*
./vn/*
alvas

4 Answers

5

You are confusing element with attribute: p is an element (tag) name, not an attribute. The program can also be written much more concisely:

#!/usr/bin/env perl
use strictures;
use File::Glob qw(bsd_glob);
use Path::Class qw(file);
use URI::file qw();
use Web::Query qw(wq);
use autodie qw(:all);

foreach my $lang (qw(ar cn en id vn)) {
    mkdir "./extract_$lang", 0777 unless -d "./extract_$lang";
    foreach my $file (bsd_glob "./$lang/*.html") {
        my $basename = file($file)->basename;
        $basename =~ s/[.]html$/.txt/;
        open my $out, '>>:encoding(UTF-8)', "./extract_$lang/$basename";
        $out->say($_) for wq(URI::file->new_abs($file))->find('p')->text;
        close $out;
    }
}
daxim
  • I'm getting a warning message `Wide character in print at extract.pl line 24.` Is this a limitation of TreeBuilder? It still prints the output even though perl gives the warning, right? – alvas Dec 19 '11 at 13:33
  • You must specify the text output encoding. Compare how I open the output file with how you do it. Learn about the topic of encoding in Perl at http://p3rl.org/UNI – daxim Dec 19 '11 at 13:51
  • I tried your code but got a compilation error at `use strictures`, and I'm getting errors at the other `use` statements too. Do I need to install a newer perl to make them work? – alvas Dec 19 '11 at 14:18
  • Error: `Can't locate strictures.pm in @INC (@INC contains: /etc/perl /usr/local/lib/perl/5.12.4 /usr/local/share/perl/5.12.4 /usr/lib/perl5 /usr/share/perl5 /usr/lib/perl/5.12 /usr/share/perl/5.12 /usr/local/lib/site_perl .) at extract-daxim.pl line 3.` – alvas Dec 19 '11 at 15:23
  • Replace `use strictures;` with `use strict; use warnings;` Or install the `strictures` distribution from CPAN. – friedo Dec 19 '11 at 15:43
  • From the [Stack Overflow Perl FAQ](http://stackoverflow.com/questions/tagged/perl?sort=faq): [What's the easiest way to install a missing Perl module?](http://stackoverflow.com/q/65865) – daxim Dec 19 '11 at 15:58
  • Ah, now I get it! Install CPAN on my machine. – alvas Dec 19 '11 at 16:23
  • @daxim, I'm still getting an error that says `Bareword found where operator expected at extract-daxim.pl line 14, near "s/[.]html$/.txt/r"`. Is it the tilde `~`? – alvas Dec 20 '11 at 05:59
  • I don't understand this part of the code: ` . file($file)->basename =~ s/[.]html$/.txt/r;`. `file` and `basename` have no `my` or `$`; are they from Path::Class and File::Glob? – alvas Dec 20 '11 at 06:01
  • 2er0, [the `r` modifier is available only in Perl 5.13.2 or better](http://p3rl.org/perl5132delta#Non-destructive-substitution). Either install an additional stable Perl 5.14.x ([perlbrew](http://perlbrew.pl) makes this easy), or use the revised version of the code above that works around the lack of `r`. — `file` is an exported function, `basename` is a method. Both are documented in [Path::Class::File](http://p3rl.org/Path::Class::File). Please read an intermediate level Perl teaching book to understand the underlying concepts. – daxim Dec 20 '11 at 10:44
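
The two points raised in these comments (the "Wide character" warning and the missing `r` modifier) can be illustrated with core Perl only. This is a minimal sketch, not the answer's actual code: the file name `page.html`, the sample text, and the temporary directory are assumptions made up for the example.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Derive "page.txt" from "page.html" without the /r modifier:
# copy the string first, then substitute inside the copy.
my $file = 'page.html';                      # hypothetical input name
(my $basename = $file) =~ s/[.]html$/.txt/;

# A temporary directory stands in for extract_$lang (assumption).
my $dir  = tempdir(CLEANUP => 1);
my $path = "$dir/$basename";

# Text with characters above U+00FF. Printing such a string to a handle
# that has no encoding layer triggers "Wide character in print".
my $text = "Ti\x{1EBF}ng Vi\x{1EC7}t";

# Declare the output encoding when opening the file.
open my $out, '>:encoding(UTF-8)', $path or die "Cannot open $path: $!";
print {$out} "$text\n";
close $out or die "Cannot close $path: $!";

# Read the raw bytes back to see what was actually written.
open my $in, '<:raw', $path or die "Cannot open $path: $!";
my $bytes = do { local $/; <$in> };
close $in;

printf "%s holds %d bytes for %d characters\n",
    $basename, length $bytes, length "$text\n";
```

The copy-then-substitute idiom works on any Perl 5, so it avoids the version requirement mentioned above, and the explicit `:encoding(UTF-8)` layer is the difference between the two `open` calls that the earlier comment points at.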
3

Use find_by_tag_name to search for tag names, not find_by_attribute.

choroba
3

You want find_by_tag_name, not find_by_attribute:

my @all_p = $root->find_by_tag_name("p");

From the docs:

$h->find_by_tag_name('tag', ...)

In list context, returns a list of elements at or under $h that have any of the specified tag names. In scalar context, returns the first (in pre-order traversal of the tree) such element found, or undef if none.
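
To make the difference between the two methods concrete, here is a small sketch, assuming HTML::TreeBuilder is installed. The inline HTML string is a made-up stand-in for one of the saved files, and the `id="p"` attribute exists only to give find_by_attribute something to match.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample input standing in for one saved .html file.
my $html = '<html><body><p>First <b>paragraph</b>.</p>'
         . '<div id="p">not a paragraph</div><p>Second.</p></body></html>';

my $root = HTML::TreeBuilder->new_from_content($html);

# Match by tag name: every <p> element at or under the root.
my @by_tag = $root->find_by_tag_name('p');

# Match by attribute: elements whose id attribute equals "p".
my @by_attr = $root->find_by_attribute('id', 'p');

# as_text works on any element, so each <p> does not need to be re-parsed.
my @paragraphs = map { $_->as_text } @by_tag;
my @attr_texts = map { $_->as_text } @by_attr;

print "by tag:       $_\n" for @paragraphs;
print "by attribute: $_\n" for @attr_texts;

$root->delete;    # release the tree once the text has been extracted
```

Calling as_text directly on each returned element also removes the need for the extra new_from_content call inside the question's loop.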

Eugene Yarmash
  • Does it mean that if the `<p>` tag is nested, I might need to rerun the loop again? e.g. `<p>...<p>...<\p>...<\p>` – alvas Dec 19 '11 at 13:13
  • @2er0 this method will return all `p` elements at once. You can use it on the resulting elements in turn to find nested `p`s. – Eugene Yarmash Dec 19 '11 at 13:29
1

You might want to take a look at Mojo::DOM which lets you use CSS selectors.
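
A minimal sketch of that approach, assuming the Mojolicious distribution (which provides Mojo::DOM) is installed. The HTML string and the `intro` class are invented for illustration.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical sample input.
my $html = '<div><p class="intro">Hello <i>world</i></p><p>Second</p></div>';
my $dom  = Mojo::DOM->new($html);

# The selector 'p' matches every <p> element; all_text also includes
# the text of descendant elements such as <i>.
my @paragraphs = $dom->find('p')->map('all_text')->each;

# CSS selectors can narrow the match further, here by class.
my @intro = $dom->find('p.intro')->map('all_text')->each;

print "$_\n" for @paragraphs;
```

The selector string takes the place of the tag-versus-attribute distinction discussed in the other answers: `p` selects by element name, while something like `[id="p"]` would select by attribute.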

Alexander Hartmaier