Use HTML::TreeBuilder in Perl to extract all instances of a specific span class

Question

Trying to make a Perl script to open an HTML file and extract anything contained within <span class="postertrip"> tags.

Sample HTML:

<table>
   <tbody>
      <tr>
         <td class="doubledash">&gt;&gt;</td>
         <td class="reply" id="reply2">
            <a name="2"></a> <label><input type="checkbox" name="delete" value="1199313466,2" /> <span class="replytitle"></span> <span class="commentpostername"><a href="test">Test1</a></span><span class="postertrip"><a href="test">!AAAAAAAA</a></span>  08/01/03(Thu)02:06</label> <span class="reflink"> <a href="test">No.2</a> </span>&nbsp;  <br /> <span class="filesize">File: <a target="_blank" href="test">1199326003295.jpg</a> -(<em>65843 B, 288x412</em>)</span> <span class="thumbnailmsg">Thumbnail displayed, click image for full size.</span><br />  <a target="_blank" test"> <img src="test" width="139" height="200" alt="65843" class="thumb" /></a>    
            <blockquote>
               <p>Test message 1</p>
            </blockquote>
         </td>
      </tr>
   </tbody>
</table>
<table>
   <tbody>
      <tr>
         <td class="doubledash">&gt;&gt;</td>
         <td class="reply" id="reply5">
            <a name="5"></a> <label><input type="checkbox" name="delete" value="1199313466,5" /> <span class="replytitle"></span>  <span class="commentpostername">Test2</span><span class="postertrip">!BBBBBBBB</span> 08/01/03(Thu)16:12</label> <span class="reflink"> <a href="test">No.5</a> </span>&nbsp;  
            <blockquote>
               <p>Test message 2</p>
            </blockquote>
         </td>
      </tr>
   </tbody>
</table>
<table>
   <tbody>
      <tr>
         <td class="doubledash">&gt;&gt;</td>
         <td class="reply" id="reply7">
            <a name="7"></a> <label><input type="checkbox" name="delete" value="1199161229,7" /> <span class="replytitle"></span>  <span class="commentpostername">Test3</span><span class="postertrip">!CCCCCCCC.</span> 08/01/01(Tue)17:53</label> <span class="reflink"> <a href="test">No.7</a> </span>&nbsp;  
            <blockquote>
               <p>Test message 3</p>
            </blockquote>
         </td>
      </tr>
   </tbody>
</table>

Desired output:

!AAAAAAAA
!BBBBBBBB
!CCCCCCCC

Current script:

#!/usr/bin/env perl

use warnings;
use strict;
use 5.010;

use HTML::TreeBuilder;


open(my $html, "<", "temp.html")
        or die "Can't open";


my $tree = HTML::TreeBuilder->new();
$tree->parse_file($html);


foreach my $e ($tree->look_down('class', 'reply')) {
    my $e = $tree->look_down('class', 'postertrip');
    say $e->as_text;
}

Bad output of script:

!AAAAAAAA
!AAAAAAAA
!AAAAAAAA

score 5 · Accepted Answer · answered Jun 07 '20 at 08:09

5

in your foreach-loop you have to look down from the element you found. So the correct code is:

foreach my $parent ($tree->look_down('class', 'reply')) {
    my $e = $parent->look_down('class', 'postertrip');
    say $e->as_text;
}

answered Jun 07 '20 at 08:09

Georg Mavridis

2,312
1
15
23

Along with that fix, either give a filename to `parse_file` or the HTML string to `parse`. – brian d foy Jun 07 '20 at 08:14
passing the file-handle to parse_file also works (at least in my test-installation) – Georg Mavridis Jun 07 '20 at 08:49

score 5 · Answer 2 · answered Jun 07 '20 at 08:10

I've never liked HTML::TreeBuilder. It's a bit of a complicated mess, and it hasn't been updated in three years. Using CSS selectors with Mojo::DOM is pretty easy though. Its find does all that work that the various look_downs do:

use v5.10;
use Mojo::DOM;

my $html = do { local $/; <DATA> };

my @values = Mojo::DOM->new( $html )
    ->find( 'td.reply span.postertrip' )
    ->map( 'all_text' )
    ->each;

say join "\n", @values;

Note that in your HTML::TreeBuilder code, you don't have the logic to select the tags you care about. You can do it but you need extra work. The CSS selectors take care of that for you.

Thank you, I was able to get it working using the other answer but if I need to do anything more like this I will look into Mojo::DOM — Displayname71, Jun 07 '20 at 14:38

Use HTML::TreeBuilder in Perl to extract all instances of a specific span class

2 Answers2