3

Trying to make a Perl script to open an HTML file and extract anything contained within <span class="postertrip"> tags.

Sample HTML:

<table>
   <tbody>
      <tr>
         <td class="doubledash">&gt;&gt;</td>
         <td class="reply" id="reply2">
            <a name="2"></a> <label><input type="checkbox" name="delete" value="1199313466,2" /> <span class="replytitle"></span> <span class="commentpostername"><a href="test">Test1</a></span><span class="postertrip"><a href="test">!AAAAAAAA</a></span>  08/01/03(Thu)02:06</label> <span class="reflink"> <a href="test">No.2</a> </span>&nbsp;  <br /> <span class="filesize">File: <a target="_blank" href="test">1199326003295.jpg</a> -(<em>65843 B, 288x412</em>)</span> <span class="thumbnailmsg">Thumbnail displayed, click image for full size.</span><br />  <a target="_blank" test"> <img src="test" width="139" height="200" alt="65843" class="thumb" /></a>    
            <blockquote>
               <p>Test message 1</p>
            </blockquote>
         </td>
      </tr>
   </tbody>
</table>
<table>
   <tbody>
      <tr>
         <td class="doubledash">&gt;&gt;</td>
         <td class="reply" id="reply5">
            <a name="5"></a> <label><input type="checkbox" name="delete" value="1199313466,5" /> <span class="replytitle"></span>  <span class="commentpostername">Test2</span><span class="postertrip">!BBBBBBBB</span> 08/01/03(Thu)16:12</label> <span class="reflink"> <a href="test">No.5</a> </span>&nbsp;  
            <blockquote>
               <p>Test message 2</p>
            </blockquote>
         </td>
      </tr>
   </tbody>
</table>
<table>
   <tbody>
      <tr>
         <td class="doubledash">&gt;&gt;</td>
         <td class="reply" id="reply7">
            <a name="7"></a> <label><input type="checkbox" name="delete" value="1199161229,7" /> <span class="replytitle"></span>  <span class="commentpostername">Test3</span><span class="postertrip">!CCCCCCCC.</span> 08/01/01(Tue)17:53</label> <span class="reflink"> <a href="test">No.7</a> </span>&nbsp;  
            <blockquote>
               <p>Test message 3</p>
            </blockquote>
         </td>
      </tr>
   </tbody>
</table>

Desired output:

!AAAAAAAA
!BBBBBBBB
!CCCCCCCC

Current script:

#!/usr/bin/env perl

use warnings;
use strict;
use 5.010;

use HTML::TreeBuilder;


open(my $html, "<", "temp.html")
        or die "Can't open";


my $tree = HTML::TreeBuilder->new();
$tree->parse_file($html);


foreach my $e ($tree->look_down('class', 'reply')) {
    my $e = $tree->look_down('class', 'postertrip');
    say $e->as_text;
}

Bad output of script:

!AAAAAAAA
!AAAAAAAA
!AAAAAAAA
Özgür Can Karagöz
  • 1,039
  • 1
  • 13
  • 32
Displayname71
  • 142
  • 2
  • 9

2 Answers2

5

in your foreach-loop you have to look down from the element you found. So the correct code is:

foreach my $parent ($tree->look_down('class', 'reply')) {
    my $e = $parent->look_down('class', 'postertrip');
    say $e->as_text;
}
Georg Mavridis
  • 2,312
  • 1
  • 15
  • 23
5

I've never liked HTML::TreeBuilder. It's a bit of a complicated mess, and it hasn't been updated in three years. Using CSS selectors with Mojo::DOM is pretty easy though. Its find does all that work that the various look_downs do:

use v5.10;
use Mojo::DOM;

my $html = do { local $/; <DATA> };

my @values = Mojo::DOM->new( $html )
    ->find( 'td.reply span.postertrip' )
    ->map( 'all_text' )
    ->each;

say join "\n", @values;

Note that in your HTML::TreeBuilder code, you don't have the logic to select the tags you care about. You can do it but you need extra work. The CSS selectors take care of that for you.

brian d foy
  • 129,424
  • 31
  • 207
  • 592
  • Thank you, I was able to get it working using the other answer but if I need to do anything more like this I will look into Mojo::DOM – Displayname71 Jun 07 '20 at 14:38