Using HTML::TreeBuilder (or Mojo::DOM), I'd like to scrape the text content of a page while keeping it in document order, so that I can put the text values into an array (and then replace each text value with a variable for templating purposes).

But this in TreeBuilder

my $map_r = $tree->tagname_map();

my @contents = map { $_->content_list } $tree->find_by_tag_name(keys %$map_r);

foreach my $c (@contents) {
  say $c;
}

doesn't return the values in document order; of course, hash keys aren't ordered. So how can I visit the tree from the root down and keep the sequence of values returned? Recursively walk the tree? Essentially, I'd like to use the method 'as_text', but separately for each element. (I followed this nice idea, but I need it for all elements.)

sqldoug
  • What's the input (page/source file) to HTML::TreeBuilder? – sandeep Sep 03 '15 at 10:44
  • Not a URL, just local html files (on disk) – sqldoug Sep 03 '15 at 19:30
  • how about `my @content = content_list()` instead of tagname_map(), since you are using tagname_map(), but without any parameters. That should be ordered. Seems weird, since you are using it in the next line, but if you want only certain tags, you should've posted that. – bytepusher Sep 04 '15 at 08:57
  • Thanks; I'll give that a try. I want all tags, since I don't know which of them has text. – sqldoug Sep 04 '15 at 20:15
  • That just loses me, trying to dereference everything. For example, `some text now bold extra text` should be "some text", "now bold", "extra text" (a quoted array of which is not the problem, that I can handle), rather than "some text", "extra text", "now bold", which Mojo::DOM does with `for my $x ( $dom->parse($html)->find('*')->each ) { my $text = $x->text; chomp $text; push @text, $text; }` – sqldoug Sep 09 '15 at 18:05

1 Answer

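This is better (using Mojo::DOM):

my @text;

# find('*') visits every element in document order
$dom->parse($html)->find('*')->each(
    sub {
        # text directly inside this element (all_text would include descendants)
        my $text = shift->text;
        $text =~ s/\s+/ /g;    # collapse runs of whitespace
        push @text, $text;
    }
  );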
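Because find('*') visits elements rather than text nodes, an element whose text is split by a child tag still contributes all of its own text before the child's, which is the ordering problem described in the comments. One possible way around that, sketched below, is to walk the text nodes themselves. This assumes a Mojolicious release that provides descendant_nodes, type and content on Mojo::DOM; the sample markup is hypothetical.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical sample markup; any local HTML string would do
my $html = '<p>some text <b>now bold</b> extra text</p>';
my $dom  = Mojo::DOM->new($html);

my @text;
for my $node ($dom->descendant_nodes->each) {
    next unless $node->type eq 'text';    # skip tags, comments, etc.
    my $t = $node->content;
    $t =~ s/\s+/ /g;                      # collapse runs of whitespace
    $t =~ s/^\s+|\s+$//g;                 # trim both ends
    next unless length $t;                # drop whitespace-only nodes
    push @text, $t;
}

print "$_\n" for @text;
```

Each text node is pushed separately, so text before and after an inline child stays on either side of the child's text.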

However, any further comments are welcome.
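For the HTML::TreeBuilder route the question started with, a recursive walk over content_list may be enough, since content_list returns child elements and plain text strings in document order. This is only a sketch: collect_text is a hypothetical helper name and the sample markup is made up.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample markup; for a file on disk, new_from_file could be used
my $html = '<p>some text <b>now bold</b> extra text</p>';
my $tree = HTML::TreeBuilder->new_from_content($html);

my @text;

# Hypothetical helper: depth-first walk that keeps document order
sub collect_text {
    my ($node) = @_;
    for my $child ($node->content_list) {
        if (ref $child) {
            collect_text($child);         # child element: recurse into it
        }
        else {
            my $t = $child;               # plain string: a text segment
            $t =~ s/\s+/ /g;              # collapse runs of whitespace
            $t =~ s/^\s+|\s+$//g;         # trim both ends
            push @text, $t if length $t;
        }
    }
}

collect_text($tree);
$tree->delete;                            # free the tree when done

print "$_\n" for @text;
```

Because each string in content_list is handled where it appears among its siblings, the text around an inline element is not merged into one value.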

sqldoug