I am using WWW::Mechanize, HTML::TreeBuilder and HTML::Element in my Perl script to navigate through HTML documents.

I want to know how to search for an element that contains a certain string as its text.

Here is an example of an HTML document:

<html>
  <body>
    <ul>
      <li>
       <div class="red">Apple</div>
       <div class="abc">figure = triangle</div>
      </li>
      <li>
       <div class="red">Banana</div>
       <div class="abc">figure = square</div>
      </li>
      <li>
       <div class="green">Lemon</div>
       <div class="abc">figure = circle</div>
      </li>
      <li>
       <div class="blue">Banana</div>
       <div class="abc">figure = line</div>
      </li>
    </ul>
  </body>
</html>

I want to extract the text square. To get it, I have to search for an element with these properties:

  • tag-name is "div"
  • class is "red"
  • content is text "Banana"

Then I need to get its parent (an <li> element), and from the parent the child whose text starts with figure =, but this, and the rest, is easy.

I tried it this way:

use strict;
use warnings;
use utf8;
use Encode;
use WWW::Mechanize;
use HTML::TreeBuilder;
use HTML::Element;

binmode STDOUT, ":utf8";

my $mech = WWW::Mechanize->new();

my $uri = 'http.....'; #URI of an existing html-document

$mech->get($uri);
if (($mech->success()) && ($mech->is_html())) {
    my $resp = $mech->response();
    my $cont = $resp->decoded_content;
    my $root = HTML::TreeBuilder->new_from_content($cont);

    #this works, but returns 2 elements:
    my @twoElements = $root->look_down('_tag' => 'div', 'class' => 'red');

    #this returns an empty list:
    my @empty = $root->look_down('_tag' => 'div', 'class' => 'red', '_content' => 'Banana');

    # do something with @twoElements or @empty   
}

What must I use instead of the last command to get the wanted element?

I am not looking for a workaround (I've found one). What I want is a native function of WWW::Mechanize, HTML::Tree or any other CPAN module.

Hubert Schölnast
  • Why do you have to find the red banana instead of just finding the square figure? – Len Jaffe Jun 08 '15 at 16:55
  • I am searching 1000+ websites for data. They all have the same structure. What is "red Banana" in my simplified example document is a fixed text and a fixed class that exist in all 1000+ documents. What varies (and what I am trying to extract) is what in my example is "square" and "circle". You can think of "red + Banana" as a key and "square" as the value. – Hubert Schölnast Jun 08 '15 at 17:09
  • You can forget about `WWW::Mechanize` and write just `my $root = HTML::TreeBuilder->new_from_url($uri)` – Borodin Jun 08 '15 at 23:40

1 Answer

Here's pseudocode/untested Perl:

  my @twoElements = $root->look_down('_tag' => 'div', 'class' => 'red');
  foreach my $e ( @twoElements ) {
     # content_list returns a list, not an array reference, so take a list slice
     my $text = ($e->content_list)[0];
     next unless defined $text && !ref $text && $text eq 'Banana';
     my $e2 = $e->right;   # get the sibling - might need to try left() depending on ordering
     next unless ref $e2;  # skip if there is no sibling element
     my ($shape) = ($e2->content_list)[0] =~ /figure = (.+)/;

     # do something with shape...

  }

Not perfect, but it should get you started, and it's general enough to reuse easily. Otherwise, replace

    ($shape) = ($e2->content_list)[0] =~ /figure = (.+)/;

with something like

    $shape = 'square' if ($e2->content_list)[0] =~ /square/;

This might be a little cleaner:

  my @elements = $root->look_down('_tag' => 'div', 'class' => 'red' );
  foreach my $e ( @elements ) {
     next unless $e->as_trimmed_text eq 'Banana';
     my $e2 = $e->right;
     my ($shape) = $e2->as_trimmed_text =~ /figure = (.+)/;

     # do something with shape...
  }
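
The text test can also be folded into the search itself, because look_down accepts a code reference as a criterion: it is called with the candidate element as $_[0] and must return true for a match. The following is only a minimal sketch, not a tested solution for the real pages. It assumes the example document from the question, inlined as a string so it runs without a network connection; the variable names ($html, $key, $value) are made up for illustration.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Example document from the question, inlined so the sketch runs offline.
my $html = <<'HTML';
<html><body><ul>
  <li><div class="red">Apple</div><div class="abc">figure = triangle</div></li>
  <li><div class="red">Banana</div><div class="abc">figure = square</div></li>
  <li><div class="green">Lemon</div><div class="abc">figure = circle</div></li>
  <li><div class="blue">Banana</div><div class="abc">figure = line</div></li>
</ul></body></html>
HTML

my $root = HTML::TreeBuilder->new_from_content($html);

# Attribute criteria narrow the candidates first; the code reference
# then checks the element's text. In scalar context look_down returns
# the first match or undef.
my $key = $root->look_down(
    _tag  => 'div',
    class => 'red',
    sub { $_[0]->as_trimmed_text eq 'Banana' },
);

my $shape;
if ($key) {
    # Search the parent <li> for the div whose text starts with "figure =".
    my $value = $key->parent->look_down(
        _tag => 'div',
        sub { $_[0]->as_trimmed_text =~ /^figure\s*=/ },
    );
    ($shape) = $value->as_trimmed_text =~ /^figure\s*=\s*(.+)/ if $value;
}

print defined $shape ? "$shape\n" : "not found\n";
$root->delete;   # free the tree explicitly
```

Putting the code reference after the cheaper tag and class criteria keeps the callback from running on every element in the tree. Going through the parent instead of right() avoids depending on the order of the two divs inside the <li>.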

See also: WWW::Mechanize::TreeBuilder

Len Jaffe
  • This is very similar to the workaround that I use. But I expected a more native solution, i.e. a mechanize-command to search for text. – Hubert Schölnast Jun 08 '15 at 18:30
  • Explicitly: I'm not sure. I did not find an explicit documentation of a text-find-function. But "I didn't find it" and "It doesn't exist" are two different things, and this is the reason why I have to ask here. - Implicitly: Yes. WWW::Mechanize and the many HTML::* modules cover (almost?) everything that has to do with parsing html-documents. This is why I guess that there must be a text-search-function hidden in the jungle of documentations. To search for text in a html-document is a common task, so it is hard to believe that the developers didn't create a function that performs this task. – Hubert Schölnast Jun 08 '15 at 20:09
  • Mech is mostly concerned with navigation, so the bulk of the methods deal with links, forms, and form elements. You've already hit on TreeBuilder for parsing and querying the HTML, for when a simple $content =~ /something/ is not good enough. – Len Jaffe Jun 08 '15 at 22:15
  • > This is why I guess that there must be a text-search-function hidden in the jungle – Len Jaffe Jun 08 '15 at 22:22