I am trying to understand some advanced (for me) Perl syntax for HTML parsing with Mojo::DOM, following this tutorial:

say "div days:";
say $_->text for $dom->find('div.days')->each;

say "\nspan hours:";
say $_->text for $dom->find('span.hours')->each;

What does this syntax mean? What kind of loop is this? A classic for loop looks like this: for(i=0;i<10;i++){ code }, not like this: {code} for (some_condition)

Also, what does the "each" keyword mean in this context? Does it have anything in common with Perl's builtin each function, or is it specific to Mojo::DOM? I think that if each belonged to Mojo::DOM, it would be mentioned on the Mojo::DOM homepage. But I did not find any mention of each in the methods section of their site, so it must be a Perl builtin. However, the builtin each function has a completely different syntax -- how is this possible?

Another example from the tutorial page:

say "Open Times:";
say for $dom->find('div.openTime')
            ->map(sub{$_->children->each})
            ->map(sub{$_->text})
            ->each;

Same issue as above for map and sub methods.

  • Can those pieces of "Perlish" code be rewritten in a more "C style" manner so I can understand them?
  • Most importantly: how can I list all the methods in Mojo::DOM, along with their parameters and return values? It must be possible somehow, because I read that even for Perl there are IDEs with IntelliSense (autocompletion), so such an IDE must know the methods, their return types, etc.
nneonneo
Wakan Tanka
  • I answer in a full answer below, but let me stress that the reason you are not finding all of the method names is that the "missing" ones are actually methods on [Mojo::Collection](http://p3rl.org/Mojo::Collection) which is a container object for holding more than one dom object. Again see below. – Joel Berger Oct 10 '12 at 21:48

3 Answers

say "Open Times:";
say for $dom->find('div.openTime')
            ->map(sub{$_->children->each})
            ->map(sub{$_->text})
            ->each;

All those words (find, map, each) are not keywords but methods: find is a Mojo::DOM method, while map and each are methods of Mojo::Collection, the object that find returns. You can recognize method calls by the -> operator.

In this case, several methods have been chained together. That is only possible if each method in the chain returns an object on which the next method can be called: find returns a Mojo::Collection, and map returns a new Mojo::Collection. This kind of chaining is common in JavaScript, especially with libraries like jQuery. It keeps the code compact and avoids naming intermediate variables.
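To see why such a chain works, here is a minimal sketch of a toy collection class. TinyCollection and its sample data are hypothetical and only mimic the idea; this is not Mojo::Collection's actual implementation.

```perl
use strict;
use warnings;
use feature 'say';

# Hypothetical toy class for illustration only, NOT Mojo::Collection itself.
package TinyCollection;

sub new {
    my ($class, @items) = @_;
    return bless [@items], $class;
}

# Returns a NEW collection object, so another method call can follow.
sub map {
    my ($self, $cb) = @_;
    return TinyCollection->new(map { $cb->($_) } @$self);
}

# Returns a plain list, so it has to be the last call in a chain.
sub each {
    my ($self) = @_;
    return @$self;
}

package main;

my @shouted = TinyCollection->new('mon', 'fri', 'sun')
    ->map(sub { uc $_ })
    ->map(sub { "$_!" })
    ->each;

say for @shouted;
```

Every call except the last returns an object, which is what lets the next -> attach to it; each ends the chain because it hands back an ordinary Perl list.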

Basically, you apply several transformations in a chain.

  1. find all elements 'div.openTime'
  2. map (do something with each of) them with a given sub (this is an actual Perl sub):
    1. get all children of the current element as a collection
    2. and list each of them (as in, return them as a plain list)
  3. map them with a given sub:
    1. extract text content from the element
  4. and list each of them

All this is wrapped in a postfix foreach (as said by @Quentin). say is a feature you can load with use feature qw(say). It works like print but appends a newline.

Maybe now it's clearer what is happening here:

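# find returns a Mojo::Collection of the matching elements
my $collection1 = $dom->find('div.openTime');

# map calls the sub once per element (available as $_) and
# collects the results into a new Mojo::Collection
my $collection2 = $collection1->map(
  sub {
    my $collection = $_->children;
    return $collection->each;    # a list, so the children are flattened
  }
);

my $collection3 = $collection2->map(
  sub {
    return $_->text;
  }
);

# each returns the elements as a plain list to loop over
foreach my $text ($collection3->each) {
  say $text;
}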

IDEs that provide autocompletion will usually scan the code in question to know the methods an object has. Take a look at How do I list available methods on a given object or package in Perl? or read the code of the module. Even better: read the documentation.
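As a rough sketch of how such a listing can be done by hand, the snippet below walks a package's symbol table. The helper list_subs and the package My::Demo are hypothetical names made up for this example; the approach only sees subs defined directly in the package, so inherited or AUTOLOADed methods will not show up.

```perl
use strict;
use warnings;
use feature 'say';

# Hypothetical demo package standing in for a real module.
{
    package My::Demo;
    sub alpha { return 1 }
    sub beta  { return 2 }
}

# Hypothetical helper: names of the subs defined directly in a package.
# It does not follow @ISA, so inherited methods are not listed.
sub list_subs {
    my ($package) = @_;
    no strict 'refs';    # needed for the symbolic stash lookups below
    my @names = grep { defined &{"${package}::$_"} } keys %{"${package}::"};
    return sort @names;
}

say for list_subs('My::Demo');

# To check a single method, including inherited ones, can() is enough.
say 'My::Demo can alpha' if My::Demo->can('alpha');
```

Note that this tells you only the names; parameters and return values are not declared anywhere in Perl code, which is why the documentation remains the authoritative source.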

simbabque

What does this syntax mean, what is going on here?

It is a postfix for loop.

for (@foo) {
    say $_
}

can be written as

say $_ for @foo;
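
For comparison with the C-style loop from the question, here is a small sketch that runs the same iteration three ways. The @hours array is made-up sample data, and the three result arrays exist only to show that the loops are equivalent.

```perl
use strict;
use warnings;
use feature 'say';

my @hours = ('10:00 AM', '6:00 PM', '9:00 PM');    # sample data
my (@c_style, @block_form, @postfix);

# C-style loop with an explicit index
for (my $i = 0; $i < @hours; $i++) {
    push @c_style, $hours[$i];
}

# Block-form foreach with a named loop variable
foreach my $hour (@hours) {
    push @block_form, $hour;
}

# Statement-modifier (postfix) form: each element is aliased to $_
push @postfix, $_ for @hours;

# say with no arguments prints $_, which is why "say for LIST" works
say for @postfix;
```

All three arrays end up with the same contents. The postfix form takes a single statement, not a block, and always uses $_ as the loop variable.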

Also what does "each" keyword mean in this context

It is a method on the object, not the Perl builtin. Here the object is the Mojo::Collection returned by find, and each returns the elements of that collection as a list.

Quentin

It seems that the other answers have explained what I wrote in my tutorial post. That said I wanted to add that I have come to grips with another useful method in Mojo::DOM (actually in the Mojo::Collection class) called pluck. This method reduces the visual complexity of

->map(sub{$_->text})

to

->pluck('text')

Further I have noticed that at least a few of my each calls were extraneous and that a Mojo::Collection used in a list context will "Do What I Mean" and each automagically. Edit: I checked this and in fact when used as a string the elements are joined with a newline. As this isn't exactly what I want, I have returned my each calls.

All that said here is how I might write that same tutorial script now:

#!/usr/bin/env perl

use strict;
use warnings;

use 5.10.0;
use Mojo::DOM;

my $dom = Mojo::DOM->new(<<'HTML');
<div class="box notranslate" id="venueHours">
<h5 class="translate">Hours</h5>
<div class="status closed">Currently closed</div>
<div class="hours">
  <div class="timespan">
    <div class="openTime">
      <div class="days">Mon,Tue,Wed,Thu,Sat</div>
      <span class="hours"> 10:00 AM–6:00 PM</span>
    </div>
  </div>
  <div class="timespan">
    <div class="openTime">
      <div class="days">Fri</div>
      <span class="hours"> 10:00 AM–9:00 PM</span>
    </div>
  </div>
  <div class="timespan">
    <div class="openTime">
      <div class="days">Sun</div>
      <span class="hours"> 10:00 AM–5:00 PM</span>
    </div>
  </div>
</div>
</div>
HTML

say "div days:";
say for $dom->find('div.days')->pluck('text')->each;

say "\nspan hours:";
say for $dom->find('span.hours')->pluck('text')->each;

say "\nOpen Times:";
say for $dom->find('div.openTime')
            ->map(sub{$_->children->each})
            ->pluck('text')
            ->each;

Note that I don't use ->pluck('children') because the children method returns a Mojo::Collection object, meaning that the return from pluck would be a collection of collections. In order to flatten the structure I need to call each on the result of the children call and thus I cannot remove that particular ->map call.
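
The flattening effect can be illustrated with plain Perl arrays, as an analogy rather than Mojo code. The @groups data below is hypothetical, with array references standing in for the per-element collections.

```perl
use strict;
use warnings;
use feature 'say';

# Hypothetical sample data: one array ref per "openTime" block
my @groups = (
    ['Mon,Tue,Wed,Thu,Sat', '10:00 AM-6:00 PM'],
    ['Fri',                 '10:00 AM-9:00 PM'],
);

# Returning a reference keeps one item per group (a nested result),
# which is the situation pluck('children') would produce.
my @nested = map { [@$_] } @groups;

# Returning a list lets map splice every item into one flat result,
# which is what calling each inside the map callback achieves.
my @flat = map { @$_ } @groups;

say scalar @nested;    # number of groups
say scalar @flat;      # number of individual items
say for @flat;
```

The nested version still has one entry per group, while the flat version has one entry per item, ready for a further call that expects individual elements.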

However, now I wonder whether I could avoid this hassle altogether. Mojo::DOM has excellent support for CSS3 selectors (w3schools reference), so one thing I might try is not to select the parent (div.openTime) directly but to select its children in the selector itself.

say "\nOpen Times:";
say for $dom->find('div.openTime > *')->pluck('text')->each;

So there is a good lesson here: letting the selector return a collection that is as close as possible to the one you want saves you from having to transform it later.


To answer your final questions:

To translate this

say for $dom->find('div.openTime')
            ->map(sub{$_->children->each})
            ->map(sub{$_->text})
            ->each;

to more C-esque Perl (though I won't take it to the for(i=0;i<10;i++){ ... } extreme), it might look something like this:

my @open_times = $dom->find('div.openTime')->each;

my @all_children;
foreach my $elem ( @open_times ) {
  my @children = $elem->children->each;
  push @all_children, @children;
}

my @texts;
foreach my $child ( @all_children ) {
  push @texts, $child->text;
}

foreach my $text ( @texts ) {
  print $text . "\n";
}

I'm sure you can see why I prefer the Mojo (object-chaining) way.

As to your second question: Mojolicious has great (if sometimes overly verbose) documentation. Start here to learn about the whole system. Reading about Mojo::DOM and Mojo::Collection specifically should be enough to handle DOM parsing. I think part of your problem is that you didn't notice the interdependency of the DOM and Collection objects, and so you mistakenly assumed that all the method calls were on DOM objects. When you read carefully you will see that some of the DOM methods (those that might return more than one result) return Collection objects, and find is one such method.

Joel Berger