Targeting individual elements in HTML using Perl and Mojo::DOM in well-formatted HTML

Question

I'm a relative beginner with Perl, and this is my first question here.

I am trying to retrieve certain information from a large online dataset (Eur-Lex), where each HTML document is well-formed HTML, with constant elements. Each HTML file is identified by its Celex number, which is supplied as the argument to the script (see my Perl code below). The HTML data looks like this (showing only the part I'm interested in):

<!-- 
 <blahblah>
< lots of stuff here, before the interesting part>
--> 

      <div id="PPClass_Contents" class="panel-collapse collapse in" role="tabpanel"
           aria-labelledby="PP_Class">
         <div class="panel-body">
            <dl class="NMetadata">
               <dt xmlns="http://www.w3.org/1999/xhtml">EUROVOC descriptor: </dt>
               <dd xmlns="http://www.w3.org/1999/xhtml">
                  <ul>
                     <li>
                        <a href="./../../../search.html?type=advanced&amp;DTS_DOM=ALL&amp;DTS_SUBDOM=ALL_ALL&amp;SUBDOM_INIT=ALL_ALL&amp;DC_CODED=341&amp;lang=en">
                           <span lang="en">descriptor_1</span>
                        </a>
                     </li>
                     <li>
                        <a href="./../../../search.html?type=advanced&amp;DTS_DOM=ALL&amp;DTS_SUBDOM=ALL_ALL&amp;SUBDOM_INIT=ALL_ALL&amp;DC_CODED=5158&amp;lang=en">
                           <span lang="en">descriptor_2</span>
                        </a>
                     </li>
                     <li>
                        <a href="./../../../search.html?type=advanced&amp;DTS_DOM=ALL&amp;DTS_SUBDOM=ALL_ALL&amp;SUBDOM_INIT=ALL_ALL&amp;DC_CODED=7983&amp;lang=en">
                           <span lang="en">descriptor_3</span>
                        </a>
                     </li>
                     <li>
                        <a href="./../../../search.html?type=advanced&amp;DTS_DOM=ALL&amp;DTS_SUBDOM=ALL_ALL&amp;SUBDOM_INIT=ALL_ALL&amp;DC_CODED=933&amp;lang=en">
                           <span lang="en">descriptor_4</span>
                        </a>
                     </li>
                  </ul>
               </dd>
               <dt xmlns="http://www.w3.org/1999/xhtml">Subject matter: </dt>
               <dd xmlns="http://www.w3.org/1999/xhtml">
                  <ul>
                     <li>
                        <a href="./../../../search.html?type=advanced&amp;DTS_DOM=ALL&amp;DTS_SUBDOM=ALL_ALL&amp;SUBDOM_INIT=ALL_ALL&amp;CT_CODED=BUDG&amp;lang=en">
                           <span lang="en">Subject_1</span>
                        </a>
                     </li>
                  </ul>
               </dd>
               <dt xmlns="http://www.w3.org/1999/xhtml">Directory code: </dt>
               <dd xmlns="http://www.w3.org/1999/xhtml">
                  <ul>
                     <li>01.60.20.00 <a href="./../../../search.html?type=advanced&amp;DTS_DOM=ALL&amp;DTS_SUBDOM=ALL_ALL&amp;SUBDOM_INIT=ALL_ALL&amp;CC_1_CODED=01&amp;lang=en">
                           <span lang="en">Designation_level_1</span>
                        </a> / <a href="./../../../search.html?type=advanced&amp;DTS_DOM=ALL&amp;DTS_SUBDOM=ALL_ALL&amp;SUBDOM_INIT=ALL_ALL&amp;CC_2_CODED=0160&amp;lang=en">
                           <span lang="en">Designation_level_2</span>
                        </a> / <a href="./../../../search.html?type=advanced&amp;DTS_DOM=ALL&amp;DTS_SUBDOM=ALL_ALL&amp;SUBDOM_INIT=ALL_ALL&amp;CC_3_CODED=016020&amp;lang=en">
                           <span lang="en">Designation_level_3</span>
                        </a>
                     </li>
                  </ul>
               </dd>
            </dl>
         </div>
      </div>
   </div>

<!-- 
<still more stuff here>
-->

I am interested in the contents of the div with the id "PPClass_Contents", which consists of 3 groups:


    - EUROVOC descriptor:
    - Subject matter:
    - Directory code:

Based on the above HTML, I would like to get the children of those 3 main elements using Perl and Mojo. The result should be a single-line text file in which the 3 groups are separated by tabs and multiple child elements within a group are separated by pipe characters, something like this:


    CELEX_No "TAB" descriptor_1|descriptor_2|descriptor_3|descriptor_4|..|descriptor_n "TAB" Subject_1|..|Subject_n "TAB" Designation_level_1|Designation_level_2|Designation_level_3|..|Designation_level_n

The number of "descriptors", "Subjects" and "Designation_levels" elements (children of those 3 main groups) can range from 1 to "n"; it is not fixed and is not known in advance.

I have the following code, which does print out the plain text of the interesting part, but I need to address the individual elements and print them out in a new file as described above:


    #!/usr/bin/perl
    # returns "Classification" descriptors for given CELEX and Language

    use strict;
    use warnings;

    use Mojo::UserAgent;

    if (@ARGV != 2) {    # numeric check: exactly two arguments are required
        print "Wrong number of arguments!\n";
        print "Syntax: clookup.pl Lang_ID celex_No.\n";
        exit -1;
    }

    my $lang = $ARGV[0];   
    my $celex = $ARGV[1];
    my $lclang = lc $lang;

    # fetch the eurlex page

    my $ua = Mojo::UserAgent->new;
    my $dom = $ua->get("https://eur-lex.europa.eu/legal-content/$lang/ALL/?uri=CELEX:$celex")->res->dom;


    ################ let's extract interesting parts:


    my $text = $dom->at('#PPClass_Contents')->all_text;
    print "$text\n";

EDIT (added): You can try my Perl script using two arguments:

    - lang_code ("DE", "EN", "IT", etc.)
    - Celex number (e.g. E2014C0303, 52015BP2212, 52015BP0930(48), 52015BP0930(36), 52015BP0930(41), E2014C0302, E2014C0301, E2014C0271, E2014C0134)

For example (if you name my script "clookup.pl"): $ perl clookup.pl EN E2014C0303

So, how can I address individual elements (of unknown number) as described above, using Mojo::DOM?

Or, is there something simpler or faster (using Perl)?

score 1 · Accepted Answer · answered Jan 11 '19 at 17:06


You are on the right track. First, you need to understand the HTML inside your #PPClass_Contents. Each group is a <dt>/<dd> pair inside a definition list (<dl>): the <dt> holds the label and the <dd> holds the values. Since you only care about the values, you can search directly for the <dd> elements.

    $dom->at('#PPClass_Contents')->find('dd')

This gives you a Mojo::Collection, which you can iterate with ->each by passing it an anonymous sub that acts as a callback.

    $dom->at('#PPClass_Contents')->find('dd')->each(sub {
        $_; # this is the current element
    });
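As a small offline illustration of the callback form, the sketch below parses a hypothetical, trimmed-down stand-in for the Eur-Lex markup (the inline HTML is an assumption, not the real page) and records what the callback receives. Besides `$_`, Mojo::Collection also passes the element and a 1-based position as arguments.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical, simplified stand-in for the real Eur-Lex page
my $dom = Mojo::DOM->new(<<'HTML');
<div id="PPClass_Contents">
  <dl>
    <dt>EUROVOC descriptor: </dt>
    <dd><ul><li><a href="#"><span lang="en">descriptor_1</span></a></li></ul></dd>
    <dt>Subject matter: </dt>
    <dd><ul><li><a href="#"><span lang="en">Subject_1</span></a></li></ul></dd>
  </dl>
</div>
HTML

my $dds = $dom->at('#PPClass_Contents')->find('dd');

# Callback form: the element is the first argument (and $_),
# the second argument is its 1-based position in the collection
my @seen;
$dds->each(sub {
    my ($dd, $num) = @_;
    push @seen, "$num:" . $dd->tag;
});

print "@seen\n";    # prints "1:dd 2:dd"
```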

Each element will be passed to that sub, and can be referenced using the topic variable $_. Inside each <dd> there is a <ul>, and each <li> contains one or more <span> elements with the text you want. So let's find those.

    $_->find('span')

We can directly build the column in your output at this stage. Let's use the other form of ->each: called without a callback, it returns the elements of the Mojo::Collection from ->find as a normal Perl list. We can then use a regular map operation to grab each <span>'s text node and join the results into a string.

    join '|', map { $_->text } $_->find('span')->each
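To see the list form of `->each` in isolation, here is a minimal sketch that runs against a single hypothetical `<dd>` (the inline HTML is an assumed, simplified version of one group from the question):

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical single group; the real page has one <dd> like this per label
my $dd = Mojo::DOM->new(<<'HTML')->at('dd');
<dl><dd><ul>
  <li><a href="#"><span lang="en">descriptor_1</span></a></li>
  <li><a href="#"><span lang="en">descriptor_2</span></a></li>
  <li><a href="#"><span lang="en">descriptor_3</span></a></li>
</ul></dd></dl>
HTML

# Without a callback, each returns the collection's elements as a plain list
my @spans = $dd->find('span')->each;

# Ordinary Perl from here on: extract each span's text and join with pipes
my $column = join '|', map { $_->text } @spans;

print "$column\n";    # prints "descriptor_1|descriptor_2|descriptor_3"
```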

To tie all that together, we declare an array outside this construct, and stick the $celex number in it as the first column.

    my @columns = ($celex);
    $dom->at('#PPClass_Contents')->find('dd')->each(sub {
        push @columns, join '|', map { $_->text } $_->find('span')->each;
    });

Producing the final tab-separated output is now trivial.

    print join "\t", @columns;

I've done this with EN as the language and the $celex number 32006L0121, which the search used in its example tooltip. The result is this:

32006L0121 marketing standard|chemical product|approximation of laws|dangerous substance|scientific report|packaging|European Chemicals Agency|labelling Internal market - Principles|Approximation of laws|Technical barriers|Environment|Consumer protection Industrial policy and internal market|Internal market: approximation of laws|Dangerous substances
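The steps above can be combined into one offline sketch that needs no network access. The inline HTML, including the empty `<dd>`, is a hypothetical stand-in for the real page, and the CELEX number is only a placeholder. Skipping empty columns is an added assumption, not part of the code above: a `<dd>` without any `<span>` would otherwise contribute an empty column.

```perl
use strict;
use warnings;
use Mojo::DOM;

my $celex = '32006L0121';    # placeholder; the real script takes this from @ARGV

# Hypothetical, simplified markup, including one <dd> with no <span> inside
my $dom = Mojo::DOM->new(<<'HTML');
<div id="PPClass_Contents">
  <dl>
    <dt>EUROVOC descriptor: </dt>
    <dd><ul>
      <li><a href="#"><span lang="en">descriptor_1</span></a></li>
      <li><a href="#"><span lang="en">descriptor_2</span></a></li>
    </ul></dd>
    <dt>Subject matter: </dt>
    <dd><ul><li><a href="#"><span lang="en">Subject_1</span></a></li></ul></dd>
    <dt>Empty group: </dt>
    <dd></dd>
  </dl>
</div>
HTML

my @columns = ($celex);
$dom->at('#PPClass_Contents')->find('dd')->each(sub {
    my $column = join '|', map { $_->text } $_->find('span')->each;
    push @columns, $column if length $column;    # assumption: drop empty groups
});

my $line = join "\t", @columns;
print "$line\n";
```

Note that dropping an empty group shifts the later columns to the left. If the column positions matter, keep the empty string and push every column unconditionally.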

answered Jan 11 '19 at 17:06

simbabque


Thanks! This produces exactly the results I need. It would have taken me a couple of days more to delve deeper into Mojo and css to get a kind of half-baked solution which would not be nearly as usable and elegant :) – Denis_HR Jan 11 '19 at 17:17
Here's another way you could write it, making full use of Mojo chaining: `$dom->at('#PPClass_Contents')->find('dd')->map(sub { $_->find('span')->map('text')->join('|') })->tap(sub { unshift @$_, $celex })->join("\t")->say;` Of course Mojo::DOM and Mojo::Collection are designed so you can use them as regular Perl data at whatever point makes sense. – Grinnz Jan 11 '19 at 18:36
@Grinnz I know this works, but I have to say it doesn't feel very readable to me. – simbabque Jan 11 '19 at 18:57
I wouldn't write it that way either, I was just showing the other extreme. I would likely use somewhere in between, with more line breaks. – Grinnz Jan 11 '19 at 18:58
@Grinnz, I tested your solution, and it also works. However, I agree with simbabque, it is not exactly "transparent" or obvious. :) BTW, how would your solution be redirected to a file? – Denis_HR Jan 11 '19 at 19:15
@Denis_HR if you remove the final `->say`, the return value is just a string. You can print that wherever you want, or `->say` also takes an optional filehandle to print to. (The `say` method is from Mojo::ByteStream FWIW) – Grinnz Jan 11 '19 at 19:16
@Grinnz, thanks for the help :) Actually, your solution produces even slightly better result, since it does not add an extraneous "TAB" character at the end of line :) So, the working script using your solution could be something like this: `my $ua = Mojo::UserAgent->new; my $dom = $ua->get("https://eur-lex.europa.eu/legal-content/$lang/ALL/?uri=CELEX:$celex")->res->dom; my $line = $dom->at('#PPClass_Contents')->find('dd')->map(sub { $_->find('span')->map('text')->join('|') })->tap(sub { unshift @$_, $celex })->join("\t"); print NEW "$line\n"; ` – Denis_HR Jan 11 '19 at 19:28
Sounds like there is an empty element in the data that gets processed with mine. – simbabque Jan 11 '19 at 19:46
