Relative begginer with Perl, with my first question here, trying the following:
I am trying to retrieve certain information from a large online dataset (Eur-Lex), where each HTML document is well-formed HTML, with constant elements. Each HTML file is identified by its Celex number, which is supplied as the argument to the script (see my Perl code below). The HTML data looks like this (showing only the part I'm interested in):
<!--
<blahblah>
< lots of stuff here, before the interesting part>
-->
<div id="PPClass_Contents" class="panel-collapse collapse in" role="tabpanel"
aria-labelledby="PP_Class">
<div class="panel-body">
<dl class="NMetadata">
<dt xmlns="http://www.w3.org/1999/xhtml">EUROVOC descriptor: </dt>
<dd xmlns="http://www.w3.org/1999/xhtml">
<ul>
<li>
<a href="./../../../search.html?type=advanced&DTS_DOM=ALL&DTS_SUBDOM=ALL_ALL&SUBDOM_INIT=ALL_ALL&DC_CODED=341&lang=en">
<span lang="en">descriptor_1</span>
</a>
</li>
<li>
<a href="./../../../search.html?type=advanced&DTS_DOM=ALL&DTS_SUBDOM=ALL_ALL&SUBDOM_INIT=ALL_ALL&DC_CODED=5158&lang=en">
<span lang="en">descriptor_2</span>
</a>
</li>
<li>
<a href="./../../../search.html?type=advanced&DTS_DOM=ALL&DTS_SUBDOM=ALL_ALL&SUBDOM_INIT=ALL_ALL&DC_CODED=7983&lang=en">
<span lang="en">descriptor_3</span>
</a>
</li>
<li>
<a href="./../../../search.html?type=advanced&DTS_DOM=ALL&DTS_SUBDOM=ALL_ALL&SUBDOM_INIT=ALL_ALL&DC_CODED=933&lang=en">
<span lang="en">descriptor_4</span>
</a>
</li>
</ul>
</dd>
<dt xmlns="http://www.w3.org/1999/xhtml">Subject matter: </dt>
<dd xmlns="http://www.w3.org/1999/xhtml">
<ul>
<li>
<a href="./../../../search.html?type=advanced&DTS_DOM=ALL&DTS_SUBDOM=ALL_ALL&SUBDOM_INIT=ALL_ALL&CT_CODED=BUDG&lang=en">
<span lang="en">Subject_1</span>
</a>
</li>
</ul>
</dd>
<dt xmlns="http://www.w3.org/1999/xhtml">Directory code: </dt>
<dd xmlns="http://www.w3.org/1999/xhtml">
<ul>
<li>01.60.20.00 <a href="./../../../search.html?type=advanced&DTS_DOM=ALL&DTS_SUBDOM=ALL_ALL&SUBDOM_INIT=ALL_ALL&CC_1_CODED=01&lang=en">
<span lang="en">Designation_level_1</span>
</a> / <a href="./../../../search.html?type=advanced&DTS_DOM=ALL&DTS_SUBDOM=ALL_ALL&SUBDOM_INIT=ALL_ALL&CC_2_CODED=0160&lang=en">
<span lang="en">Designation_level_2</span>
</a> / <a href="./../../../search.html?type=advanced&DTS_DOM=ALL&DTS_SUBDOM=ALL_ALL&SUBDOM_INIT=ALL_ALL&CC_3_CODED=016020&lang=en">
<span lang="en">Designation_level_3</span>
</a>
</li>
</ul>
</dd>
</dl>
</div>
</div>
</div>
<!--
<still more stuff here>
-->
I am interested in the info contained in "PPClass_Contents" div id, which consists of 3 elements:
- EUROVOC descriptor: - Subject matter: - Directory code:
Based on the above HTML, I would like to get the children of those 3 main elements, using Perl and Mojo, getting the result similar to this (single line text file, 3 groups separated by tabs, multiple child elements within a grup are separated by pipe characters, something like this:
CELEX_No "TAB" descriptor_1|descriptor_2|descriptor_3|descriptor_4|..|descriptor_n "TAB" Subject_1|..|Subject_n "TAB" Designation_level_1|Designation_level_2|Designation_level_3|..|Designation_level_n
"descriptors", "Subjects" and "Designation_levels" elements (children of those 3 main groups) can be from 1 to "n", the number is not fixed, and is not known in advance.
I have the following code, which does print out the plain text of the interesting part, but I need to address the individual elements and print them out in a new file as described above:
#!/usr/bin/perl # returns "Classification" descriptors for given CELEX and Language use strict; use warnings; use Mojo::UserAgent; if ($#ARGV ne "1") { print "Wrong number of arguments!\n"; print "Syntax: clookup.pl Lang_ID celex_No.\n"; exit -1; } my $lang = $ARGV[0]; my $celex = $ARGV[1]; my $lclang = lc $lang; # fetch the eurlex page my $ua = Mojo::UserAgent->new; my $dom = $ua->get("https://eur-lex.europa.eu/legal-content/$lang/ALL/?uri=CELEX:$celex")->res->dom; ################ let's extract interesting parts: my $text = $dom->at('#PPClass_Contents')->all_text; print "$text\n";
EDIT (added): You can try my Perl script using two arguments:
lang_code ("DE","EN","IT", etc.)
Celex number (e.g.: E2014C0303, 52015BP2212, 52015BP0930(48), 52015BP0930(36), 52015BP0930(41), E2014C0302, E2014C0301, E2014C0271, E2014C0134).
For example (if you name my script "clookup.pl"):
$ perl clookup.pl EN E2014C0303
So, how can I address individual elements (of unknown number) as described above, using Mojo::DOM?
Or, is there something simpler or faster (using Perl)?