
I have the following XML, and I'd like to get one child element of a parent when a regex matches another child element of that same parent. The problem is that the same tag names are repeated throughout the XML, so I can't simply do Movie->Year because there are many movie elements.

e.g.

Data:

<movie>
    <title>Titanic</title>
    <year>1997</year>
    <genre>Drama</genre>
</movie>
<movie>
    <title>Moneyball</title>
    <year>2011</year>
    <genre>Sport/Drama</genre>
</movie>
<movie>
    <title>Fight Club</title>
    <year>1999</year>
    <genre>Drama/Action</genre>
</movie>

Perl

 my $simple = XML::Simple->new( );
 my $tree = $simple->XMLin($_);
 my $movie = $tree->{movie}{title};

if($movie =~ /Titanic/)
{
    # $movie -> year ???
    # desired output = 1997
}

What is the easiest way to do this with XML::Simple ?

    Please _not not not_ with `XML::Simple`. While that module certainly had its place it's been outdated for a long time, and its own author has ["strongly discouraged"](https://metacpan.org/pod/XML::Simple#STATUS-OF-THIS-MODULE) its use for years, and has written a [tutorial](https://grantm.github.io/perl-libxml-by-example/) for another. Go for either [XML::LibXML](https://metacpan.org/pod/XML::LibXML) or [XML::Twig](https://metacpan.org/pod/XML::Twig) – zdim Mar 21 '19 at 23:57

4 Answers


There is no easy way with XML::Simple because it's the hardest XML parser to use. Its own documentation warns against using it. ("The use of this module in new code is strongly discouraged.")


What you have there isn't well-formed XML (it has no single root element), so we first have to make it valid by wrapping it in one:

use XML::LibXML qw( );

my $parser = XML::LibXML->new();
my $doc = $parser->parse_string("<movies>$not_quite_xml</movies>");

my ($movie_node) = $doc->findnodes('/movies/movie[title/text()="Titanic"]')
   or die("Titanic not found\n");

my $year = $movie_node->findvalue('year/text()');
...
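
Note that the XPath expression above matches the title exactly. XPath 1.0, which libxml2 implements, has no built-in regex function, so if a real regex is needed, one option is to select every <movie> node and filter in Perl. The following is a minimal self-contained sketch of that idea, not a definitive implementation. It assumes the fragment is held in $not_quite_xml (filled here with the sample data from the question), and the pattern in $re is a hypothetical one chosen only for illustration.

```perl
use strict;
use warnings;
use feature 'say';

use XML::LibXML qw( );

# Sample data from the question; it has no single root element.
my $not_quite_xml = <<'END_XML';
<movie>
    <title>Titanic</title>
    <year>1997</year>
    <genre>Drama</genre>
</movie>
<movie>
    <title>Moneyball</title>
    <year>2011</year>
    <genre>Sport/Drama</genre>
</movie>
<movie>
    <title>Fight Club</title>
    <year>1999</year>
    <genre>Drama/Action</genre>
</movie>
END_XML

# Wrap the fragment in a root element so that it parses.
my $doc = XML::LibXML->new()->parse_string("<movies>$not_quite_xml</movies>");

# Hypothetical pattern, for illustration only.
my $re = qr/^titan/i;

# Keep the first <movie> whose <title> text matches the regex.
my ($movie_node) =
    grep { $_->findvalue('title') =~ $re }
    $doc->findnodes('/movies/movie')
        or die("No matching movie found\n");

# The sibling <year> is looked up relative to the matched <movie>.
my $year = $movie_node->findvalue('year');
say $year;
```

Because the lookup of year is relative to the matched <movie> node, the other <movie> elements in the document do not interfere.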
ikegami

I hope it has been conveyed that XML::Simple should not be used: it was superseded by far better modules a long time ago, and its own author has "strongly discouraged" its use for years.

This example shows a way to use the parent node in order to query siblings, as specifically asked for. (I pad your sample with a <document> root node so as to have well-formed XML.) The answer by ikegami shows how you can more directly do what you seem to need.

If you have a reason to scan through the <title> nodes (perhaps looking for a variety of titles), then their sibling <year> nodes can be found with:

use strict;
use warnings;
use feature 'say';    

use XML::LibXML;    

my $file = shift || die "Usage: $0 filename\n";

my $doc = XML::LibXML->load_xml(location => $file, no_blanks => 1); 

my $xpath = '/document/movie/title';

foreach my $node ($doc->findnodes($xpath)) {
    if ($node->to_literal =~ /(Titanic)/) {
        say "Title: $1";
        foreach my $yr ($node->parentNode->findnodes('./year')) {
            say "\tyear: ", $yr->to_literal;
        }   
    }   
}

If there is always a single <year> node under each <movie> node, then this can be simplified with the shortcut findvalue, which replaces the loop over $node->parentNode->findnodes:

foreach my $node ($doc->findnodes($xpath)) {
    if ($node->to_literal =~ /(Titanic)/) {
        say "Title: $1";
        say "\tyear: ", $node->parentNode->findvalue('./year');
    }   
}

Here we get the text directly and so there is no need for ->to_literal either.

There are many more methods in XML::LibXML::Node, the base class from which the more specific node classes are derived. One of interest here may be nextSibling, as a way to walk through the other information that accompanies the title within one <movie>.

Note that this complete and full-featured library provides many more tools for working with XML. For one, adding details to your source file, like attributes, would allow use of the library's other strengths.
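
As a rough sketch of that sibling-walking idea, the following starts at a matching <title> and steps through the nodes that follow it inside the same <movie>. It uses nextNonBlankSibling rather than nextSibling so that whitespace-only text nodes are skipped without relying on the no_blanks option. The inline sample data and the %info hash are assumptions made for illustration.

```perl
use strict;
use warnings;
use feature 'say';

use XML::LibXML;

# Sample data, padded with a <document> root node (an assumption
# matching the XPath used in this answer).
my $xml = <<'END_XML';
<document>
<movie>
    <title>Titanic</title>
    <year>1997</year>
    <genre>Drama</genre>
</movie>
<movie>
    <title>Moneyball</title>
    <year>2011</year>
    <genre>Sport/Drama</genre>
</movie>
</document>
END_XML

my $doc = XML::LibXML->load_xml(string => $xml);

my %info;    # hypothetical container: element name => text
for my $title ($doc->findnodes('/document/movie/title')) {
    next unless $title->textContent =~ /Titanic/;

    # Step through the siblings after <title>, skipping blank text nodes.
    for (my $sib = $title->nextNonBlankSibling;
         $sib;
         $sib = $sib->nextNonBlankSibling)
    {
        # Ignore anything that is not an element (e.g. comments).
        next unless $sib->isa('XML::LibXML::Element');
        $info{ $sib->nodeName } = $sib->textContent;
    }
}

say "$_: $info{$_}" for sort keys %info;
```

This collects every element that follows the matching title, so it only makes sense when <title> comes first inside <movie>, as it does in the sample.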

The documentation is spread over a number of pages. See this post for a summary of links to relevant docs. There is also a tutorial for XML::LibXML, by the author of XML::Simple.

zdim

Yet another way to do it, with Mojo::DOM this time. There's nothing to recommend this over other solutions (besides the XML::Simple one).

This adds a root element then uses a CSS selector to grab the titles:

use utf8;
use strict;
use warnings;
use feature 'say';    # say is not enabled by default

my $xml = <<'HERE';
<movies>
<movie>
    <title>Titanic</title>
    <year>1997</year>
    <genre>Drama</genre>
</movie>
<movie>
    <title>Moneyball</title>
    <year>2011</year>
    <genre>Sport/Drama</genre>
</movie>
<movie>
    <title>Fight Club</title>
    <year>1999</year>
    <genre>Drama/Action</genre>
</movie>
</movies>
HERE

use Mojo::DOM;

my @movies = Mojo::DOM
    ->new( $xml )
    ->find( 'movies title' )
    ->map( 'text' )
    ->each;

say join "\n", @movies;
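
The snippet above lists the titles only. To get the year of a movie whose title matches a regex, one possibility is to select the <movie> elements, filter them on their <title> text, and then read the <year> of each match. This is a sketch under assumptions: it parses explicitly in XML mode, uses a shortened copy of the sample data, and assumes every <movie> has both a <title> and a <year>.

```perl
use strict;
use warnings;
use feature 'say';

use Mojo::DOM;

my $xml = <<'HERE';
<movies>
<movie>
    <title>Titanic</title>
    <year>1997</year>
    <genre>Drama</genre>
</movie>
<movie>
    <title>Fight Club</title>
    <year>1999</year>
    <genre>Drama/Action</genre>
</movie>
</movies>
HERE

# Parse in XML mode rather than relying on auto-detection.
my $dom = Mojo::DOM->new->xml(1)->parse($xml);

my @years = $dom
    ->find('movies > movie')                             # every <movie>
    ->grep(sub { $_->at('title')->text =~ /Titanic/ })   # regex on the title
    ->map(sub { $_->at('year')->text })                  # year of the same movie
    ->each;

say for @years;
```

Each callback receives one <movie> element in $_, so at('title') and at('year') are both resolved inside the same parent.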
brian d foy

You can also call a command line tool like xmlstarlet from Perl to quickly extract just the information you need.

For instance, if your fragment of an XML document was stored at /tmp/foo.xml, then the following shell script will convert it into a tabular form which is easier to process in Perl by reading a line at a time.

{ echo '<movies>' ; cat /tmp/foo.xml ; echo '</movies>'; } \
    | xmlstarlet sel -T -t -m '//movie' -v "concat(title, '|', year)" -n

prints

Titanic|1997
Moneyball|2011
Fight Club|1999

This particular way of converting the XML document to a more convenient form is not robust against newlines or | characters in movie titles, and it requires an external tool, but it is easy.
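
On the Perl side, processing that tabular output might look like the following sketch. To stay self-contained it reads the three lines shown above from a here-doc instead of running xmlstarlet; in real use the lines would come from the command's output. The %year_of hash is a name made up for illustration. Matching greedily up to the last | means a | inside a title would not break the split, although embedded newlines still would.

```perl
use strict;
use warnings;
use feature 'say';

# Stand-in for the xmlstarlet output shown above.
my $output = <<'END';
Titanic|1997
Moneyball|2011
Fight Club|1999
END

my %year_of;    # hypothetical lookup table: title => year
for my $line (split /\n/, $output) {
    # Greedy (.*) makes the split happen at the last '|' on the line.
    my ($title, $year) = $line =~ /^(.*)\|([^|]*)$/
        or next;    # skip lines that do not have the expected shape
    $year_of{$title} = $year;
}

# Look up the year for every title matching a regex.
for my $title (grep { /Titanic/ } sort keys %year_of) {
    say "$title: $year_of{$title}";
}
```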

Greg Nisbet