
I want to parse an SVG file in Perl, but I have seen advice NOT to use certain libraries (XML::Simple, XML::XPath), for various reasons. The thread below suggests XML::LibXML::XPathContext:

Perl XML/SVG Parser unable to findnodes

Assuming I use XML::LibXML::XPathContext, I'm still not sure how to extract the nodes I'm interested in:

1) Those with an "id" that contains "Drawing...", along with their size (path fill... d=".. etc) and text ("tspan")
2) The "path" nodes (at the bottom of the SVG) which are NOT part of any "Drawing_" node, along with their location (d="...)

use XML::LibXML;
use XML::LibXML::XPathContext;

my $doc = XML::LibXML->load_xml( location => $file);
my $xpc = XML::LibXML::XPathContext->new( $doc);
$xpc->registerNs(x => 'http://www.w3.org/2000/svg');

foreach my $drawing ($xpc->findnodes( ??? )) {
    print "Found drawing\n";
}

foreach my $path ($xpc->findnodes( ??? )) {
    print "Found path\n";
}

My SVG:

<?xml version="1.0" encoding="UTF-8"?>
<svg version="1.2">
 <g visibility="visible" id="Master" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" xml:space="preserve">
  <rect fill="none" stroke="none" x="0" y="0" width="86360" height="55880"/>
 </g>
 <g visibility="visible" id="Page1">
  <g id="Drawing_1">
   <path fill="rgb(255,211,32)" stroke="none" d="M 15350,3285 L 31988,3285 31988,4937 15350,4937 15350,3285 15350,3285 Z"/>
   <path fill="none" stroke="rgb(128,128,128)" stroke-width="102" stroke-linejoin="round" d="M 15350,3285 L 31988,3285 31988,4937 15350,4937 15350,3285 15350,3285 Z"/>
   <g fill="rgb(0,0,0)" stroke="none" font-family="Arial Narrow embedded" font-size="635" font-style="normal" font-weight="700">
    <text x="19327" y="3967">
     <tspan x="19327 19471 19788 19962">Info</tspan></text>
    <text fill="rgb(0,0,0)" stroke="none" x="17558" y="4699">
     <tspan x="17558">I</tspan></text>
   </g>
  </g>
  <g id="Drawing_2">
   <path fill="rgb(207,231,245)" stroke="none" d="M 8747,10525 L 4810,10525 4810,8239 12684,8239 12684,10525 8747,10525 Z"/>
   <path fill="none" stroke="rgb(128,128,128)" stroke-width="102" stroke-linejoin="round" d="M 8747,10525 L 4810,10525 4810,8239 12684,8239 12684,10525 8747,10525 Z"/>
   <g fill="rgb(0,0,0)" stroke="none" font-family="Arial Narrow embedded" font-size="635" font-style="normal" font-weight="700">
    <text x="5547" y="8872">
     <tspan x="5547 6030">OK</tspan></text>
    <text fill="rgb(0,0,0)" stroke="none" x="5215" y="9604">
     <tspan x="5215 5359 5676 5850">Info</tspan></text>
   </g>
  </g>
  ...
  <g>
   <path fill="none" stroke="rgb(51,153,255)" id="Drawing_78_0" stroke-width="102" stroke-linejoin="round" d="M 47291,16367 C 47291,17129 48093,16793 48482,17017"/>
   <path fill="rgb(51,153,255)" stroke="none" id="Drawing_78_1" d="M 48688,17383 L 48598,16917 48337,17064 48688,17383 Z"/>
  </g>
  <g>
   <path fill="none" stroke="rgb(51,153,255)" id="Drawing_79_0" stroke-width="102" stroke-linejoin="round" d="M 39417,4937 C 39417,14271 23887,8230 23425,16977"/>
   <path fill="rgb(51,153,255)" stroke="none" id="Drawing_79_1" d="M 23415,17383 L 23577,16937 23277,16929 23415,17383 Z"/>
  </g>
  ...
 </g>
</svg>
MrSparkly
  • the SVG shown here is a bit weird: the first element has a namespace, the others don't. The scope of a namespace declaration is the element and its descendants: https://www.w3.org/TR/xml-names/#scoping-defaulting – mirod Dec 19 '18 at 15:02

1 Answer

First of all, you don't need to use XML::LibXML::XPathContext because your XML is not using namespaces.

However, you will have to check each node's attributes to decide whether it is one you want. One way is to loop over the attributes of every node and, once you have found a matching node, work with it (extract attribute values, get child nodes, etc.) using the methods from XML::LibXML::Node.

use v5.10;
use strict;
use warnings;

use XML::LibXML;

my $doc = XML::LibXML->load_xml( location => $ARGV[0] );

NODES: for my $node ($doc->findnodes('//g')) {
    for my $attr ($node->attributes) {
        if ($attr->nodeName eq 'id' && $attr->value =~ /^Drawing/) {
            # it's a drawing node
            # do stuff
            next NODES;
        }
    }
    # it's not a drawing node
    for my $pathnode ($node->findnodes('path')) {
        # do stuff
    }
}

You can also use pure XPath to find the nodes.

my @drawings = $doc->findnodes('//g[starts-with(@id,"Drawing")]');
my @paths = $doc->findnodes('//path[not(ancestor::g[starts-with(@id,"Drawing")])]');
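As a rough sketch of the "do stuff" step, the two XPath expressions above can be combined with `getAttribute` and `textContent` to pull out the fill, the `d` data and the `tspan` text. The inline SVG is a trimmed-down stand-in for the one in the question (with no namespace declared), and the `@drawings` / `@loose` names and the hash layout are just illustrative choices, not anything the question prescribes.

```perl
use strict;
use warnings;
use XML::LibXML;

# Trimmed-down stand-in for the SVG in the question (assumption: no namespace).
my $svg = <<'SVG';
<svg version="1.2">
 <g id="Page1">
  <g id="Drawing_1">
   <path fill="rgb(255,211,32)" stroke="none" d="M 15350,3285 L 31988,3285 Z"/>
   <g><text x="19327" y="3967"><tspan x="19327">Info</tspan></text></g>
  </g>
  <g>
   <path fill="none" id="Drawing_78_0" d="M 47291,16367 C 47291,17129 48093,16793 48482,17017"/>
  </g>
 </g>
</svg>
SVG

my $doc = XML::LibXML->load_xml( string => $svg );

# One record per <g id="Drawing...">: its id, its direct child paths, its tspan text.
my @drawings;
for my $g ( $doc->findnodes('//g[starts-with(@id,"Drawing")]') ) {
    push @drawings, {
        id    => $g->getAttribute('id'),
        paths => [
            map { +{ fill => $_->getAttribute('fill'), d => $_->getAttribute('d') } }
                $g->findnodes('./path')
        ],
        text  => [ map { $_->textContent } $g->findnodes('.//tspan') ],
    };
}

# The d attribute of every <path> that has no "Drawing" <g> among its ancestors.
my @loose = map { $_->getAttribute('d') }
    $doc->findnodes('//path[not(ancestor::g[starts-with(@id,"Drawing")])]');

for my $drawing (@drawings) {
    print "$drawing->{id}: @{ $drawing->{text} }\n";
    print "  fill=$_->{fill} d=$_->{d}\n" for @{ $drawing->{paths} };
}
print "loose path: $_\n" for @loose;
```

Note that the second expression filters on the `id` of ancestor `g` elements, not on the path's own `id`, so a path such as `Drawing_78_0` inside an unnamed `g` still counts as a loose path.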

Credit to these posts for XPath reference:

XPath Select Nodes where all parent nodes do not contain specific attribute and value
XPath: using regex in contains function

beasy
  • Thanks, I got it to work using your code, though I did have to use XML::LibXML::XPathContext (as described in the linked article at the top of my post): i.e. NODES: for my $node ($xpc->findnodes('//x:g')) { ... – MrSparkly Dec 18 '18 at 23:58
  • @MrSparkly But your document doesn't mess with namespaces, like the linked question does (which is precisely the problem there). Did you try this as it stands, and/or am I missing something? – zdim Dec 19 '18 at 00:56
  • @zdim Yes, I tried it as it stands. Literally, switching between the original $doc->findnodes.. line and the $xpc->findnodes... line produces either nothing, or the results I want. – MrSparkly Dec 19 '18 at 01:03
  • @MrSparkly Hm, it works for me -- when I add prints (instead of the comment `# do stuff`) I get all those things you want. Even just printing `$attr` instead of the first `# do stuff` and `$pathnode` instead of the second shows everything. – zdim Dec 19 '18 at 06:36
  • @zdim I just tried it again: the original code doesn't work for me. I'm on Win, maybe it somehow makes a difference. I tried single slashes ('/g'), double slashes ('//g'): nothing. – MrSparkly Dec 19 '18 at 14:09
  • @MrSparkly OK. I just looked more closely, following a new comment, and your document _does_ have a namespace (on _one_ element!); perhaps that somehow makes it either work or not this way? At any rate, you did get it working so that's good :) – zdim Dec 19 '18 at 23:26
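
One possible explanation for the difference discussed in these comments, offered only as a guess: if the real file declares the SVG namespace on the root `svg` element (the excerpt in the question does not), then an unprefixed `//g` matches nothing, because in XPath 1.0 an unprefixed name only matches elements in no namespace. The sketch below uses a made-up minimal document to show the plain query, the `registerNs` query, and a `local-name()` query that ignores namespaces altogether.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

# Assumption: unlike the excerpt in the question, this hypothetical document
# declares the SVG namespace on the root element.
my $svg = <<'SVG';
<svg version="1.2" xmlns="http://www.w3.org/2000/svg">
 <g id="Drawing_1">
  <path d="M 0,0 L 1,1 Z"/>
 </g>
</svg>
SVG

my $doc = XML::LibXML->load_xml( string => $svg );

# Unprefixed name test: only matches <g> elements that are in no namespace.
my @plain = $doc->findnodes('//g');

# Bind a prefix of our choosing to the SVG namespace URI and use it in the query.
my $xpc = XML::LibXML::XPathContext->new($doc);
$xpc->registerNs( x => 'http://www.w3.org/2000/svg' );
my @prefixed = $xpc->findnodes('//x:g[starts-with(@id,"Drawing")]');

# Namespace-agnostic alternative: compare the local name only.
my @any_ns = $doc->findnodes('//*[local-name()="g"]');

printf "plain: %d, prefixed: %d, local-name: %d\n",
    scalar @plain, scalar @prefixed, scalar @any_ns;
```

The `local-name()` form is handy when a file mixes namespaced and non-namespaced elements, as the posted excerpt does, at the cost of slightly more verbose expressions.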