I realize that there are many similar questions, but I am still unable to find the specific answer that I am looking for.

I am using Perl with the XML::LibXML library to read information from an XML file. The XML file has many nodes, each with child nodes (and grandchild nodes, and so on). I am trying to pull the information out of the XML file node by node, but I am getting lost in the weeds trying to figure out how to do that.

Here is just an example of what I am trying to do:

#!/usr/bin/perl -w

use XML::LibXML;

open ($xml_fh, "<test.xml");
my $dom = XML::LibXML->load_xml(IO => $xml_fh);
close($xml_fh);

foreach $chapter ($dom->findnodes('/file/chapter')) {
        my $chapterNumber = $chapter->findvalue('@number');
        print "Chapter #$chapterNumber\n";

         #I tried $dom->findnodes('/file/chapter/section') <-- spelling out the xPath with same results..
        foreach $section ($dom->findnodes('//section')) {
                my $sectionNumber = $section->findvalue('@number');
                print " Section #$sectionNumber\n";

                foreach $subsection ($dom->findnodes('//subsection')) {
                        my $subsectionNumber = $subsection->findvalue('@number');
                        print "  SubSection $subsectionNumber\n";
                }
        }
}

This specific XML file is set up like this:

<file>
 <chapter number="1">
  <section number="abc123">
   There is some data here I'd like to get to
   <subsection number="abc123.(s)(4)">
    Some additional data here
    <subsection number="deeperSubSec">
     There might even be deeper subsections
     </subsection>
   </subsection>
  </section>
 </chapter>
 <chapter number="208">
  <section number="dgfj23">
   There is some data here I'd like to get to also
   <subsection number="dgfj23.(s)(4)">
    Some additional data here also
    <subsection number="deeperSubSec44">
     There might even be deeper subsections also
     </subsection>
   </subsection>
  </section>
 </chapter>
<chapter number="998">
  <section number="xxxid">
   There is even more data here I'd like to get to also
   <subsection number="xxxid.(s)(4)">
    Some additional data also here too
    <subsection number="deeperSubSec999">
     There might even be deeper subsections also again
     </subsection>
   </subsection>
  </section>
 </chapter>
</file>

Unfortunately, what I wind up with is just a list of repeating data. I am sure that this is because of my nested for loops, but I am really not grasping the fundamentals of how to operate on this kind of data. Hopefully someone has some resources or insight they could provide.

Here is my current output:

Chapter #1
 Section #abc123
  SubSection abc123.(s)(4)
  SubSection deeperSubSec
  SubSection dgfj23.(s)(4)
  SubSection deeperSubSec44
  SubSection xxxid.(s)(4)
  SubSection deeperSubSec999
 Section #dgfj23
  SubSection abc123.(s)(4)
  SubSection deeperSubSec
  SubSection dgfj23.(s)(4)
  SubSection deeperSubSec44
  SubSection xxxid.(s)(4)
  SubSection deeperSubSec999
 Section #xxxid
  SubSection abc123.(s)(4)
  SubSection deeperSubSec
  SubSection dgfj23.(s)(4)
  SubSection deeperSubSec44
  SubSection xxxid.(s)(4)
  SubSection deeperSubSec999
Chapter #208
 Section #abc123
  SubSection abc123.(s)(4)
  SubSection deeperSubSec
  SubSection dgfj23.(s)(4)
  SubSection deeperSubSec44
  SubSection xxxid.(s)(4)
  SubSection deeperSubSec999
 Section #dgfj23
  SubSection abc123.(s)(4)
  SubSection deeperSubSec
  SubSection dgfj23.(s)(4)
  SubSection deeperSubSec44
  SubSection xxxid.(s)(4)
  SubSection deeperSubSec999
 Section #xxxid
  SubSection abc123.(s)(4)
  SubSection deeperSubSec
  SubSection dgfj23.(s)(4)
  SubSection deeperSubSec44
  SubSection xxxid.(s)(4)
  SubSection deeperSubSec999
Chapter #998
 Section #abc123
  SubSection abc123.(s)(4)
  SubSection deeperSubSec
  SubSection dgfj23.(s)(4)
  SubSection deeperSubSec44
  SubSection xxxid.(s)(4)
  SubSection deeperSubSec999
 Section #dgfj23
  SubSection abc123.(s)(4)
  SubSection deeperSubSec
  SubSection dgfj23.(s)(4)
  SubSection deeperSubSec44
  SubSection xxxid.(s)(4)
  SubSection deeperSubSec999
 Section #xxxid
  SubSection abc123.(s)(4)
  SubSection deeperSubSec
  SubSection dgfj23.(s)(4)
  SubSection deeperSubSec44
  SubSection xxxid.(s)(4)
  SubSection deeperSubSec999

So for each chapter, I am reading ALL sections, and then for each section I am reading ALL subsections, over and over again.

What I want to do is read, for each chapter, the associated sections, then for each of those sections, the associated subsections and any applicable sub-subsections therein.

Like this:

Chapter #1
 Section #abc123
  Subsection #abc123.(s)(4)
   Sub-Subsection #deeperSubSec
Chapter #208
 Section #dgfj23
  Subsection #dgfj23.(s)(4)
   Sub-Subsection #deeperSubSec44

etc...

Eventually, once I understand how the basic operation works, I'll also need to get at the data contained within each chapter, section, subsection, etc. But I think I need to walk before I run, so I'll start with just the values of the attributes.

Thank you for your help.

JerseyDevel

1 Answer


So I think I figured it out. I was calling findnodes on the $dom object every time, and with absolute paths such as //section, so every search started from the document root and matched the whole tree. What I needed to do was search relative to the node that I am currently looking at, like this:

#!/usr/bin/perl
use strict;
use warnings;

use XML::LibXML;

open(my $xml_fh, '<', 'test.xml') or die "Cannot open test.xml: $!";
my $dom = XML::LibXML->load_xml(IO => $xml_fh);
close($xml_fh);

# Each inner findnodes is called on the node from the enclosing loop,
# with a relative path, so it only matches that node's own children.
for my $chapter ($dom->findnodes('/file/chapter')) {
        print "Chapter #" . $chapter->findvalue('@number') . "\n";
        for my $section ($chapter->findnodes('section')) {
                print " Section #" . $section->findvalue('@number') . "\n";
                for my $subsection ($section->findnodes('subsection')) {
                        print "  Subsection #" . $subsection->findvalue('@number') . "\n";
                }
        }
}

which results in output more like I was hoping for:

Chapter #1
 Section #abc123
  Subsection #abc123.(s)(4)
Chapter #208
 Section #dgfj23
  Subsection #dgfj23.(s)(4)
Chapter #998
 Section #xxxid
  Subsection #xxxid.(s)(4)
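
Note that the loops above stop at the first subsection level, so nested subsections such as deeperSubSec are not printed. One way to reach them is a small recursive helper. The following is only a minimal sketch: `subsection_lines` is a hypothetical name, the XML is a trimmed inline copy of the sample data so that the script runs without test.xml, and it assumes that subsections nest only inside other subsections.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# Inline sample (trimmed from the question's data) so the sketch is self-contained.
my $xml = <<'XML';
<file>
 <chapter number="1">
  <section number="abc123">
   <subsection number="abc123.(s)(4)">
    <subsection number="deeperSubSec"/>
   </subsection>
  </section>
 </chapter>
 <chapter number="208">
  <section number="dgfj23">
   <subsection number="dgfj23.(s)(4)">
    <subsection number="deeperSubSec44"/>
   </subsection>
  </section>
 </chapter>
</file>
XML

my $dom = XML::LibXML->load_xml(string => $xml);

# Hypothetical helper: returns one indented line per subsection below $node,
# calling itself on each subsection so that any nesting depth is covered.
sub subsection_lines {
    my ($node, $depth) = @_;
    my @lines;
    for my $sub ($node->findnodes('subsection')) {
        push @lines, (' ' x $depth) . 'Subsection #' . $sub->getAttribute('number');
        push @lines, subsection_lines($sub, $depth + 1);
    }
    return @lines;
}

my @out;
for my $chapter ($dom->findnodes('/file/chapter')) {
    push @out, 'Chapter #' . $chapter->getAttribute('number');
    for my $section ($chapter->findnodes('section')) {
        push @out, ' Section #' . $section->getAttribute('number');
        push @out, subsection_lines($section, 2);
    }
}
print "$_\n" for @out;
```

The recursion uses the same relative path, `subsection`, at every level, so each call only ever sees the direct children of the node it was given.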

Here is a slightly neater version, which makes it clearer that each loop queries only the part of the tree obtained from the enclosing loop:

#!/usr/bin/perl
use strict;
use warnings;

use XML::LibXML;

open(my $xml_fh, '<', 'test.xml') or die "Cannot open test.xml: $!";
my $dom = XML::LibXML->load_xml(IO => $xml_fh);
close($xml_fh);

my @chapters = $dom->findnodes('/file/chapter');

for my $chapter (@chapters) {
        my $chapterNo = $chapter->findvalue('@number');
        print "Chapter #$chapterNo\n";

        # Only the sections that are children of this chapter
        my @sections = $chapter->findnodes('section');
        for my $section (@sections) {
                my $sectionNo = $section->findvalue('@number');
                print " Section #$sectionNo\n";

                # Only the subsections that are children of this section
                my @subsections = $section->findnodes('subsection');
                for my $subsection (@subsections) {
                        my $subsectionNo = $subsection->findvalue('@number');
                        print "  Subsection #$subsectionNo\n";
                }
        }
}
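
For the follow-up goal of reading the data inside each element, here is a minimal sketch using `getAttribute` and `textContent`, both of which are suggested in the comments on this answer. The one-line XML string is made up for illustration, and the `text()` query for an element's own text is just one possible approach, not something taken from the original code.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# Made-up sample without whitespace, so the resulting strings are easy to read.
my $dom = XML::LibXML->load_xml(string =>
    '<section number="abc123">Section text'
  . '<subsection number="s1">Subsection text</subsection></section>');
my $section = $dom->documentElement;

# Attribute lookup without going through XPath.
my $number = $section->getAttribute('number');

# textContent includes the text of all descendants, nested subsections too.
my $all_text = $section->textContent;

# Only the text nodes that are direct children of the section.
my $own_text = join '', map { $_->data } $section->findnodes('text()');

print "number:   $number\n";
print "all text: $all_text\n";
print "own text: $own_text\n";
```

With real data, which is indented, both strings will also contain the surrounding whitespace and newlines, so some trimming is likely to be needed.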
JerseyDevel
  • You could also use the syntax `$chapter->findnodes('.//section')` as shown in [this](https://stackoverflow.com/a/11955619/2173773) answer. – Håkon Hægland Dec 29 '21 at 19:22
  • @Håkon Hægland `.//section` is effectively `descendant::section`. It makes no sense to search all descendants here. `./section` is equivalent to `child::section`. This is the tool that makes the most sense here. And since the `./` can be omitted, the answer is using the optimal solution by using just `section`. – ikegami Dec 29 '21 at 19:44
  • The problem isn't that "$dom contains the entire XML tree". All nodes do. The problem was twofold: 1) Searching from the wrong node because you invoked `findnodes` on the wrong node, and 2) Searching from the wrong node because you were providing absolute paths. Both of these needed to be fixed, and both of these were fixed. – ikegami Dec 29 '21 at 20:00
  • Tip: Using `->findvalue('@x')` instead of `->getAttribute('x')` is wasteful. There's no reason to parse and execute an XPath just to get an attribute. – ikegami Dec 29 '21 at 20:00
  • @ikegami, thanks for the getAttribute tip. If I want to get the data contained within the XML tag, what is the command for that? Something like `$this->getValue('myTag')` does not work. I can't find much info in the source that I am able to easily decipher. Actually, I guess `findvalue` works there. – JerseyDevel Dec 29 '21 at 20:29
  • `->textContent()` – ikegami Dec 29 '21 at 20:32
  • @ikegami - as a followup, I have posted another question. Hoping you can clarify: https://stackoverflow.com/questions/70525891/perl-xmllibxml-get-data-outside-of-a-tag – JerseyDevel Dec 29 '21 at 23:04
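
The difference between the path forms discussed in these comments can be checked by counting matches from a single chapter node. This is only an illustrative sketch: the XML is a trimmed inline copy of the sample data, and the variable names are made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# Two chapters, each with one section, one subsection and one nested subsection.
my $xml = <<'XML';
<file>
 <chapter number="1">
  <section number="abc123">
   <subsection number="abc123.(s)(4)">
    <subsection number="deeperSubSec"/>
   </subsection>
  </section>
 </chapter>
 <chapter number="208">
  <section number="dgfj23">
   <subsection number="dgfj23.(s)(4)">
    <subsection number="deeperSubSec44"/>
   </subsection>
  </section>
 </chapter>
</file>
XML

my $dom = XML::LibXML->load_xml(string => $xml);
my ($chapter) = $dom->findnodes('/file/chapter');    # first chapter only

# Relative child step: only the sections directly under this chapter.
my @children = $chapter->findnodes('section');

# Relative descendant search: every subsection anywhere below this chapter.
my @descendants = $chapter->findnodes('.//subsection');

# Absolute path: starts at the document root, whatever node it is called on.
my @absolute = $chapter->findnodes('//subsection');

printf "section: %d, .//subsection: %d, //subsection: %d\n",
    scalar @children, scalar @descendants, scalar @absolute;
```

The absolute form matches the subsections of both chapters even though it is called on the first chapter, which is the behaviour that produced the repeated output in the question.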