Extract data from an XML file with XML::LibXML

Question

I have an XML file like this, containing thousands of entries:

<mediawiki>
  <page>
    <title>page1</title>
    <revision>
      <id>2621</id>
      <parentid>6</parentid>
      <timestamp>2005-10-09T01:00:18Z</timestamp>
      <contributor>
        <username>Chaos</username>
        <id>2</id>
      </contributor>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text xml:space="preserve">text1</text>
    </revision>
  </page>
  <page>
    <title>page2</title>
    <ns>8</ns>
    <id>7</id>
    <revision>
      <id>2619</id>
      <parentid>2618</parentid>
      <timestamp>2005-10-09T00:56:39Z</timestamp>
      <contributor>
        <username>Chaos</username>
        <id>2</id>
      </contributor>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text xml:space="preserve">text2</text>
    </revision>
  </page>
  <page>
    <title>page3</title>
    <ns>8</ns>
    <id>6</id>
    <revision>
      <id>2621</id>
      <parentid>6</parentid>
      <timestamp>2005-10-09T01:00:18Z</timestamp>
      <contributor>
        <username>Chaos</username>
        <id>2</id>
      </contributor>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text xml:space="preserve">text3</text>
    </revision>
  </page>
</mediawiki>

My script must write each page to its own text file. The file name is the content of the <title> tag, and the file contains the text of the <text xml:space="preserve"> element.

My code

my $filename = "pages.xml";
my $parser   = XML::LibXML->new();
my $xmldoc   = $parser->parse_file( $filename );
my $file;

foreach my $page ( $xmldoc->findnodes( '/mediawiki/page' ) ) {

    foreach my $title ( $page->findnodes( '/mediawiki/page/title' ) ) {

        foreach my $rev ( $page->findnodes( '/mediawiki/page/revision' ) ) {

            foreach my $text ( $rev->findnodes( 'text/text()' ) ) {

                $file = $title->to_literal();
                my $newfile = "$file.txt";

                open( my $out, '>:utf8', $newfile )
                        or die "Unable to open '$newfile' for write: $!";
                my $texte = $text->data;
                print $out "$text\n";
                close $out;
            }
        }
    }
}

The problem is that every file created contains the same text: that of the last <text xml:space="preserve"> element.

I have fixed the formatting of your post and added some indentation to your code. You're very welcome, but please make the effort to do that yourself in the future. If you're asking a large number of people to read and understand your post, it's polite to make it as legible as possible. — Dave Cross, May 31 '17 at 12:07
@DaveCross: I'm sorry to undo all your good work. It seems that I get no notification that there has been a superseding update when using a tablet to amend posts. — Borodin, May 31 '17 at 12:34
@rim: Please make sure to edit the *latest version* of your question. I'm not sure why you made the change, as all you seem to have done is to remove one page from the XML. I've reinstated the edits that Dave Cross and I made, and added your new XML after reformatting it. — Borodin, May 31 '17 at 12:50
@rim: Please take a look at [*What should I do when someone answers my question?*](http://stackoverflow.com/help/someone-answers). I don't expect an immediate acceptance; in fact I recommend that you wait a day or two in case a better answer comes along. But you could at least acknowledge my answer and explain whether my assumptions have been correct. — Borodin, May 31 '17 at 13:21
@rim: You were asked in a comment on your previous question [*reading file line by line*](https://stackoverflow.com/q/44001137/622310) to tidy your code. It was done for you then, too. Before that, in [*Remove the first line from my directory*](https://stackoverflow.com/q/43128122/622310) your poor formatting has been left as it is. I am sure you don't need a nurse to help you with your editing, so please will you try to present your questions a little better? — Borodin, May 31 '17 at 13:54
@rim: I also note that you haven't accepted or even commented on any of the answers that you have been offered, ever. I'm raising your account with the moderators. — Borodin, May 31 '17 at 13:55

Borodin · Answer 1 · 2017-05-31T12:50:23.480

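Your mistake is nesting all those for loops and using absolute XPath expressions where relative ones are needed. An expression that starts with / is evaluated from the document root, whatever the context node is, so for every page the inner loops visit every title and every revision in the document. Each file is rewritten once per revision, and the last one written wins.

This should do what you want: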

use utf8;
use strict;
use warnings 'all';
use feature 'say';

STDOUT->autoflush;

use XML::LibXML;

my $filename = "pages.xml";
my $doc      = XML::LibXML->load_xml( location => $filename );

for my $page ( $doc->findnodes('/mediawiki/page') ) {

    my ($title) = $page->findnodes('title');
    my $file = $title->textContent;

    my ($rev_text) = $page->findnodes('revision/text');
    my $text = $rev_text->textContent;

    open my $fh, '>:utf8', $file
        or die qq{Unable to open "$file" for output: $!};

    print $fh "$text\n";

    close $fh;

    say qq{File "$file" written with "$text"};
}

Output:

File "page1" written with "text1"
File "page2" written with "text2"
File "page3" written with "text3"
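
The difference between absolute and relative XPath expressions can be checked with a small sketch. It assumes a cut-down inline copy of the XML (two pages, title and text only) instead of pages.xml, and the array names are just for illustration. From each page context node it counts how many <title> nodes each form of the path matches.

```perl
use strict;
use warnings;
use XML::LibXML;

# Cut-down sample data (an assumption for illustration, not the real pages.xml)
my $xml = <<'XML';
<mediawiki>
  <page><title>page1</title><revision><text>text1</text></revision></page>
  <page><title>page2</title><revision><text>text2</text></revision></page>
</mediawiki>
XML

my $doc = XML::LibXML->load_xml( string => $xml );

my ( @absolute_counts, @relative_counts );

for my $page ( $doc->findnodes('/mediawiki/page') ) {

    # A leading "/" starts again at the document root, ignoring $page
    my @absolute = $page->findnodes('/mediawiki/page/title');

    # No leading "/": evaluated relative to the context node $page
    my @relative = $page->findnodes('title');

    push @absolute_counts, scalar @absolute;
    push @relative_counts, scalar @relative;

    printf "absolute: %d, relative: %d\n", scalar @absolute, scalar @relative;
}
```

With this two-page sample, each iteration should print "absolute: 2, relative: 1": the absolute path finds every title in the document, while the relative path finds only the title of the current page.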
