Parsing XML file with perl - regex

Question

I'm just a beginner in Perl and urgently need to prepare a small script that takes the top 3 items from an XML file and puts them in a new one. Here's an example of the XML file:

<article>
  {lot of other stuff here}
</article>
<article>
  {lot of other stuff here}
</article>
<article>
  {lot of other stuff here}
</article>
<article>
  {lot of other stuff here}
</article>

What I'd like to do is get the first 3 items, along with all the tags in between, and put them into another file. Thanks in advance for all the help. Regards, Peter

possible duplicate of [How can I use Perl regular expressions to parse XML data?](http://stackoverflow.com/questions/2950661/how-can-i-use-perl-regular-expressions-to-parse-xml-data) — Quentin, Jun 03 '10 at 09:28
@SMark: Even if. -- Perl6 regular expressions are *still* the wrong tool for that. ;-) — Tomalak, Jun 03 '10 at 09:52

score 12 · Accepted Answer · edited May 23 '17 at 12:30

Never ever use Regex to handle markup languages.

The original version of this answer (see below) used XML::XPath. Grant McLean said in the comments:

XML::XPath is an old and unmaintained module. XML::LibXML is a modern, maintained module with an almost identical API and it's faster too.

so I made a new version that uses XML::LibXML (thanks, Grant):

use warnings;
use strict;
use XML::LibXML;

my $doc   = XML::LibXML->load_xml(location => 'articles.xml');
my $xp    = XML::LibXML::XPathContext->new($doc->documentElement);
my $xpath = '/articles/article[position() < 4]';

foreach my $article ( $xp->findnodes($xpath) ) {
  # now do something with $article
  print $article.": ".$article->getName."\n";
}

For me this prints:

XML::LibXML::Element=SCALAR(0x346ef90): article
XML::LibXML::Element=SCALAR(0x346ef30): article
XML::LibXML::Element=SCALAR(0x346efa8): article

The relevant types, for looking them up in the documentation:

The type of $doc will be XML::LibXML::Document.
The type of $xp is XML::LibXML::XPathContext.
The return type of $xp->findnodes() is XML::LibXML::NodeList.
The type of $article is XML::LibXML::Element.
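
To get from the loop above to the new file the question asks for, each matched node can be copied into a fresh document and serialized with its markup intact. This is a minimal sketch, assuming the input has an `<articles>` root element (the XPath above implies one, although the sample in the question does not show it). The `first_articles` helper and the in-memory sample data are made up for illustration.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: build a new document holding the first $count
# <article> elements of $doc. The <articles> root name is an assumption.
sub first_articles {
    my ($doc, $count) = @_;

    my $out  = XML::LibXML::Document->new('1.0', 'UTF-8');
    my $root = $out->createElement('articles');
    $out->setDocumentElement($root);

    for my $article ($doc->findnodes("/articles/article[position() <= $count]")) {
        # importNode copies the node, including its children, into the
        # new document and leaves the original document untouched.
        $root->appendChild($out->importNode($article));
    }
    return $out;
}

# Small in-memory sample standing in for articles.xml.
my $xml = <<'XML';
<articles>
  <article><title>one</title></article>
  <article><title>two</title></article>
  <article><title>three</title></article>
  <article><title>four</title></article>
</articles>
XML

my $doc    = XML::LibXML->load_xml(string => $xml);
my $result = first_articles($doc, 3);

# toString(1) serializes the whole document, tags included, with indentation.
print $result->toString(1);
```

Calling `toString` on a single node (`$article->toString`) likewise returns that element with its tags. To write the result to disk instead of printing it, `$result->toFile('top3.xml', 1)` could replace the final `print`; the file name is only an example.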

Original version of the answer, based on the XML::XPath package:

use warnings;
use strict;
use XML::XPath;

my $xp    = XML::XPath->new(filename => 'articles.xml');
my $xpath = '/articles/article[position() < 4]';

foreach my $article ( $xp->findnodes($xpath)->get_nodelist ) {
  # now do something with $article
  print $article.": ".$article->getName ."\n";
}

which prints this for me:

XML::XPath::Node::Element=REF(0x38067b8): article
XML::XPath::Node::Element=REF(0x38097e8): article
XML::XPath::Node::Element=REF(0x3809ae8): article

The type of $xp is XML::XPath, obviously.
The return type of $xp->findnodes() is XML::XPath::NodeSet.
The type of $article will be XML::XPath::Node::Element in this case.

Have a look at the docs to find out what you can do with them.
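
With XML::XPath, a node's string value holds only its text, so the child tags are missing when that value is printed. If the markup is wanted, `findnodes_as_string` returns the matched nodes serialized as XML. A small sketch, again assuming an `<articles>` root element; the inline sample data is invented for illustration.

```perl
use strict;
use warnings;
use XML::XPath;

# In-memory sample standing in for articles.xml.
my $xml = <<'XML';
<articles>
  <article><title>one</title></article>
  <article><title>two</title></article>
  <article><title>three</title></article>
  <article><title>four</title></article>
</articles>
XML

my $xp = XML::XPath->new(xml => $xml);

# Serialize the first three <article> elements, tags included.
my $fragment = $xp->findnodes_as_string('/articles/article[position() < 4]');
print $fragment, "\n";
```

The returned string is a sequence of elements without a common root, so it would need to be wrapped in a root element before being saved as a standalone XML file.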

answered Jun 03 '10 at 09:32

Tomalak

This is one case where a regex could easily do the job though. – Snake Plissken Jun 03 '10 at 11:15
@Snake Plissken: No, it isn't. Regex is *never* the right tool for that kind of job, no matter how "easy" it seems. XPath+Programming Language X (Perl in this case) is, or XSLT is. Regex is not. – Tomalak Jun 03 '10 at 11:20
You're being silly. In this case a regex can easily do the job. What are you going to do in the case that someone asks you to copy a non-XML file until something has been seen three times? – Snake Plissken Jun 03 '10 at 11:26
I guess there are exceptions to the rule. This will be just a simple job, so I guess XML will handle it; I won't use regex for hardcore HTML/XML parsing, though. – dusker Jun 03 '10 at 13:20
BTW, I tried printing $article in the foreach loop, but it doesn't print anything. – dusker Jun 03 '10 at 13:28
@Snake Plissken: I'm not being silly. I'm just trying to avoid being smart about when to use a proper parser. There is a nice XML parser built into Perl, there is absolutely no reason not to use it. (It's not "oh damn, I have to use a parser because this is too complex for regex", it's "oh damn, I can't use a parser because the language I use does not supply one". And the latter is almost never true.) – Tomalak Jun 03 '10 at 13:49
Now it's kind of working: when I try to print the contents of $article, it prints but omits all the tags in between. I'd like it to copy everything inside the tag, along with values and other tags. – dusker Jun 03 '10 at 16:40
Agreed here with Tomalak. Regexp are fine for some cases. Parsing XML is not one of them. – Robert P Jun 03 '10 at 23:49
FYI, XML::XPath is an old and unmaintained module. XML::LibXML is a modern, maintained module with an almost identical API and it's faster too. – Grant McLean Jun 04 '10 at 01:03
@Grant McLean: I've made a new version that uses `XML::LibXML`. Please have a look and comment on anything I could improve. – Tomalak Jun 04 '10 at 11:37

score 0 · Answer 2 · answered Jun 03 '10 at 11:24

Here:

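use strict;
use warnings;

open my $input,  "<", "file.xml"           or die "Cannot read file.xml: $!";
open my $output, ">", "truncated-file.xml" or die "Cannot write truncated-file.xml: $!";

my $n_articles = 0;
while (my $line = <$input>) {
    # Copy every line until the third closing tag has gone through.
    print {$output} $line;
    # A line containing </article> counts as one finished article.
    $n_articles++ if $line =~ m{</article>};
    last if $n_articles >= 3;
}

close $input  or die $!;
close $output or die $!;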

You really don't need an XML parser to do such a simple job.
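
The loop above counts lines, so it only stops early when the closing tags sit on separate lines. If the input has several articles on one line, or no line breaks at all, the whole file is copied. Whether that is what happened in the comment below is not confirmed in the thread. One possible variation, sketched here, sets Perl's input record separator `$/` to the closing tag so that each read returns one article regardless of line breaks. The `copy_first_records` helper and the sample string are hypothetical, and the approach still treats XML as plain text, so it would be fooled by a `</article>` inside a comment or CDATA section.

```perl
use strict;
use warnings;

# Hypothetical helper: copy the first $limit records ending in </article>
# from the $in filehandle to the $out filehandle.
sub copy_first_records {
    my ($in, $out, $limit) = @_;

    # Each read now returns text up to and including the next </article>.
    local $/ = '</article>';

    my $copied = 0;
    while (my $record = <$in>) {
        # A record without the closing tag is trailing text after the last article.
        last unless $record =~ m{</article>\z};
        print {$out} $record;
        last if ++$copied >= $limit;
    }
    return $copied;
}

# In-memory filehandles stand in for file.xml and truncated-file.xml.
my $xml = '<article>a</article><article>b</article>'
        . '<article>c</article><article>d</article>';
my $result = '';

open my $in,  '<', \$xml    or die $!;
open my $out, '>', \$result or die $!;
my $n = copy_first_records($in, $out, 3);
close $in  or die $!;
close $out or die $!;

print "$n articles copied: $result\n";
```

If the real file wraps the articles in a root element, the truncated output will be missing its closing tag, which is one of the reasons the parser-based answer is the safer choice.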

answered Jun 03 '10 at 11:24

Snake Plissken

What that script did is it copied all the contents of the file.xml into truncated-file.xml – dusker Jun 03 '10 at 13:19
Then it's debugging time for you. Anyway there is another answer you can use if this doesn't work. – Snake Plissken Jun 04 '10 at 03:16
I was referring to the other answer on this thread: http://stackoverflow.com/questions/2964637/parsing-xml-file-with-perl-regex/2964681#2964681 – Snake Plissken Jun 04 '10 at 07:35
