1

I'm just a beginner in Perl and urgently need to prepare a small script that takes the first 3 items from an XML file and puts them in a new one. Here's an example of the XML file:

<article>
  {lot of other stuff here}
</article>
<article>
  {lot of other stuff here}
</article>
<article>
  {lot of other stuff here}
</article>
<article>
  {lot of other stuff here}
</article>

What I'd like to do is get the first 3 items, along with all the tags inside them, and put them into another file. Thanks in advance for all the help. Regards, Peter

dusker
  • possible duplicate of [How can I use Perl regular expressions to parse XML data?](http://stackoverflow.com/questions/2950661/how-can-i-use-perl-regular-expressions-to-parse-xml-data) – Quentin Jun 03 '10 at 09:28
  • @SMark: Even if. -- Perl6 regular expressions are *still* the wrong tool for that. ;-) – Tomalak Jun 03 '10 at 09:52

2 Answers

12

Never ever use Regex to handle markup languages.

The original version of this answer (see below) used XML::XPath. Grant McLean said in the comments:

XML::XPath is an old and unmaintained module. XML::LibXML is a modern, maintained module with an almost identical API and it's faster too.

so I made a new version that uses XML::LibXML (thanks, Grant):

use warnings;
use strict;
use XML::LibXML;

my $doc   = XML::LibXML->load_xml(location => 'articles.xml');
my $xp    = XML::LibXML::XPathContext->new($doc->documentElement);
my $xpath = '/articles/article[position() < 4]';

foreach my $article ( $xp->findnodes($xpath) ) {
  # now do something with $article
  print $article.": ".$article->getName."\n";
}

For me this prints:

XML::LibXML::Element=SCALAR(0x346ef90): article
XML::LibXML::Element=SCALAR(0x346ef30): article
XML::LibXML::Element=SCALAR(0x346efa8): article
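Printing the node object itself only shows its reference address. To get the markup of each article, child tags included, calling `toString` on the element should work. The following is a minimal sketch, not a definitive solution: the sample data, the `<articles>` root element, and the variable names are assumptions made for illustration.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical sample input; a real script would pass
# location => 'articles.xml' to load_xml instead of a string.
my $xml = <<'XML';
<articles>
  <article><title>one</title></article>
  <article><title>two</title></article>
  <article><title>three</title></article>
  <article><title>four</title></article>
</articles>
XML

my $doc = XML::LibXML->load_xml(string => $xml);

# toString() serializes each element together with everything inside it.
my @chunks = map { $_->toString }
             $doc->findnodes('/articles/article[position() < 4]');

# Wrap the copied elements in a new root so the result is well-formed XML.
my $output = join "\n", '<articles>', @chunks, '</articles>';
print "$output\n";
```

To put the result into another file, print `$output` to a filehandle opened for writing instead of to standard output.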


Original version of the answer, based on the XML::XPath package:

use warnings;
use strict;
use XML::XPath;

my $xp    = XML::XPath->new(filename => 'articles.xml');
my $xpath = '/articles/article[position() < 4]';

foreach my $article ( $xp->findnodes($xpath)->get_nodelist ) {
  # now do something with $article
  print $article.": ".$article->getName ."\n";
}

which prints this for me:

XML::XPath::Node::Element=REF(0x38067b8): article
XML::XPath::Node::Element=REF(0x38097e8): article
XML::XPath::Node::Element=REF(0x3809ae8): article

Have a look at the docs to find out what you can do with them.

Tomalak
  • 1
    This is one case where a regex could easily do the job though. – Snake Plissken Jun 03 '10 at 11:15
  • 5
    @Snake Plissken: No, it isn't. Regex is *never* the right tool for that kind of job, no matter how "easy" it seems. XPath+Programming Language X (Perl in this case) is, or XSLT is. Regex is not. – Tomalak Jun 03 '10 at 11:20
  • You're being silly. In this case a regex can easily do the job. What are you going to do in the case that someone asks you to copy a non-XML file until something has been seen three times? – Snake Plissken Jun 03 '10 at 11:26
  • I guess there are exceptions to the rule. This will be just a simple job, so I guess XML will handle it; I won't use regex for any hardcore HTML/XML parsing though. – dusker Jun 03 '10 at 13:20
  • BTW, I tried printing $article in the foreach loop, but it doesn't print anything. – dusker Jun 03 '10 at 13:28
  • 2
    @Snake Plissken: I'm not being silly. I'm just trying to avoid being smart about when to use a proper parser. There is a nice XML parser built into Perl, there is absolutely no reason not to use it. (It's not "oh damn, I have to use a parser because this is too complex for regex", it's "oh damn, I can't use a parser because the language I use does not supply one". And the latter is almost never true.) – Tomalak Jun 03 '10 at 13:49
  • Now it's kind of working: when I try to print the contents of $article, it prints but omits all the tags in between. I'd like it to copy everything inside the tag, along with the values and other tags. – dusker Jun 03 '10 at 16:40
  • Agreed here with Tomalak. Regexp are fine for some cases. Parsing XML is not one of them. – Robert P Jun 03 '10 at 23:49
  • 1
    FYI, XML::XPath is an old and unmaintained module. XML::LibXML is a modern, maintained module with an almost identical API and it's faster too. – Grant McLean Jun 04 '10 at 01:03
  • @Grant McLean: I've made a new version that uses `XML::LibXML`. Please have a look and comment on anything I could improve. – Tomalak Jun 04 '10 at 11:37
0

Here:

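use strict;
use warnings;

# Copy the input line by line and stop after the third closing </article> tag.
# Assumes at most one </article> per line.
open my $input, "<", "file.xml" or die $!;
open my $output, ">", "truncated-file.xml" or die $!;
my $n_articles = 0;
while (<$input>) {
    print $output $_;
    if (m:</article>:) {
        $n_articles++;
        last if $n_articles >= 3;
    }
}
close $input or die $!;
close $output or die $!;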

You really don't need an XML parser to do such a simple job.

Snake Plissken