I'm new to Perl, and I need to learn how to parse a basic XML file (really basic: just a few nested tags). This is a learning exercise to help us understand some intermediate parsing techniques. So I did what I normally do and googled for some examples. However, all the search results use modules like XML::Parser or XML::Simple, and I need to do it without modules like these.

Does anyone know of any good sources for examples of Perl XML parsing without these modules? I've heard that a stack is useful for handling nested tags (and for checking that the tags are closed properly).

Here's an example of something I'd need to parse. I need to be able to extract everything inside the tags, along with the names of the tags that contain it:

<?xml version="1.0"?>
<employee>
  <name>Bill</name>
  <age>22</age>
  <address>123 Bark St.</address>
  <manager>
    <name>Jack</name>
    <age>45</age>
  </manager>
</employee>
Bob
  • 2
    There are no good sources, since the right way to do that is to use an XML parser, and there is no built-in XML parser in Perl, so you need to use a module. – Casimir et Hippolyte May 02 '16 at 20:15
  • 3
    It's a moderately complicated thing to write an XML parser. To learn how it's done, just peruse the CPAN source code for those modules. Speaking for myself, I wouldn't reinvent the wheel on this one. – quest4truth May 02 '16 at 20:16
  • Why do you need to parse XML without a parser? – choroba May 02 '16 at 20:17
  • I need to do it without the modules because this is an exercise we're doing to understand some more advanced Perl parsing concepts. – Bob May 02 '16 at 20:39
  • 3
    @Bob Please [edit] that information into your question. The best answer to your question as posed is, "Don't try to re-invent the wheel, install an existing XML parser." If you make it clear that this is an exercise to learn about parsing, on the other hand, you'll get much better answers. You should also add an example of the XML you'll need to parse since that will show just how complex your parser needs to be. – ThisSuitIsBlackNot May 02 '16 at 21:30
  • Take care to add the version of perl you use, since perl6 has capabilities perl5 doesn't have. – Casimir et Hippolyte May 02 '16 at 23:50
  • @ThisSuitIsBlackNot Thanks, I've added it to my post. Sorry if it was unclear before. – Bob May 02 '16 at 23:56
  • 1
    It is still unclear, since you don't say what exactly you want to extract. – Casimir et Hippolyte May 02 '16 at 23:58
  • This sounds like homework. Surely your tutor didn't just drop the problem on you without any advice or explanation? It's hard to know how to help unless we're told what methods you're supposed to understand and what techniques you're practising – Borodin May 03 '16 at 05:55
  • 1
    If the solution your tutor wants you to use happens to be with regular expression, I would print [this](http://stackoverflow.com/a/1732454/1331451) on a huge poster and hang it in the classroom. Or on a t-shirt. – simbabque May 03 '16 at 12:56
  • http://stackoverflow.com/questions/19700843/how-to-pass-xml-data-to-perl-script-without-import-xml-parser-module – ssr1012 May 04 '16 at 14:42

1 Answer

There are no good solutions to 'parsing XML without a parser'. There's only "write a parser" or "use an existing one".

So - writing a parser.

First read the XML spec. It's quite long, and gives you an idea of what you should be able to handle.

Then write some code that implements it. For comparison:

XML::Twig

XML::LibXML

(XML::Simple is also an XML library, but see here for why it's "discouraged")

Note - they're quite long, even though XML::LibXML hands most of the actual parsing off to the libxml2 C library.

Then read this, as to why regular expressions are a bad idea: RegEx match open tags except XHTML self-contained tags

Now, with all that in mind - the task of 'write a parser' is a lot harder than it looks on the surface, because there's more going on in XML than you might think.

The task you have been set is therefore actually quite a horrible one - because at best all you can realistically accomplish is a sort of fake parser that works only if you limit the XML feature set to what you have been presented. This is bad for all the reasons that a regex-based parser is bad - not that you can't make it work, but that it relies on a set of assumptions that are not safe to make.

So - you could write a parser for your XML, and things fairly similar to it - but it wouldn't be an XML parser, and it would be brittle code - which is just nasty.

However, with that in mind - you can sort of fake it, badly, using recursive tag matching. That is, recurse into a new 'level' each time you hit an opening tag, and 'fold up' when you hit the matching closing tag. One of the nice features of the XML spec is that errors are fatal, so at least you don't have to handle some of the nastier tag nesting scenarios.
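As a rough illustration of that idea, here is a minimal sketch that uses an explicit stack instead of recursion. It assumes the input contains only an optional XML declaration, elements without attributes, and plain text, as in the question's example. The names `parse_simple_xml` and `flatten` are made up for this sketch, and it needs Perl 5.10 or later for the `//` operator. It is not an XML parser: attributes, comments, CDATA and self-closing tags make it die, and entities such as `&amp;` are passed through undecoded.

```perl
use strict;
use warnings;

# Hypothetical helper: builds a tree of { name, text, children } hashes
# from a very restricted XML subset. Any error is fatal.
sub parse_simple_xml {
    my ($xml) = @_;

    my $root  = { name => '#document', text => '', children => [] };
    my @stack = ($root);    # the innermost open element is $stack[-1]

    $xml =~ s/\A\s*<\?xml.*?\?>//s;    # skip the declaration, if present

    # Tokenise into closing tags, opening tags and runs of text.
    while ( $xml =~ m{\G(?:</([\w.-]+)\s*>|<([\w.-]+)\s*>|([^<]+))}gc ) {
        my ( $close, $open, $text ) = ( $1, $2, $3 );
        if ( defined $open ) {
            # Opening tag: attach a new node and go one level deeper.
            my $node = { name => $open, text => '', children => [] };
            push @{ $stack[-1]{children} }, $node;
            push @stack, $node;
        }
        elsif ( defined $close ) {
            # Closing tag: it must match the innermost open element.
            die "Unexpected </$close>\n" if @stack == 1;
            my $node = pop @stack;
            die "Expected </$node->{name}> but found </$close>\n"
                if $node->{name} ne $close;
        }
        elsif ( $text =~ /\S/ ) {
            # Non-blank text belongs to the innermost open element.
            die "Text outside the root element\n" if @stack == 1;
            $stack[-1]{text} .= $text;
        }
    }

    # If the tokeniser stopped early, we hit markup outside the subset.
    my $pos = pos($xml) // 0;
    die "Unsupported markup near: " . substr( $xml, $pos, 20 ) . "\n"
        if $pos < length $xml;
    die "Unclosed <$stack[-1]{name}>\n" if @stack > 1;
    die "Expected exactly one root element\n"
        unless @{ $root->{children} } == 1;

    return $root->{children}[0];
}

# Walk the tree and return one "path = text" string per leaf element.
# Text that sits next to child elements (mixed content) is ignored here.
sub flatten {
    my ( $node, $prefix ) = @_;
    my $path = defined $prefix ? "$prefix/$node->{name}" : $node->{name};
    return ("$path = $node->{text}") unless @{ $node->{children} };
    return map { flatten( $_, $path ) } @{ $node->{children} };
}

my $tree = parse_simple_xml(<<'XML');
<?xml version="1.0"?>
<employee>
  <name>Bill</name>
  <manager>
    <name>Jack</name>
  </manager>
</employee>
XML

print "$_\n" for flatten($tree);
```

The stack does the job the recursion would otherwise do: every opening tag pushes a node, every closing tag pops one and checks the name, and anything left on the stack at the end is an unclosed element. Because every problem is a `die`, wrap the call in `eval { ... }` if you want to report the error instead of aborting.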

Sobrique