60

From the documentation of XML::Simple:

The use of this module in new code is discouraged. Other modules are available which provide more straightforward and consistent interfaces. In particular, XML::LibXML is highly recommended.

The major problems with this module are the large number of options and the arbitrary ways in which these options interact - often with unexpected results.

Can someone clarify for me what the key reasons for this are?

Community
Sobrique
  • 1
    It also may be interesting to hear pros/cons for https://metacpan.org/pod/XML::Fast – mpapec Oct 22 '15 at 05:47
  • Are you creating a knowledge-base article that you can link to on your quest to kill XML::Simple? :D – simbabque Oct 22 '15 at 08:07
  • Who me? Although, if anyone knows how to request add/remove from 'core', that would be interesting.... (As 'it's core' is the major argument _for_ XML::Simple. Just like CGI before it, I think that should change). – Sobrique Oct 22 '15 at 08:20
  • 7
    XML::Simple is not in the Perl core and never has been. In fact, there are no XML parsing modules in Perl core. – stu42j Oct 22 '15 at 15:53
  • OK. Fair point. I would quite like a "core" XML parser. 'spect it might not be so easy though. – Sobrique Oct 22 '15 at 15:55
  • @EvanCarroll - I'm referencing a quote from the docs on `XML::Simple`. But I'll happily accept dissenting opinions as to why the docs are wrong, and it _shouldn't_ be "discouraged". (I hold this view about perl threads). – Sobrique Oct 22 '15 at 16:00
  • I've submitted as an answer instead, because you're actually asking for an explanation I'll assume that you're considering it as an answer to the question. – Evan Carroll Oct 22 '15 at 16:24
  • Yes, that's marvellous. Thank you. – Sobrique Oct 22 '15 at 18:13
  • 15
    As the author of XML::Simple, I discourage its use because there are better solutions which are actually easier to use. I personally use and recommend XML::LibXML and have written a tutorial to help people get started - [XML::LibXML by example](http://grantm.github.io/perl-libxml-by-example/) – Grant McLean Apr 06 '16 at 21:13
  • 1
    Just came back here and read the comments. If you want something to be included in core, you can always suggest it on the p5p mailing list. If you have good arguments, they might go for it. – simbabque Oct 11 '16 at 09:52

3 Answers

55

The real problem is that what XML::Simple primarily tries to do is take XML, and represent it as a perl data structure.

As you'll no doubt be aware from perldata, the two key data structures you have available are the hash and the array.

  • Arrays are ordered lists of scalars.
  • Hashes are unordered key-value pairs.

And XML doesn't do either really. It has elements which are:

  • non uniquely named (which means hashes don't "fit").
  • .... but are 'ordered' within the file.
  • may have attributes (Which you could insert into a hash)
  • may have content (but might not, e.g. an empty-element tag)
  • may have children (Of any depth)

And these things don't map directly to the available perl data structures - at a simplistic level, a nested hash of hashes might fit - but it can't cope with elements with duplicated names. Nor can you differentiate easily between attributes and child nodes.
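To make the mismatch concrete, here is a minimal sketch in core Perl, with no XML module involved. The element names and the `@children` list are made up for illustration; the point is only what happens when sibling elements are pushed into a hash keyed by element name.

```perl
use strict;
use warnings;

# Hypothetical sibling elements, in document order: [ name, text ].
my @children = (
    [ item => 'first'  ],
    [ note => 'middle' ],
    [ item => 'second' ],
);

# Naive mapping: the element name becomes the hash key.
my %by_name;
$by_name{ $_->[0] } = $_->[1] for @children;

# Only two keys survive: the first 'item' was overwritten by the second,
# and the hash no longer records that 'note' sat between the two items.
printf "%d elements in, %d keys out\n", scalar @children, scalar keys %by_name;
print "item => $by_name{item}\n";
```

Avoiding the overwrite means storing arrays under each key, which is exactly the hash-of-arrays shape discussed further down, and the relative order of differently named siblings is still lost.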

So XML::Simple tries to guess based on the XML content, and takes 'hints' from the various option settings, and then when you try and output the content, it (tries to) apply the same process in reverse.

As a result, for anything other than the most simple XML, it becomes unwieldy at best, or loses data at worst.

Consider:

<xml>
   <parent>
       <child att="some_att">content</child>
   </parent>
   <another_node>
       <another_child some_att="a value" />
       <another_child different_att="different_value">more content</another_child>
   </another_node>
</xml>

This - when parsed through XML::Simple gives you:

$VAR1 = {
          'parent' => {
                      'child' => {
                                 'att' => 'some_att',
                                 'content' => 'content'
                               }
                    },
          'another_node' => {
                            'another_child' => [
                                               {
                                                 'some_att' => 'a value'
                                               },
                                               {
                                                 'different_att' => 'different_value',
                                                 'content' => 'more content'
                                               }
                                             ]
                          }
        };

Note - now you have under parent - just anonymous hashes, but under another_node you have an array of anonymous hashes.

So in order to access the content of child:

my $child = $xml -> {parent} -> {child} -> {content};

Note how you've got a 'child' key with a 'content' key beneath it, and that key isn't there because the XML has an element called content; it is simply where XML::Simple puts the element's text.

But to access the content beneath the second another_child element (the first has none):

my $another_child = $xml -> {another_node} -> {another_child} -> [1] -> {content};

Note how - because of having multiple <another_child> elements, the XML has been parsed into an array, where it wasn't with a single one. (If you did have an element called content beneath it, then you end up with something else yet). You can change this by using ForceArray but then you end up with a hash of arrays of hashes of arrays of hashes of arrays - although it is at least consistent in its handling of child elements. Edit: Note, following discussion - this is a bad default, rather than a flaw with XML::Simple.

You should set:

ForceArray => 1, KeyAttr => [], ForceContent => 1

If you apply this to the XML as above, you get instead:

$VAR1 = {
          'another_node' => [
                            {
                              'another_child' => [
                                                 {
                                                   'some_att' => 'a value'
                                                 },
                                                 {
                                                   'different_att' => 'different_value',
                                                   'content' => 'more content'
                                                 }
                                               ]
                            }
                          ],
          'parent' => [
                      {
                        'child' => [
                                   {
                                     'att' => 'some_att',
                                     'content' => 'content'
                                   }
                                 ]
                      }
                    ]
        };

This will give you consistency, because you will no longer have single node elements handled differently to multi-node ones.

But you still:

  • Have a 5 reference deep tree to get at a value.

E.g.:

print $xml -> {parent} -> [0] -> {child} -> [0] -> {content};

You still have content and child hash elements treated as if they were attributes, and because hashes are unordered, you simply cannot reconstruct the input. So basically, you have to parse it, then run it through Dumper to figure out where you need to look.

But with an xpath query, you get at that node with:

findnodes("/xml/parent/child"); 

What you don't get in XML::Simple that you do in XML::Twig (and I presume XML::LibXML but I know it less well):

  • xpath support. xpath is an XML way of expressing a path to a node. So you can 'find' a node in the above with get_xpath('//child'). You can even use attributes in the xpath - like get_xpath('//another_child[@different_att]') which will select exactly which one you wanted. (You can iterate on matches too).
  • cut and paste to move elements around
  • parsefile_inplace to allow you to modify XML with an in place edit.
  • pretty_print options, to format XML.
  • twig_handlers and purge - which allows you to process really big XML without having to load it all in memory.
  • simplify if you really must make it backwards compatible with XML::Simple.
  • the code is generally way simpler than trying to follow daisy chains of references to hashes and arrays, that can never be done consistently because of the fundamental differences in structure.

It's also widely available - easy to download from CPAN, and distributed as an installable package on many operating systems. (Sadly it's not a default install. Yet)

See: XML::Twig quick reference

For the sake of comparison:

my $xml = XMLin( \*DATA, ForceArray => 1, KeyAttr => [], ForceContent => 1 );

print Dumper $xml;
print $xml ->{parent}->[0]->{child}->[0]->{content};

Vs.

my $twig = XML::Twig->parse( \*DATA );
print $twig ->get_xpath( '/xml/parent/child', 0 )->text;
print $twig ->root->first_child('parent')->first_child_text('child');
Sobrique
  • 5
    _Sadly it's not a default install._ If by "default install" you mean core module, then yes, I agree with you. But if instead you mean bundled with a Perl distribution, Strawberry Perl has included pre-installed XML modules (XML::LibXML, XML::Parser, XML::Twig, etc.) since at least [May 2014](http://strawberryperl.com/release-notes/5.20.0.1-32bit.html), maybe longer. – Matt Jacob Oct 21 '15 at 20:16
  • 7
    IMO it largely boils down to that ForceArray should have defaulted to 1 (and that can't be changed without breaking most existing uses). If XML::Simple meets your needs, there's no reason not to use it. – ysth Oct 21 '15 at 20:26
  • I agree, but narrowly scope "meeting my needs" to "if I can't install one of the other modules", and if a regex hack won't do. Because honestly, I consider it very similar to regular expressions, for the same reason. It will work provided you have a very controlled scope of your input XML. And it might break one day, for no apparent reason. It does solve a problem, and it is a core module. But it is a poor solution when much better options exist – Sobrique Oct 21 '15 at 20:33
  • 5
    @Sobrique: I started to edit your solution, but when I got to the final paragraph and list I had to give up. Your stated aim was to explain why `XML::Simple` is such a poor choice, but you ended up writing fan mail for `XML::Twig`. If you want to go beyond explaining the problems with `XML::Simple` then you need to consider far more than just `XML::Twig` and `XML::LibXML`, and I don't believe this is the place for such extended analysis – Borodin Oct 22 '15 at 03:28
  • 2
    As I dislike offering "don't do X" without offering a suitable alternative, I was trying to offer some positive reasons to switch. Ideally ones that assist a business case. I am a fan of XML::Twig. I think that if they "simply" dropped XML::Simple from core, it would be a good replacement. Not least because "simplify" allows you to retain backwards compatibility. That is straying somewhat into opinion I know - there are plenty of other options that are good. – Sobrique Oct 22 '15 at 07:08
33

XML::Simple is the most complex XML parser available

The main problem with XML::Simple is that the resulting structure is extremely hard to navigate correctly. $ele->{ele_name} can return any of the following (even for elements that follow the same spec):

[ { att => 'val', ..., content => [ 'content', 'content' ] }, ... ]
[ { att => 'val', ..., content => 'content' }, ... ]
[ { att => 'val', ..., }, ... ]
[ 'content', ... ]
{ 'id' => { att => 'val', ..., content => [ 'content', 'content' ] }, ... }
{ 'id' => { att => 'val', ..., content => 'content' }, ... }
{ 'id' => { att => 'val', ... }, ... }
{ 'id' => { content => [ 'content', 'content' ] }, ... }
{ 'id' => { content => 'content' }, ... }
{ att => 'val', ..., content => [ 'content', 'content' ] }
{ att => 'val', ..., content => 'content' }
{ att => 'val', ..., }
'content'

This means that you have to perform all kinds of checks to see what you actually got. But the sheer complexity of this encourages developers to make very bad assumptions instead. This leads to all kinds of problems slipping into production, causing live code to fail when corner cases are encountered.
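As a rough illustration of those checks, here is a sketch in core Perl of a helper that pulls the text out of some of the shapes listed above. The name `texts_of` and the sample values are hypothetical, and the values are hand-written to mimic XML::Simple output rather than produced by it.

```perl
use strict;
use warnings;

# Return the text content of whatever $ele->{ele_name} turned out to be.
# Handles only a subset of the shapes listed above: a plain string, a hash
# with an optional 'content' key, or an array of either. The id-keyed
# shapes are not handled, because they cannot be told apart from an
# attribute hash without knowing the document format.
sub texts_of {
    my ($value) = @_;
    return () unless defined $value;

    my @items = ref($value) eq 'ARRAY' ? @$value : ($value);
    my @texts;
    for my $item (@items) {
        if ( !ref $item ) {
            push @texts, $item;    # bare 'content' string
        }
        elsif ( ref($item) eq 'HASH' && exists $item->{content} ) {
            my $c = $item->{content};
            push @texts, ref($c) eq 'ARRAY' ? @$c : $c;
        }
        # A hash without 'content' carries attributes only: nothing to add.
    }
    return @texts;
}

# Hand-written samples mimicking three of the shapes above.
my @a = texts_of('content');
my @b = texts_of( { att => 'val', content => 'content' } );
my @c = texts_of( [ { att => 'val' }, { content => [ 'x', 'y' ] } ] );
print scalar(@a), ' ', scalar(@b), ' ', scalar(@c), "\n";
```

Even this partial helper needs a branch per shape, and it still cannot cover the id-keyed cases without knowledge of the format.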

The options for making a more regular tree fall short

You can use the following options to create a more regular tree:

ForceArray => 1, KeyAttr => [], ForceContent => 1

But even with these options, many checks are still needed to extract information from a tree. For example, getting the /root/eles/ele nodes from a document is a common operation that should be trivial to perform, but the following is required when using XML::Simple:

# Requires: ForceArray => 1, KeyAttr => [], ForceContent => 1, KeepRoot => 0
# Assumes the format doesn't allow for more than one /root/eles.
# The format wouldn't be supported if it allowed /root to have an attr named eles.
# The format wouldn't be supported if it allowed /root/eles to have an attr named ele.
my @eles;
if ($doc->{eles} && $doc->{eles}[0]{ele}) {
    @eles = @{ $doc->{eles}[0]{ele} };
}

In another parser, one would use the following:

my @eles = $doc->findnodes('/root/eles/ele');
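For completeness, a self-contained version of that one-liner might look like the sketch below. It assumes XML::LibXML is installed from CPAN, and the sample document is made up to match the /root/eles/ele layout.

```perl
use strict;
use warnings;
use XML::LibXML;

# Made-up document matching the /root/eles/ele layout discussed above.
my $doc = XML::LibXML->load_xml( string => <<'XML' );
<root>
  <eles>
    <ele id="a">one</ele>
    <ele id="b">two</ele>
  </eles>
</root>
XML

# In list context findnodes returns a plain list of nodes; the list is
# simply empty when <eles> or <ele> is absent, so no existence checks
# are needed before using it.
my @eles = $doc->findnodes('/root/eles/ele');

for my $ele (@eles) {
    printf "%s: %s\n", $ele->getAttribute('id'), $ele->textContent;
}
```

Attributes and text are reached through separate methods, so an attribute and a child element with the same name cannot collide.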

XML::Simple imposes numerous limitations, and it lacks common features

  • It's completely useless for producing XML. Even with ForceArray => 1, ForceContent => 1, KeyAttr => [], KeepRoot => 1, there are far too many details that can't be controlled.

  • It doesn't preserve the relative order of children with different names.

  • It has limited (with XML::SAX backend) or no (with XML::Parser backend) support for namespaces and namespace prefixes.

  • Some backends (e.g. XML::Parser) are unable to handle encodings not based on ASCII (e.g. UTF-16le).

  • An element can't have a child element and an attribute with the same name.

  • It can't create XML documents with comments.

Ignoring the major issues previously mentioned, XML::Simple could still be usable with these limitations. But why go to the trouble of checking if XML::Simple can handle your document format and risk having to switch to another parser later? You could simply use a better parser for all your documents from the start.

Not only do some other parsers not subject you to these limitations, they provide loads of other useful features in addition. The following are a few features they might have that XML::Simple doesn't:

  • Speed. XML::Simple is extremely slow, especially if you use a backend other than XML::Parser. I'm talking orders of magnitude slower than other parsers.

  • XPath selectors or similar.

  • Support for extremely large documents.

  • Support for pretty printing.

Is XML::Simple ever useful?

The only format for which XML::Simple is simplest is one where no element is optional. I've had experience with countless XML formats, and I've never encountered such a format.

This fragility and complexity alone are reasons enough to warrant staying away from XML::Simple, but there are others.

Alternatives

I use XML::LibXML. It's an extremely fast, full-featured parser. If I ever needed to handle documents that didn't fit into memory, I'd use XML::LibXML::Reader (and its copyCurrentNode(1)) or XML::Twig (using twig_roots).

ikegami
  • XML::TreePP seems to me to not have the magic guessing XML::Simple has. But you can tell it how to behave exactly. It also is massively simpler to deal with than XML::LibXML and its family. For creating XML I would use XML::TreePP, for parsing external XML content perhaps XML::LibXML if you have giant XMLs and speed is an issue. – nicomen Oct 22 '15 at 14:19
  • 1
    @nicomen, Assuming you use `$tpp->set( force_array => [ '*' ] );`, you need at least `my @eles; if ($doc->{root} && $doc->{root}[0]{eles} && $doc->{root}[0]{eles}[0]{ele}) { @eles = @{ $doc->{root}[0]{eles}[0]{ele} } }` to get the `/root/eles/ele` nodes, and that's assuming there can't be multiple `eles` nodes. That's no different than an optimally configured XML::Simple. (It's way worse without `force_array => [ '*' ]`.) – ikegami Oct 22 '15 at 14:36
  • 1
    @nicomen, You say you'd use XML::TreePP over XML::LibXML for large documents. Why???? That sounds ludicrous to me, but I could be missing something. I haven't benchmarked XML::TreePP, but I suspect it doesn't come near XML::LibXML, large document or otherwise. The issue with large documents is memory, not speed. XML::LibXML does provide an option for large docs (a pull parser) whereas XML::TreePP doesn't. That said, XML::Twig is far better at it. – ikegami Oct 22 '15 at 14:43
  • I might have been unclear, I meant XML::LibXML was good for heavy duty and large documents. For easy writing, and reading I prefer XML::TreePP, but yes, you need to set some sane defaults. – nicomen Oct 23 '15 at 10:52
  • For XML::LibXML users, XML::LibXML::Reader might be easier to use than XML::Twig. – choroba Oct 26 '15 at 07:35
  • @choroba, Maybe, but probably not. XML::Twig's `twig_handlers` provides the subtree of the matching elements, whereas XML::LibXML::Reader is barely more than a tokenizer. XML::LibXML::Reader is closer to XML::Parser (the underlying parser used by XML::Twig) than to XML::Twig. – ikegami Oct 26 '15 at 14:07
  • @ikegami: copyCurrentNode(1) returns what you know from XML::LibXML. – choroba Oct 26 '15 at 21:28
  • @choroba, ah! nice! Updated answer. – ikegami Oct 27 '15 at 14:45
4

I disagree with the docs

I'll dissent and say that XML::Simple is just that.. simple. And, it's always been easy and enjoyable for me to use. Test it with the input you're receiving. So long as the input does not change, you're good. The same people that complain about using XML::Simple complain about using JSON::Syck to serialize Moose. The docs are wrong because they take into account correctness over efficiency. If you only care about the following, you're good:

  • not throwing away data
  • building to a format supplied and not an abstract schema

If you're making an abstract parser that isn't defined by application but by spec, I'd use something else. I worked at a company one time and we had to accept 300 different schemas of XML none of which had a spec. XML::Simple did the job easily. The other options would have required us to actually hire someone to get the job done. Everyone thinks XML is something that is sent in a rigid all encompassing spec'ed format such that if you write one parser you're good. If that's the case don't use XML::Simple. XML, before JSON, was just a "dump this and walk" format from one language to another. People actually used things like XML::Dumper. No one actually knew what was outputted. Dealing with that scenario XML::Simple is greattt! Sane people still dump to JSON without spec to accomplish the same thing. It's just how the world works.

Want to read the data in, and not worry about the format? Want to traverse Perl structures and not XML possibilities? Go XML::Simple.

By extension...

Likewise, for most applications JSON::Syck is sufficient to dump this and walk. Though if you're sending to lots of people, I'd highly suggest not being a douche nozzle and making a spec which you export to. But, you know what.. Sometime you're going to get a call from someone you don't want to talk to who wants his data that you don't normally export. And, you're going to pipe it through JSON::Syck's voodoo and let them worry about it. If they want XML? Charge them $500 more and fire up ye' ole XML::Dumper.

Take away

It may be less than perfect, but XML::Simple is damn efficient. Every hour saved in this arena you can potentially spend in a more useful arena. That's a real world consideration.

The other answers

Look, XPath has some upsides. Every answer here boils down to preferring XPath over Perl. That's fine. If you would rather use a standardized XML domain-specific language to access your XML, have at it!

Perl doesn't provide for an easy mechanism to access deeply nested optional structures.

my $xml = [ { foo => 1 } ];  ## Always w/ ForceArray.

my $xml = { foo => 1 };

Getting the value of foo here in these two contexts can be tricky. XML::Simple knows this and that's why you can force the former. However, even with ForceArray, if the element isn't there you'll throw an error.

my $xml = { bar => [ { foo => 1 } ] };

Now, if bar is optional, you're left accessing it as $xml->{bar}[0]{foo}, and @{$xml->{bar}}[0] will throw an error. Anyway, that's just Perl. This has 0 to do with XML::Simple imho. And, I admitted that XML::Simple is not good for building to spec. Show me data, and I can access it with XML::Simple.
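The plain-Perl side of this can be sketched without any XML module. The snippet below uses only core Perl (the `//` operator needs 5.10 or later); the variable names are made up, and it covers only the arrow-chained read, not every way of dereferencing a missing key.

```perl
use strict;
use warnings;

my $xml = {};    # 'bar' is optional and absent here

# Reading through the missing key returns undef rather than dying...
my $foo = $xml->{bar}[0]{foo};

# ...but the read autovivified the intermediate structures.
my $created = exists $xml->{bar} ? 1 : 0;

# A guarded read leaves the original structure alone.
my $clean     = {};
my $foo2      = ( $clean->{bar} // [] )->[0]{foo};
my $untouched = exists $clean->{bar} ? 0 : 1;

print "created=$created untouched=$untouched\n";
```

Either way the caller has to know in advance which levels are optional, which is the bookkeeping an XPath query avoids.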

Brad Larson
Evan Carroll
  • Comments are not for extended discussion; this conversation has been [moved to chat](http://chat.stackoverflow.com/rooms/93181/discussion-on-answer-by-evan-carroll-why-is-xmlsimple-discouraged). – George Stocker Oct 23 '15 at 14:33
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/93394/discussion-between-evan-carroll-and-ikegami). – Evan Carroll Oct 26 '15 at 17:23
  • I've removed the unnecessary meta-commentary targeted at another user. That doesn't really need to be part of the answer, and if you want to hash this out, take it to chat. – Brad Larson Oct 26 '15 at 17:32