2

I have an XML document from an external source that I need to parse every day with the XML::Simple Perl module. My script runs from crontab and works fine when the XML document is well-formed. But when the document is not valid, the script dies with an error message like this:

junk after document element at line 740774, column 0, byte 36355798 at /usr/local/lib/perl/5.18.2/XML/Parser.pm line 187.

I found this line in the XML document and it looks like this:

<item>
    <element1>value1</element1>
    <element2>value2</element2>
    value3</element3>
    <element4>value4</element4>
</item>

Can I parse this broken document without the script dying? Maybe drop this item with a warning (and not die!), or somehow ignore the errors?

netdjw
  • No, you can't parse malformed XML. You need to persuade whoever is creating the data to do it properly, or to fix it yourself before you process it. Are the errors always similar? – Borodin Mar 13 '15 at 13:32
  • No, they vary... I think it's coming from the developers of the source system. If they make mistakes, I get malformed XML. – netdjw Mar 13 '15 at 13:43
  • Don't think of it as XML. Think of it as a proprietary syntax invented by the originator. Write a grammar for this syntax, reverse-engineering it if necessary, and then write a parser for this grammar. Expensive, but entirely doable. If you want a cheaper option, persuade the supplier to adopt XML: using standards saves everyone money. – Michael Kay Mar 13 '15 at 14:38
  • Alternatively, just don't use this data feed. After all, if they can't get the syntax right, why should you trust the content? It's probably garbage. – Michael Kay Mar 13 '15 at 14:39
  • It's not an alternative. I _need_ to use it. But building my own parser... that's a smart idea. Thanks. – netdjw Mar 14 '15 at 14:04

1 Answer

3

You don't. Malformed XML is a fatal error, and you should absolutely not try to fix it.

It's a fatal error by definition because without it being so, you end up with parsers having to handle all sorts of edge cases. So you should reject the XML, and tell your people upstream to fix it.

See: Dealing with malformed XML

And especially: http://www.xml.com/axml/notes/Draconian.html

We want XML to empower programmers to write code that can be transmitted across the Web and execute on a large number of desktops. However, if this code must include error-handling for all sorts of sloppy end-user practices, it will of necessity balloon in size to the point where it, like Netscape Navigator, or Microsoft Internet Explorer, is tens of megabytes in size, thus defeating the purpose.

In this case - you should also not use XML::Simple, which has this in its docs:

The use of this module in new code is discouraged. Other modules are available which provide more straightforward and consistent interfaces.

Basically - the name XML::Simple is misleading: it isn't a simple XML parser, it's a parser for simple XML. And there are better options.

I would suggest considering something like XML::Twig instead. (There are other options - this is my favourite).

But neither will handle malformed XML - any parser that does is by definition broken.
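What you can do is stop the cron job from dying while still rejecting the bad document. A minimal sketch, assuming XML::Twig is installed: its `safe_parse` method wraps the parse in an `eval`, so a malformed document gives you a false return value and an error in `$@` instead of killing the script. The helper name `parse_feed` and the sample documents are made up for illustration.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical helper (name and return convention are illustrative):
# returns the parsed twig, or undef if the input is not well-formed.
sub parse_feed {
    my ($xml) = @_;
    my $twig = XML::Twig->new;

    # safe_parse runs the parse inside an eval, so malformed input
    # produces a false return value instead of a fatal error.
    if ( $twig->safe_parse($xml) ) {
        return $twig;
    }

    # Log the parser's message so it can be passed on upstream.
    warn "Rejected feed, not well-formed XML: $@";
    return;
}

# Made-up sample data: one well-formed document, one with the same
# kind of mismatched tag as in the question.
my $good = '<items><item><element1>value1</element1></item></items>';
my $bad  = '<items><item>value3</element3></item></items>';

for my $doc ( $good, $bad ) {
    if ( my $twig = parse_feed($doc) ) {
        my @items = $twig->root->children('item');
        print "parsed ", scalar(@items), " item(s)\n";
    }
    else {
        print "skipped this run\n";
    }
}
```

For a file on disk, `safe_parsefile` works the same way. Note that this rejects the whole document, not just the broken item: once the input stops being well-formed the parser cannot know where the next item really starts.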

Sobrique
  • @netdjw: It is very easy for the originator to validate the XML before they send it to you. They can even do it online at [`xmlvalidation.com`](http://www.xmlvalidation.com/) – Borodin Mar 13 '15 at 14:10