I have a problem parsing an XML file that contains unescaped special characters such as ", <, > or & in the attributes of an element. At the moment I use XMLReader with my own ContentHandler. Unfortunately, changing the XML is not an option, since I receive a huge number of files. Any idea what I could do?

Best!

Hans Sperker

3 Answers

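You have to change the XML to make it well-formed. The five special characters (&, <, >, " and ') must be encoded properly as entities or, in element content, wrapped in a CDATA section so that the parser lets them pass. Note that CDATA sections are not allowed inside attribute values, so in attributes escaping is the only option.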
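
For illustration, here is a minimal sketch of what "encoded properly" means for an attribute value, using the five predefined XML entities. It is written in JavaScript only to keep it short; `escapeXmlAttr` and `XML_ENTITIES` are hypothetical names, not part of any library. CDATA sections are not permitted inside attribute values, so escaping is what applies there.

```javascript
// Map each of the five XML special characters to its predefined entity.
// XML_ENTITIES and escapeXmlAttr are illustrative names, not a library API.
const XML_ENTITIES = {
  "&": "&amp;",
  "<": "&lt;",
  ">": "&gt;",
  '"': "&quot;",
  "'": "&apos;",
};

// Escape a raw string so it can be placed inside an XML attribute value.
function escapeXmlAttr(value) {
  return String(value).replace(/[&<>"']/g, (ch) => XML_ENTITIES[ch]);
}

const rawValue = 'a < b && c > "d"';
const element = `<node attr="${escapeXmlAttr(rawValue)}"/>`;
console.log(element);
// prints: <node attr="a &lt; b &amp;&amp; c &gt; &quot;d&quot;"/>
```

Whoever produces the files would apply something like this at write time, so that consumers can use any standard XML parser unchanged.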

If the five magic characters are not encoded properly, you aren't receiving well-formed XML. That ought to be the foundation of your contract with users.

Fix the existing files with a one-shot conversion.

duffymo

It's not XML. Don't call it XML, because you are misleading yourself. You're dealing with a proprietary data syntax, and you are missing out on all the benefits of using XML for data interchange. You can't use any of the wonderful tools that exist for processing XML, because your data is not XML. You're in the dark ages of data interchange that existed before XML was invented, where everyone had to write their own parsers and port them to multiple platforms, at vast cost. It may be expensive to switch from this mess to the modern world of open standards, but the investment will pay off quickly. Just don't let any of the stakeholders delude themselves into thinking that because your syntax is "almost XML", you are almost there in terms of reaping the benefits. XML is all or nothing.

Michael Kay
  • Wouldn't you say that this is one of the benefits of JSON? That the "all" is a much lower barrier than with XML. – WallMobile Jun 04 '13 at 05:19
  • JSON is equally strict about what is and what isn't allowed. The difference is that every JSON parser relaxes the rules in different ways. So it depends whether you are more comfortable in a world where the rules are clear and enforced, or where the rules are clear but you can sometimes get away with breaking them. – Michael Kay Jun 05 '13 at 07:07

It's not best practice, but you could use a regex to transform your almost-XML into proper XML before you open it with XMLReader. Something along these lines (JavaScript, just as a quick proof of concept):

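// Sample input: the attribute value contains raw <, ", & and >.
var xml = '<root><node attr="bad attr chars...<"&>..."/></root>';
// Escape & first so the entities added below are not escaped again.
// Without the g flag, each call fixes only the first match it finds.
xml = xml.replace(/("[^"]*)&([^"]*")/, '$1&amp;$2');
xml = xml.replace(/("[^"]*)<([^"]*")/, '$1&lt;$2');
xml = xml.replace(/("[^"]*)>([^"]*")/, '$1&gt;$2');
xml = xml.replace(/("[^"]*)"([^"]*")/, '$1&quot;$2');
alert(xml); // browser-only; use console.log(xml) elsewhere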
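
To handle more than one bad character or attribute per file, the same idea can be sketched a bit more generally. This is a heuristic under stated assumptions, not a real parser: it assumes attributes are always double-quoted, and that the closing quote of a value is the first `"` followed by another `name="`, by `>`, `/>` or `?>`. A value that itself contains such a sequence will be cut short. `escapeAttrValue`, `repairAttributes` and `ATTR_PATTERN` are hypothetical names.

```javascript
// Escape one attribute value. Ampersands that already start a known
// entity or character reference are left alone to avoid double-escaping.
function escapeAttrValue(value) {
  return value
    .replace(/&(?!(?:amp|lt|gt|quot|apos|#\d+|#x[0-9a-fA-F]+);)/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Heuristic (assumption): a value ends at the first double quote that is
// followed by the next attribute (whitespace + name=") or by the tag end.
const ATTR_PATTERN =
  /([\w:.-]+)="([\s\S]*?)"(?=\s+[\w:.-]+="|\s*\/?>|\s*\?>)/g;

// Rewrite every double-quoted attribute value in the almost-XML text.
function repairAttributes(almostXml) {
  return almostXml.replace(
    ATTR_PATTERN,
    (match, name, value) => `${name}="${escapeAttrValue(value)}"`
  );
}

const broken = '<root><node attr="bad attr chars...<"&>..."/></root>';
console.log(repairAttributes(broken));
// prints: <root><node attr="bad attr chars...&lt;&quot;&amp;&gt;..."/></root>
```

The repaired text could then be handed to XMLReader as usual. Since the pattern can also fire on `name="..."` sequences in element text, it is worth spot-checking the output on a sample of the real files before trusting it.
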
Community
twamley