
Someone sent me an XML 1.0 file. The file contains illegal characters such as &#x1E, and I cannot do anything about that; treat it as a given of the problem.

Java parsers (dom4j-1.6.1.jar) of course fail. I tried changing the XML version in the header to 1.1, but that doesn't work either. It may be a parser version problem, I don't know.

I'm wondering what the best possible solutions are.

My workaround at the moment: strip the illegal characters with a regexp before parsing.
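For reference, a minimal sketch of that kind of pre-parse cleanup. The class and method names (`XmlSanitizer`, `stripIllegalXml10`) are made up for illustration, and it assumes the bad characters show up either as numeric character references terminated by a semicolon (for example `&#x1E;`) or as raw control characters. It removes them outright, so the affected attribute values are altered, and it does not treat CDATA sections specially.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of dom4j or the JDK.
final class XmlSanitizer {

    // Numeric character references: hexadecimal (&#x1E;) or decimal (&#30;).
    private static final Pattern CHAR_REF =
            Pattern.compile("&#(?:x([0-9A-Fa-f]+)|([0-9]+));");

    // The Char production of XML 1.0.
    static boolean isLegalXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    public static String stripIllegalXml10(String xml) {
        // Pass 1: drop character references that point at illegal code points.
        Matcher m = CHAR_REF.matcher(xml);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            int cp;
            try {
                cp = m.group(1) != null
                        ? Integer.parseInt(m.group(1), 16)
                        : Integer.parseInt(m.group(2));
            } catch (NumberFormatException e) {
                cp = -1; // value too large to be a code point: treat as illegal
            }
            String replacement = isLegalXml10(cp) ? m.group() : "";
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb);

        // Pass 2: drop raw characters that are illegal in XML 1.0
        // (unpaired surrogates are dropped as well).
        StringBuilder out = new StringBuilder(sb.length());
        for (int i = 0; i < sb.length(); ) {
            int cp = sb.codePointAt(i);
            if (isLegalXml10(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

The cleaned string can then be handed to the parser through a `StringReader`, for example `new SAXReader().read(new StringReader(clean))` with dom4j.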

Is that really the only solution? Is there a schema or external entity (?) definition I could use? Or another parser? The illegal characters are in attributes, so I think CDATA will not work.

It's really a nasty problem.

The XML is generated by a Windows web service framework, I don't know which one. I'm not aware of any simple fix that could be made on the generation side. It would have to be really simple; otherwise the web service provider will not implement it.

Glasnhost
    Character references to the control characters #x1 through #x1F are allowed in XML 1.1 (https://www.w3.org/TR/xml11/#sec-xml11). So changing version to 1.1 and using a parser that accepts XML 1.1 might be a solution. See https://stackoverflow.com/questions/9312517/how-can-i-parse-xml-that-confirms-to-the-1-1-spec-using-java-and-xerces. – mzjn May 04 '19 at 11:10
    Somebody didn't send you an XML 1.0 file. They sent you a file which was quite similar to an XML 1.0 file. Use of standards for data exchange is a really good idea, it saves everyone a lot of time and money, and that's why standards like XML 1.0 exist. But they're not good if people don't follow the rules. You've got to decide whether to accept the extra cost of processing non-XML data. But the first thing to do is change your mindset: this isn't XML, so don't call it XML, and don't expect to use XML tools to process it. – Michael Kay May 04 '19 at 13:57
