Parsing Invalid XML Characters using XStream parser - Java

Question

I have a classic XML validation problem: I need to parse incoming XML from other applications that do not use a proper XML formatter. The files contain broken tags and XML special characters embedded in the data, without CDATA sections wrapped around them.

I am using the XStream parser to unmarshal the incoming stream, since it does simple serialization and is not a strict parser. For special characters it throws a ConverterException and will not parse the file.

I want to know whether there is any other parser that can handle invalid XML files (special characters etc.). We have no control over what is sent as the input stream and, as part of an auditing application, we need to read as many good records from the incoming file as possible.

Is there a better parsing option available, or do I need to write a custom parser for these files? I am using Spring Batch for batch processing and XStream (1.x) to parse the XML files.

As XSD validation is failing, I am wondering whether it is even worth exploring other parsers or a custom parser.

Looking for your expert opinions on XML validation.

score 2 · Answer 1 · answered Jun 11 '14 at 07:03

I understand that you are trying to make the best of messy input. Unfortunately, since there doesn't seem to be a clear specification of the format of that input, you are on your own. One approach is to first convert the input files to valid XML, which is basically what you would do by writing your own parser. In Java you could do this by reading and parsing the files with your own specialized code and emitting the result through a standard Java XML interface (SAX, DOM, etc.). Depending on your knowledge, though, it may be faster to use a different language that specializes in text parsing.
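As a rough illustration of the "convert to valid XML first" idea, here is a minimal Java sketch of a pre-processing pass. It assumes the whole input fits in a String, and it only covers two of the problems mentioned: characters that are not allowed in XML 1.0, and bare `&` characters that do not start an entity. The class name `XmlSanitizer` and its methods are made up for this example. It does not attempt to repair broken tags or stray `<` characters inside data, which need knowledge of the actual record layout.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of XStream or Spring Batch.
final class XmlSanitizer {

    // An '&' that is NOT the start of a predefined entity (&amp; &lt; &gt; &quot; &apos;)
    // or of a numeric character reference (&#38; or &#x26;).
    private static final Pattern BARE_AMP =
            Pattern.compile("&(?!(?:amp|lt|gt|quot|apos|#[0-9]+|#x[0-9A-Fa-f]+);)");

    // Character ranges permitted by the XML 1.0 "Char" production.
    static boolean isLegalXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Drops illegal characters, then escapes every bare ampersand.
    static String sanitize(String raw) {
        StringBuilder sb = new StringBuilder(raw.length());
        for (int i = 0; i < raw.length(); ) {
            int cp = raw.codePointAt(i);
            if (isLegalXmlChar(cp)) {
                sb.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return BARE_AMP.matcher(sb).replaceAll("&amp;");
    }
}
```

The sanitized string could then be handed to the existing XStream setup. Note that silently dropping or escaping characters is itself an interpretation of the data, so for an auditing application it is probably worth logging every record that was changed.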

My experience is that the only real long-term solution here is to force the data suppliers to provide valid XML. The reason is that, although you can do your best to make valid data out of the invalid data, there is always the risk that your interpretation is wrong. And half-valid data is often worse than no data at all. IMHO it is best to leave the responsibility for correct data with the suppliers.

I dunno... I do agree that the supplier is to blame, but there are a couple of issues. (1) The companies generating the invalid data are often big companies like Microsoft and Apple who offer no proper means of reporting bugs and rarely if ever fix anything. (2) We deal mostly with historical data, so even if every company in the world fixed all their code today and every user updated, someone still has to do something about all the existing documents, and that ultimately ends up being us. :( — Hakanai, Dec 18 '14 at 01:06
