I am using the JAXB unmarshal method to convert XML data into Java objects. The code works, but when there is invalid data in one of the XML tags, the method throws an exception and stops immediately, e.g.:

org.springframework.oxm.UnmarshallingFailureException: JAXB unmarshalling exception; nested exception is javax.xml.bind.UnmarshalException
 - with linked exception:
[com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 2 of 2-byte UTF-8 sequence.]

org.springframework.oxm.UnmarshallingFailureException: JAXB unmarshalling exception; nested exception is javax.xml.bind.UnmarshalException
 - with linked exception:
[org.xml.sax.SAXParseException; lineNumber: 6; columnNumber: 45; An invalid XML character (Unicode: 0x1) was found in the element content of the document.]

Instead of throwing an exception straight away, I want it to convert as much of the XML data as it can, replacing invalid data with a space or an empty string (i.e. strip it and continue).

Is there a way to make the unmarshaler do that?

Otherwise, catching the exceptions would be OK (less ideal) if there is a way to let the unmarshaler continue from where it stopped.

Obviously, pre-processing the XML to strip out all invalid data before unmarshaling is another way, but it is not preferable unless there is no other way, as it means processing the XML data twice.

I am open to using another unmarshaler if JAXB cannot do what I want.

user1589188

1 Answer

First of all, you're asking about XML that is not well-formed rather than XML that is invalid. XML that is not well-formed violates the rules for being XML (and technically isn't XML). XML that is invalid merely violates the rules given by an XML schema. See Well-formed vs Valid XML for further details.

Given that background, it's easy to see the problem: XML that is not well-formed cannot even be parsed, so all compliant XML tools will be ineffective. (Remember, such data isn't even really XML.) What you should do is fix the problem at its source: Repair the code that's generating the "bad XML."

If fixing the errant code is not possible, then see How to parse invalid (bad / not well-formed) XML?
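
If the source cannot be fixed, one option is to filter the text before the parser ever sees it. The following is only a minimal sketch of that idea, not a JAXB feature: the class and method names (`XmlCharFilter`, `stripInvalidXmlChars`, `clean`) are made up for illustration, and it assumes the whole document fits in memory and is meant to be UTF-8. It replaces every code point outside the XML 1.0 `Char` range (such as the 0x1 in the question) with a space, and relies on `new String(bytes, UTF_8)` substituting U+FFFD for malformed byte sequences.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of JAXB or Spring OXM.
public class XmlCharFilter {

    // True if the code point is allowed by the XML 1.0 "Char" production.
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Replaces every disallowed code point (including lone surrogates)
    // with a single space.
    static String stripInvalidXmlChars(String in) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); ) {
            int cp = in.codePointAt(i);
            out.appendCodePoint(isValidXmlChar(cp) ? cp : ' ');
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    // Assumption: the input is intended to be UTF-8. Malformed byte
    // sequences are replaced with U+FFFD during decoding, then the
    // decoded text is filtered.
    static String clean(byte[] rawXml) {
        return stripInvalidXmlChars(new String(rawXml, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        String dirty = "<item>bad\u0001data</item>";
        System.out.println(stripInvalidXmlChars(dirty)); // prints <item>bad data</item>
    }
}
```

The cleaned string could then be handed to the unmarshaller through a `java.io.StringReader`. This is still a separate pass over the data, and it only repairs bad characters: markup that is broken in other ways (unclosed tags, unescaped `<` or `&`) will still fail to parse.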

kjhughes
  • Thank you for the info. The XML structure I have is totally fine; it's just potentially having invalid data within one of the tags (as stated in my question). e.g. invalid data – user1589188 Apr 21 '20 at 02:11
  • @user1589188 Your distinction between structure and data, while descriptively useful, is irrelevant wrt the well-formedness of the XML: The content between the tags has restrictions too, and your data is currently violating those restrictions and causing your textual data not to be XML. – kjhughes Apr 21 '20 at 11:54
  • Fix the data at its source, or see the linked Q/A on how to deal with "bad XML", or before you try to unmarshal, call a filtering function that does something intelligent with the bad characters: Map them to something good, or delete them. – kjhughes Apr 21 '20 at 11:56