I have a String containing a binary 0 (the NUL character, "A\u0000B"). JAXB happily marshals an XML document containing that character in UTF-8, but then fails to unmarshal it:

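import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.Marshaller;
import javax.xml.bind.Unmarshaller;

final JAXBContext jaxbContext = JAXBContext.newInstance(Root.class);
final Marshaller marshaller = jaxbContext.createMarshaller();
final Unmarshaller unmarshaller = jaxbContext.createUnmarshaller();

Root root = new Root();
root.value = "A\u0000B";

// Marshalling succeeds and writes the NUL character as a raw 0x00 byte.
final ByteArrayOutputStream os = new ByteArrayOutputStream();
marshaller.marshal(root, os);

// Reading the same bytes back fails with the SAXParseException shown below.
unmarshaller.unmarshal(new ByteArrayInputStream(os.toByteArray()));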

The Root class is trivial:

@XmlRootElement
class Root { @XmlValue String value; }

The output XML also contains the binary 0 between A and B (in hex: 41 00 42), which causes the following error during unmarshalling:

org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 63; 
An invalid XML character (Unicode: 0x0) was found in the element content of the document.

Interestingly, using the raw DOM API (example) produces an escaped 0: A&#0;B, but trying to read it back yields a similar error. In fact 0 (neither binary nor escaped) is not accepted by any XML parser or by xmllint (see also: Python + Expat: Error on &#0; entities).
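The parser side of this can be reproduced without JAXB. The following is a minimal sketch, assuming the JDK's built-in DOM parser; the class name NulRejectionDemo and the helper parses are made up for illustration. It feeds the parser the same text with no NUL, with a raw NUL, and with the escaped form.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of the original question.
final class NulRejectionDemo {

    /** Returns true if the JDK's default DOM parser accepts the given document text. */
    static boolean parses(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // DefaultHandler rethrows fatal errors instead of printing them to stderr.
        builder.setErrorHandler(new DefaultHandler());
        try {
            builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(parses("<root>AB</root>"));        // no NUL at all
        System.out.println(parses("<root>A\u0000B</root>"));  // raw NUL character
        System.out.println(parses("<root>A&#0;B</root>"));    // NUL as a character reference
    }
}
```

Only the first document is expected to parse; the raw and the escaped NUL should both be rejected as fatal errors.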

My questions:

  1. why JAXB/DOM API allows creating invalid XML documents which it can not read back? Shouldn't it fail fast during marshalling?

  2. is there some elegant and global solution?

But shouldn't a mature XML stack in Java (I'm using 1.7.0_05) handle this either by default or through some simple setting? I'm looking for escaping, ignoring or failing fast, but the default behavior of generating invalid XML is not acceptable. I believe such fundamental functionality should not require any extra coding on the client side.

Tomasz Nurkiewicz
  • I recently wrote some test cases to check that I handled the 'An invalid XML character (Unicode: 0x0)' scenario, and my life would have been easier if I'd known I could actually use the Marshaller to inject the null (rather than editing the String directly), but I doubt that's the reason. – matt freake Oct 08 '12 at 10:34
  • See also http://stackoverflow.com/questions/5815134/invalid-xml-character-during-unmarshall – Catchwa Oct 08 '12 at 11:44

1 Answer

why JAXB/DOM API allows creating invalid XML documents which it can not read back? Shouldn't it fail fast during marshalling?

  1. You would need to ask the implementors.

  2. It is possible that they thought the expense of checking every serialised data character was not justified ... especially if the parser is then going to check them all over again.

  3. Having decided to implement the serializer this way (or having just done so by mistake), if they then changed the behaviour to do strict checking by default, they would break existing code that depends on being able to serialise illegal XML.

But shouldn't mature XML stack in Java (I'm using 1.7.0_05) handle this either by default or by having some simple setting?

Not necessarily ... if you accept reason #2 above. Even a simple setting could have a measurable impact on performance.


Also 0 (neither binary nor escaped) is not allowed by any XML parser or xmllint ...

Quite rightly so! It is forbidden by the XML spec.
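For reference, the XML 1.0 spec defines the legal characters in its Char production: #x9, #xA, #xD, [#x20-#xD7FF], [#xE000-#xFFFD] and [#x10000-#x10FFFF]. A minimal sketch of that rule as a Java check follows; the class and method names are hypothetical, not part of any library.

```java
// Hypothetical helper illustrating the XML 1.0 Char production.
final class XmlChars {

    /** True if the code point matches the XML 1.0 Char production. */
    static boolean isLegal(int codePoint) {
        return codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD
                || (codePoint >= 0x20 && codePoint <= 0xD7FF)
                || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
                || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }

    /** Returns the UTF-16 index of the first illegal character, or -1 if there is none. */
    static int firstIllegalIndex(String s) {
        for (int i = 0; i < s.length(); ) {
            // An unpaired surrogate comes back as its own value and is rejected by isLegal.
            int cp = s.codePointAt(i);
            if (!isLegal(cp)) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }
}
```

Calling XmlChars.firstIllegalIndex on the string from the question points at index 1, the NUL between A and B.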

However, a more interesting test would be to see what happens when you try to generate XML containing an illegal character using other XML stacks.


is there some elegant and global solution?

If the problem you are trying to solve is how to send a \u0000 or \u000B, then you need to apply some application-specific encoding to the String before you insert it into the DOM. And the other end needs to deploy the equivalent decoding.
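One possible shape for such an application-specific encoding is a reversible backslash escape. This is only a sketch under assumptions: the class name NulSafeCodec and the escape format are invented here, and both ends would have to agree on them. Characters that XML 1.0 forbids are written as a backslash, the letter u and four hex digits, and literal backslashes are doubled so that decoding stays unambiguous.

```java
// Hypothetical codec; the escape format is an assumption, not a standard.
final class NulSafeCodec {

    /** Replaces every UTF-16 unit that XML 1.0 forbids with a backslash-u hex escape. */
    static String encode(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\\') {
                out.append("\\\\"); // double the escape character itself
            } else if (c == 0x9 || c == 0xA || c == 0xD
                    || (c >= 0x20 && c <= 0xD7FF) || (c >= 0xE000 && c <= 0xFFFD)) {
                out.append(c);
            } else {
                // Control characters, 0xFFFE, 0xFFFF; surrogates are escaped
                // unit by unit for simplicity, which still round-trips.
                out.append(String.format("\\u%04X", (int) c));
            }
        }
        return out.toString();
    }

    /** Reverses encode(); throws IllegalArgumentException on malformed input. */
    static String decode(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c != '\\') {
                out.append(c);
                continue;
            }
            if (i + 1 >= s.length()) {
                throw new IllegalArgumentException("dangling backslash at index " + i);
            }
            char next = s.charAt(i + 1);
            if (next == '\\') {
                out.append('\\');
                i += 1;
            } else if (next == 'u' && i + 5 < s.length()) {
                out.append((char) Integer.parseInt(s.substring(i + 2, i + 6), 16));
                i += 5;
            } else {
                throw new IllegalArgumentException("bad escape at index " + i);
            }
        }
        return out.toString();
    }
}
```

The sender would call NulSafeCodec.encode before setting the value on the object to be marshalled, and the receiver would call NulSafeCodec.decode after unmarshalling. Base64 would work as well, at the cost of making the whole value unreadable in the XML.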

If the problem you are trying to solve is how to detect the bad data before it is too late, you could do this with an output stream filter between the serializer and the final output stream. But if you detect the badness, there is no good (i.e. transparent to the XML consumer) way to fix it.
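A fail-fast filter of that kind could look roughly like the sketch below. The class name XmlControlCharGuard is hypothetical, and the check assumes an ASCII-compatible output encoding such as UTF-8, where a byte below 0x20 can only be the control character itself. It does not catch U+FFFE, U+FFFF or escaped references such as &#0;.

```java
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical guard stream; assumes UTF-8 (or another ASCII-compatible) output.
final class XmlControlCharGuard extends FilterOutputStream {

    private long offset; // number of bytes passed through so far

    XmlControlCharGuard(OutputStream out) {
        super(out);
    }

    /** Throws if the byte is a control character that XML 1.0 forbids. */
    private static void check(int value, long position) throws IOException {
        if (value < 0x20 && value != 0x09 && value != 0x0A && value != 0x0D) {
            throw new IOException(String.format(
                    "illegal XML 1.0 control byte 0x%02X at offset %d", value, position));
        }
    }

    @Override
    public void write(int b) throws IOException {
        check(b & 0xFF, offset);
        out.write(b);
        offset++;
    }

    @Override
    public void write(byte[] buf, int off, int len) throws IOException {
        // Validate the whole chunk first so nothing from a bad chunk reaches the sink.
        for (int i = 0; i < len; i++) {
            check(buf[off + i] & 0xFF, offset + i);
        }
        out.write(buf, off, len);
        offset += len;
    }
}
```

Wrapping the target stream, for example marshaller.marshal(root, new XmlControlCharGuard(os)), should then make marshalling fail at the first offending byte instead of producing a document that cannot be read back.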

Stephen C
  • The serializer has to check every character anyway to see whether it must be escaped (e.g. '<', '&'), so adding an additional (configurable) check for the null character should not have much impact on performance. – jarnbjo Oct 08 '12 at 10:55
  • Read point 1 of my answer. Ask the implementors! – Stephen C Oct 08 '12 at 10:57
  • Thank you for your thorough reply. I can't believe performance was an issue, but I agree that's hard to answer. However, I can't agree with closing the question as not constructive. I am not only asking *why?* (I thought there might be some documented reason for that, in which case the answer could be very constructive) but also how to work around this behaviour or solve the problem. Thanks anyway. – Tomasz Nurkiewicz Oct 08 '12 at 17:48
  • My reason for voting to close is questions like these: *"why JAXB/DOM API allows creating invalid XML documents which it can not read back? Shouldn't it fail fast during marshalling?"* and *"But shouldn't mature XML stack in Java (I'm using 1.7.0_05) handle this either by default or by having some simple setting?"*. These are patently not objectively answerable ... and (IMO) an invitation for non-constructive debate. – Stephen C Oct 09 '12 at 07:33