I have a super simple XML document encoded in UTF-16 LE.

    <?xml version="1.0" encoding="utf-16"?><X id="1" />

I'm loading it like this (using jcabi-xml):

    BOMInputStream bomIn = new BOMInputStream(Main.class.getResourceAsStream("resources/test.xml"), ByteOrderMark.UTF_16LE);
    String firstNonBomCharacter = Character.toString((char)bomIn.read());
    Reader reader = new InputStreamReader(bomIn, "UTF-16");
    String xmlString = IOUtils.toString(reader);
    xmlString = xmlString.trim();
    xmlString = firstNonBomCharacter + xmlString;
    bomIn.close();
    reader.close();
    final XML xml = new XMLDocument(xmlString);

I have checked that there are no extra BOM/junk symbols (leading or anywhere) by saving out the file and inspecting it with a hex editor. The XML is properly formatted.

However, I still get the following error:

    [Fatal Error] :1:40: Content is not allowed in prolog.
    Exception in thread "main" java.lang.IllegalArgumentException: Invalid XML: "<?xml version="1.0" encoding="utf-16"?><X id="1" />"
        at com.jcabi.xml.DomParser.document(DomParser.java:115)
        at com.jcabi.xml.XMLDocument.<init>(XMLDocument.java:155)
        at Main.getTransformedString(Main.java:47)
        at Main.main(Main.java:26)
    Caused by: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 40; Content is not allowed in prolog.
        at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(Unknown Source)
        at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(Unknown Source)
        at javax.xml.parsers.DocumentBuilder.parse(Unknown Source)
        at com.jcabi.xml.DomParser.document(DomParser.java:105)
        ... 3 more

I have googled up and down for this error, but every result says that it's the BOM's fault, which I have confirmed (to the best of my knowledge) is not the case. What else could be wrong?

idlackage
  • If the file is in UTF-16, then shouldn't the BOM be reserving the first two bytes of the file? What if you add another `bomIn.read();` for discarding the second byte? – Mick Mnemonic Mar 17 '16 at 20:47
  • Actually, now that I had a look at the JavaDocs for [`BOMInputStream`](https://commons.apache.org/proper/commons-io/apidocs/org/apache/commons/io/input/BOMInputStream.html), you should _remove_ the `bomIn.read()` call altogether because the stream discards the BOM for you. – Mick Mnemonic Mar 17 '16 at 20:53
  • @MickMnemonic That's what I thought too, but when I don't call `bomIn.read()` my string turns into something made of nothing but question marks. Truthfully I'm not too sure exactly how to use `BOMInputStream` but this answer (http://stackoverflow.com/questions/1835430/byte-order-mark-screws-up-file-reading-in-java) writes that calling `read` skips to the first non-bom character (which I forgot to include in my sample code). – idlackage Mar 17 '16 at 20:57
  • If you're consuming the BOM, then the `InputStreamReader` should be told about the endianness: `Reader reader = new InputStreamReader(bomIn, StandardCharsets.UTF_16LE);` – Mick Mnemonic Mar 17 '16 at 21:06
  • @MickMnemonic Now that allows things to work without calling `bomIn.read()`, thanks! However the actual error itself persists. – idlackage Mar 17 '16 at 21:10
  • ..and `String xmlString = IOUtils.toString(reader, StandardCharsets.UTF_16LE);` – Mick Mnemonic Mar 17 '16 at 21:18
  • Are you sure that your file is truly encoded in utf-16? Just declaring it so in the XML declaration won't work if the file isn't truly utf-16. Also, anything that accidentally consumes a character or two from the XML declaration before parsing begins will result in the "Content not allowed in prolog" error message because the clobbered XML declaration will be considered unrecognized content in the prolog. – kjhughes Mar 17 '16 at 21:21
  • @MickMnemonic Passing in a `Charset` doesn't seem to work as a parameter for `toString`. – idlackage Mar 17 '16 at 21:23
  • @kjhughes Upon checking in Notepad++, it says it's encoded in "UCS-2 LE BOM", which equals UTF-16? Or have I been dreaming? – idlackage Mar 17 '16 at 21:24
  • Try `String xmlString = IOUtils.toString(reader,"UTF-16LE");` – Mick Mnemonic Mar 17 '16 at 21:30
  • Interesting question. I thought the XML parser usually takes care of these underlying BOM issues, so you don't really have to worry about them? – vtd-xml-author Mar 17 '16 at 22:09

1 Answer

The following works for me:

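    // Needs: java.io.InputStream, javax.xml.transform.stream.StreamSource,
    //        com.jcabi.xml.XML, com.jcabi.xml.XMLDocument
    try (InputStream stream = Test.class.getResourceAsStream("/Test.xml")) {
        // Pass the raw bytes on, so the parser can detect the encoding itself.
        StreamSource source = new StreamSource(stream);
        final XML xml = new XMLDocument(source);
    }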

With the input file's hex dump:

FF FE 3C 00 3F 00 78 00 6D 00 6C 00 20 00 76 00 65 00 72 00 73 00 69 00  
6F 00 6E 00 3D 00 27 00 31 00 2E 00 30 00 27 00 20 00 65 00 6E 00 63 00 
6F 00 64 00 69 00 6E 00 67 00 3D 00 27 00 55 00 54 00 46 00 2D 00 31 00 
36 00 27 00 3F 00 3E 00 3C 00 58 00 20 00 69 00 64 00 3D 00 22 00 31 00 
22 00 2F 00 3E 00

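As far as I can tell, your example converts the contents of the file to a string. That is problematic, because the original encoding is thrown away as soon as the bytes are decoded into a string. When the string is converted back to a byte array for the SAX parser, it is apparently encoded as UTF-8, but the XML declaration still states UTF-16, so the parser decodes everything after the declaration with the wrong encoding. That would be consistent with the error being reported at column 40, immediately after the 39-character declaration.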
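To illustrate the mismatch, here is a minimal sketch that uses plain JAXP instead of jcabi-xml, purely so it runs without extra dependencies. The class and method names are made up for this example, and the assumption is that the string ends up as UTF-8 bytes somewhere before parsing. It also shows that the same string can be parsed when it is handed over as characters through a `Reader`, because the parser then does not have to decode any bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class StringRoundTripDemo {
    // The document from the question, already decoded into a Java String.
    static final String XML =
        "<?xml version=\"1.0\" encoding=\"utf-16\"?><X id=\"1\" />";

    static DocumentBuilder newBuilder() throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder();
    }

    // Re-encodes the string as UTF-8 bytes while the declaration still says
    // utf-16 (the assumed cause of the error). Returns true if parsing fails.
    static boolean utf8BytesAreRejected() throws Exception {
        byte[] utf8 = XML.getBytes(StandardCharsets.UTF_8);
        try {
            newBuilder().parse(new ByteArrayInputStream(utf8));
            return false;
        } catch (SAXException | IOException e) {
            return true;
        }
    }

    // Hands the parser characters instead of bytes, so there is nothing
    // left to decode and the declared encoding cannot conflict.
    static Document parseFromReader() throws Exception {
        return newBuilder().parse(new InputSource(new StringReader(XML)));
    }

    public static void main(String[] args) throws Exception {
        System.out.println("UTF-8 bytes rejected: " + utf8BytesAreRejected());
        System.out.println("Root element via Reader: "
            + parseFromReader().getDocumentElement().getNodeName());
    }
}
```

Whether jcabi-xml offers a `Reader`-based entry point is not something I have checked, so treat the second method only as a demonstration that the string itself is fine.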

With the StreamSource, on the other hand, the parser reads the raw bytes and automatically detects from the BOM that the file is encoded in UTF-16 LE.
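The BOM detection can be tried without a file on disk. The following is a rough sketch, again with plain JAXP and with hypothetical names (`BomDetectionDemo`, `utf16LeWithBom`), that builds the same byte layout as the hex dump above in memory and hands the untouched bytes to the parser.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class BomDetectionDemo {
    // Produces FF FE followed by the text encoded as UTF-16 LE,
    // the same layout as the hex dump of the input file.
    static byte[] utf16LeWithBom(String text) {
        byte[] body = text.getBytes(StandardCharsets.UTF_16LE);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(0xFF);
        out.write(0xFE);
        out.write(body, 0, body.length);
        return out.toByteArray();
    }

    // Parses the raw bytes; no Reader and no String in between,
    // so the parser sees the BOM and picks the decoder itself.
    static Document parse(byte[] bytes) throws Exception {
        try (InputStream in = new ByteArrayInputStream(bytes)) {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(in);
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] bytes = utf16LeWithBom(
            "<?xml version='1.0' encoding='UTF-16'?><X id=\"1\"/>");
        Document doc = parse(bytes);
        System.out.println(doc.getDocumentElement().getAttribute("id"));
    }
}
```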

If you are not using Java 7 or later and cannot use try-with-resources, then close the stream explicitly with stream.close() as before.
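For the pre-Java 7 case, a try/finally block gives the same guarantee that the stream is closed even if parsing throws. This is only a sketch: it uses plain JAXP so that it is self-contained, and `ExplicitCloseDemo` and `rootName` are names invented for the example. With jcabi-xml, the body of the `try` would be the `StreamSource` and `XMLDocument` lines instead.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class ExplicitCloseDemo {
    // Parses the stream and returns the root element name.
    // The finally block closes the stream whether or not parsing succeeds.
    static String rootName(InputStream stream) throws Exception {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(stream);
            return doc.getDocumentElement().getNodeName();
        } finally {
            stream.close();
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] bytes = "<X id=\"1\"/>".getBytes("UTF-8");
        System.out.println(rootName(new ByteArrayInputStream(bytes)));
    }
}
```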

Aaryn Tonita