
I went through a few posts, like one explaining that FileReader reads the file as a character stream and another saying the BOM can be treated as whitespace if the document is handed over as a stream of characters; the answers there say the problem is that the input source is actually a char stream, not a byte stream.

However, the suggested solution from [1] does not seem to work for UTF-16LE. Although I use this code:

    try (final InputStream is = Files.newInputStream(filename.toPath(), StandardOpenOption.READ)) {
      DOMParser parser = new org.apache.xerces.parsers.DOMParser();
      parser.parse(new InputSource(is));
      return parser.getDocument();
    } catch (final SAXParseException saxEx) {
      LOG.debug("Unable to open [{}}] as InputSource.", absolutePath, saxEx);
    }

I still get `org.xml.sax.SAXParseException: Content is not allowed in prolog.`

I looked at Files.newInputStream, and it indeed uses a ChannelInputStream, which hands over bytes, not chars. I also tried setting the encoding on the InputSource object, but with no luck. I also checked that there are no extra chars (except the BOM) before the <?xml part.
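
For reference, the setEncoding attempt was roughly of this shape (a sketch, assuming the same filename variable and Xerces DOMParser setup as above; setEncoding is the standard org.xml.sax.InputSource method):

    try (final InputStream is = Files.newInputStream(filename.toPath(), StandardOpenOption.READ)) {
      final InputSource inputSource = new InputSource(is);
      // Hint the expected encoding; with a byte stream the parser would
      // otherwise try to detect UTF-16 from the BOM on its own.
      inputSource.setEncoding("UTF-16LE");
      DOMParser parser = new org.apache.xerces.parsers.DOMParser();
      parser.parse(inputSource);
      return parser.getDocument();
    }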

I also want to mention that this code works just fine with UTF-8.

// Edit: I also tried DocumentBuilderFactory.newInstance().newDocumentBuilder().parse() and XmlInputStreamReader.next(), with the same results.
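
A sketch of the DocumentBuilderFactory variant (assuming the same InputStream as above; the factory settings are illustrative):

    try (final InputStream is = Files.newInputStream(filename.toPath(), StandardOpenOption.READ)) {
      final DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
      dbf.setNamespaceAware(true);
      // Like the Xerces DOMParser, this hands the parser raw bytes, so it can
      // pick up the encoding from the BOM / XML declaration itself.
      return dbf.newDocumentBuilder().parse(is);
    }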

// Edit 2: Tried using a buffered reader. Same results: Unexpected character '뿯' (code 49135 / 0xbfef) in prolog; expected '<'
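
A sketch of what such a reader-based attempt might look like (assumed, not the exact code from the edit; note that once a Reader is involved, the BOM bytes are decoded to the character U+FEFF before the parser ever sees them):

    try (final Reader reader = new BufferedReader(new InputStreamReader(
            Files.newInputStream(filename.toPath(), StandardOpenOption.READ), StandardCharsets.UTF_16LE))) {
      // The parser now receives characters instead of raw bytes, so it can no
      // longer use the BOM for its own encoding detection.
      DOMParser parser = new org.apache.xerces.parsers.DOMParser();
      parser.parse(new InputSource(reader));
      return parser.getDocument();
    }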

Thanks in advance.

Benjamin Marwell
  • What if you remove the BOM at the beginning (skipping the first two bytes)? `... { is.read(); is.read();` – Joop Eggen Sep 10 '19 at 12:53
  • Then I wouldn't be able to read UTF-8 without a BOM, or ISO-8859-1. :( – Benjamin Marwell Sep 10 '19 at 13:00
  • The encoding is given, or defaults to UTF-8, in the XML declaration. I have heard that in rare cases a BOM gave such a problem. But I do not remember specifics. – Joop Eggen Sep 10 '19 at 13:03
  • I cannot get even this far. I want to read that tag and attribute you refer to. But see my 2nd edit, it stops before that. – Benjamin Marwell Sep 10 '19 at 13:05
  • I double checked. The file starts with the BOM 0xFF 0xFE. Maybe I need to wrap it into a BOMRemovingInputStream… – Benjamin Marwell Sep 10 '19 at 13:21
  • Shouldn't you set the character encoding on the input source before parsing (`inputSource.setEncoding("UTF-16");`) so it uses the BOM to determine whether it's LE or BE? – Alexander Bollaert Sep 10 '19 at 13:44
  • Yes and no. The XML libraries are mature enough to get this right nowadays. The correct answer was "Maven resource filtering". Since Maven is set to UTF-8, my tests failed when reading from the target/classes folder, but my input (from the src/test/resources folder) was okay. – Benjamin Marwell Sep 11 '19 at 13:51

1 Answer


To get a bit further, some info gathering:

    byte[] bytes = Files.readAllBytes(filename.toPath());
    String xml = new String(bytes, StandardCharsets.UTF_16LE);
    if (xml.startsWith("\uFEFF")) {
        LOG.info("Has BOM and is evidently UTF_16LE");
        xml = xml.substring(1);
    }
    if (!xml.contains("<?xml")) {
        LOG.info("Has no XML declaration");
    }
    // Extract the declared encoding from the XML declaration; default to UTF-8.
    String declaredEncoding = "UTF-8";
    Matcher matcher = Pattern.compile("<\\?xml[^>]*encoding=[\"']([^\"']+)[\"']").matcher(xml);
    if (matcher.find()) {
        declaredEncoding = matcher.group(1);
    }
    LOG.info("Declared as " + declaredEncoding);

    // Re-encode the BOM-free text so the bytes agree with the declared encoding.
    try (final InputStream is = new ByteArrayInputStream(xml.getBytes(Charset.forName(declaredEncoding)))) {
        DOMParser parser = new org.apache.xerces.parsers.DOMParser();
        parser.parse(new InputSource(is));
        return parser.getDocument();
    } catch (final SAXParseException saxEx) {
        LOG.debug("Unable to open [{}] as InputSource.", absolutePath, saxEx);
    }
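
If the encoding is not known to be UTF-16LE up front, the hard-coded StandardCharsets.UTF_16LE could be swapped for a small BOM sniff over the first bytes (a sketch; it only checks the standard BOM patterns and falls back to UTF-8 otherwise):

    static Charset sniffBomCharset(final byte[] bytes) {
        // FF FE -> UTF-16LE, FE FF -> UTF-16BE, EF BB BF -> UTF-8 with BOM.
        if (bytes.length >= 2 && (bytes[0] & 0xFF) == 0xFF && (bytes[1] & 0xFF) == 0xFE) {
            return StandardCharsets.UTF_16LE;
        }
        if (bytes.length >= 2 && (bytes[0] & 0xFF) == 0xFE && (bytes[1] & 0xFF) == 0xFF) {
            return StandardCharsets.UTF_16BE;
        }
        if (bytes.length >= 3 && (bytes[0] & 0xFF) == 0xEF
                && (bytes[1] & 0xFF) == 0xBB && (bytes[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        // No BOM: fall back to UTF-8 (or whatever the XML declaration says).
        return StandardCharsets.UTF_8;
    }

Usage would then be `Charset charset = sniffBomCharset(bytes); String xml = new String(bytes, charset);` before the BOM-stripping check above.
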
Joop Eggen
  • I looked into the file with `xxd` and I know it is \uFFFE at the beginning. – Benjamin Marwell Sep 10 '19 at 13:23
  • Bytes FF FE are in UTF-16LE actually the char `\uFEFF`, aka the BOM (a bit weird Unicode number). – Joop Eggen Sep 10 '19 at 13:24
  • Hey, an aside: as the BOM in UTF-8 is the bytes `EF BB BF`, that would somewhat explain the 0xbfef from your edit 2. – Joop Eggen Sep 10 '19 at 13:29
  • Both IntelliJ and `file` show this is a UTF-16LE file. UTF-16LE starts with `\uFFFE`. `\uFFFE` is not weird; according to Wikipedia it is the "no character" character. The 2nd edit used UTF-8 interpretation by accident; no change when I put UTF-16LE back in again :( – Benjamin Marwell Sep 10 '19 at 13:38
  • Well I could always use https://github.com/gpakosz/UnicodeBOMInputStream – Benjamin Marwell Sep 10 '19 at 13:42
  • (Bytes FF FE are, in UTF-16 little endian, the number 0xfeff, hence U+FEFF, or in Java `\uFEFF`.) The mere point I wanted to make: (1) the BOM sometimes seems to be problematic, maybe only for UTF-16LE. (2) The declared encoding must agree with the bytes for XML parsing. – Joop Eggen Sep 10 '19 at 13:48
  • (takes a deeeeeep breath) … MAVEN RESOURCE FILTERING. I was looking at the source all along, but when looking at the target you could see some extra bytes at the beginning of the file. Very sorry to see this :( – Benjamin Marwell Sep 10 '19 at 14:09