
I went through a few posts, like one explaining that FileReader reads the file as a character stream and another saying the BOM can be treated as whitespace if the document is handed over as a stream of characters; the answers there say the problem is that the input source is actually a char stream, not a byte stream.

However, the suggested solution from [1] does not seem to work for UTF-16LE. Although I use this code:

    try (final InputStream is = Files.newInputStream(filename.toPath(), StandardOpenOption.READ)) {
      DOMParser parser = new org.apache.xerces.parsers.DOMParser();
      parser.parse(new InputSource(is));
      return parser.getDocument();
    } catch (final SAXParseException saxEx) {
      LOG.debug("Unable to open [{}}] as InputSource.", absolutePath, saxEx);
    }

I still get `org.xml.sax.SAXParseException: Content is not allowed in prolog.`

I looked at Files.newInputStream, and it indeed uses a ChannelInputStream, which hands over bytes, not chars. I also tried setting the encoding on the InputSource object, but with no luck. I also checked that there are no extra chars (except the BOM) before the <?xml part.
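
For reference, the setEncoding attempt was roughly of this shape (a sketch, assuming the same filename variable and Xerces DOMParser setup as above; setEncoding is the standard org.xml.sax.InputSource method):

    try (final InputStream is = Files.newInputStream(filename.toPath(), StandardOpenOption.READ)) {
      final InputSource inputSource = new InputSource(is);
      // Hint the expected encoding; with a byte stream the parser would
      // otherwise try to detect UTF-16 from the BOM on its own.
      inputSource.setEncoding("UTF-16LE");
      DOMParser parser = new org.apache.xerces.parsers.DOMParser();
      parser.parse(inputSource);
      return parser.getDocument();
    }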

I also want to mention that this code works just fine with UTF-8.

// Edit: I also tried DocumentBuilderFactory.newInstance().newDocumentBuilder().parse() and XmlInputStreamReader.next(), with the same results.
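
A sketch of the DocumentBuilderFactory variant (assuming the same InputStream as above; the factory settings are illustrative):

    try (final InputStream is = Files.newInputStream(filename.toPath(), StandardOpenOption.READ)) {
      final DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
      dbf.setNamespaceAware(true);
      // Like the Xerces DOMParser, this hands the parser raw bytes, so it can
      // pick up the encoding from the BOM / XML declaration itself.
      return dbf.newDocumentBuilder().parse(is);
    }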

// Edit 2: Tried using a buffered reader. Same results: Unexpected character '뿯' (code 49135 / 0xbfef) in prolog; expected '<'
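
A sketch of what such a reader-based attempt might look like (assumed, not the exact code from the edit; note that once a Reader is involved, the BOM bytes are decoded to the character U+FEFF before the parser ever sees them):

    try (final Reader reader = new BufferedReader(new InputStreamReader(
            Files.newInputStream(filename.toPath(), StandardOpenOption.READ), StandardCharsets.UTF_16LE))) {
      // The parser now receives characters instead of raw bytes, so it can no
      // longer use the BOM for its own encoding detection.
      DOMParser parser = new org.apache.xerces.parsers.DOMParser();
      parser.parse(new InputSource(reader));
      return parser.getDocument();
    }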

Thanks in advance.

Benjamin Marwell
  • What if you remove the BOM at the beginning (skipping the first two bytes)? `... { is.read(); is.read();` – Joop Eggen Sep 10 '19 at 12:53
  • Then I wouldn't be able to read UTF-8 without a BOM, or ISO-8859-1. :( – Benjamin Marwell Sep 10 '19 at 13:00
  • The encoding is given, or defaults to UTF-8, in the XML declaration. I have heard that in rare cases a BOM gave such a problem. But I do not remember specifics. – Joop Eggen Sep 10 '19 at 13:03
  • I cannot get even this far. I want to read that tag and attribute you refer to. But see my 2nd edit, it stops before that. – Benjamin Marwell Sep 10 '19 at 13:05
  • I double checked. The file starts with the BOM 0xFF 0xFE. Maybe I need to wrap it into a BOMRemovingInputStream… – Benjamin Marwell Sep 10 '19 at 13:21
  • Shouldn't you set the character encoding on the input source before parsing (`inputSource.setEncoding("UTF-16");`) so it uses the BOM to determine whether it's LE or BE? – Alexander Bollaert Sep 10 '19 at 13:44
  • Yes and no. The XML libraries are mature enough to get this right nowadays. The correct answer was "Maven resource filtering". Since Maven is set to UTF-8, my tests failed when reading from the target/classes folder, but my input (from the src/test/resources folder) was okay. – Benjamin Marwell Sep 11 '19 at 13:51

1 Answer


To get a bit further, some info gathering:

    byte[] bytes = Files.readAllBytes(filename.toPath());
    String xml = new String(bytes, StandardCharsets.UTF_16LE);
    if (xml.startsWith("\uFEFF")) {
        LOG.info("Has BOM and is evidently UTF_16LE");
        xml = xml.substring(1);
    }
    if (!xml.contains("<?xml")) {
        LOG.info("Has no XML declaration");
    }
    // Extract the declared encoding from the XML declaration; default to UTF-8.
    String declaredEncoding = "UTF-8";
    Matcher matcher = Pattern.compile("<\\?xml[^>]*encoding=[\"']([^\"']+)[\"']").matcher(xml);
    if (matcher.find()) {
        declaredEncoding = matcher.group(1);
    }
    LOG.info("Declared as " + declaredEncoding);

    // Re-encode the BOM-free text so the bytes agree with the declared encoding.
    try (final InputStream is = new ByteArrayInputStream(xml.getBytes(Charset.forName(declaredEncoding)))) {
        DOMParser parser = new org.apache.xerces.parsers.DOMParser();
        parser.parse(new InputSource(is));
        return parser.getDocument();
    } catch (final SAXParseException saxEx) {
        LOG.debug("Unable to open [{}] as InputSource.", absolutePath, saxEx);
    }
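
If the encoding is not known to be UTF-16LE up front, the hard-coded StandardCharsets.UTF_16LE could be swapped for a small BOM sniff over the first bytes (a sketch; it only checks the standard BOM patterns and falls back to UTF-8 otherwise):

    static Charset sniffBomCharset(final byte[] bytes) {
        // FF FE -> UTF-16LE, FE FF -> UTF-16BE, EF BB BF -> UTF-8 with BOM.
        if (bytes.length >= 2 && (bytes[0] & 0xFF) == 0xFF && (bytes[1] & 0xFF) == 0xFE) {
            return StandardCharsets.UTF_16LE;
        }
        if (bytes.length >= 2 && (bytes[0] & 0xFF) == 0xFE && (bytes[1] & 0xFF) == 0xFF) {
            return StandardCharsets.UTF_16BE;
        }
        if (bytes.length >= 3 && (bytes[0] & 0xFF) == 0xEF
                && (bytes[1] & 0xFF) == 0xBB && (bytes[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        // No BOM: fall back to UTF-8 (or whatever the XML declaration says).
        return StandardCharsets.UTF_8;
    }

Usage would then be `Charset charset = sniffBomCharset(bytes); String xml = new String(bytes, charset);` before the BOM-stripping check above.
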
Joop Eggen
  • I looked into the file with `xxd` and I know it is \uFFFE at the beginning. – Benjamin Marwell Sep 10 '19 at 13:23
  • Bytes FF FE are in UTF-16LE actually the char `\uFEFF`, aka the BOM (a bit weird Unicode number). – Joop Eggen Sep 10 '19 at 13:24
  • Hey, an aside: as the BOM in UTF-8 is the bytes `EF BB BF`, that would somewhat explain the 0xbfef from your edit 2. – Joop Eggen Sep 10 '19 at 13:29
  • Both IntelliJ and `file` show this is a UTF-16LE file. UTF-16LE starts with `\uFFFE`. `\uFFFE` is not weird; according to Wikipedia it is the "no character" character. The 2nd edit used UTF-8 interpretation by accident; no change when I put UTF-16LE back in again :( – Benjamin Marwell Sep 10 '19 at 13:38
  • Well I could always use https://github.com/gpakosz/UnicodeBOMInputStream – Benjamin Marwell Sep 10 '19 at 13:42
  • (Bytes FF FE are, in UTF-16 little endian, the number 0xfeff, hence U+FEFF, or in Java `\uFEFF`.) The mere point I wanted to make: (1) the BOM sometimes seems to be problematic, maybe only for UTF-16LE. (2) The declared encoding must agree with the bytes for XML parsing. – Joop Eggen Sep 10 '19 at 13:48
  • (takes a deeeeeep breath) … MAVEN RESOURCE FILTERING. I was looking at the source all along, but when looking at the target you could see some extra bytes at the beginning of the file. Very sorry to see this :( – Benjamin Marwell Sep 10 '19 at 14:09