I found this JCabi code snippet, which works well with UTF-8 encoded XML files. It reads the XML file and then prints it as a string.

            XML xml;
            try {
                xml = new XMLDocument(new File("test8.xml"));
                String xmlString = xml.toString();
                System.out.println(xmlString);
            } catch (FileNotFoundException e1) {
                e1.printStackTrace();
            }

However, when I run this same code on a UTF-16 encoded XML file, it gives me the following error:

[Fatal Error] :1:1: Content is not allowed in prolog.

Exception in thread "AWT-EventQueue-0" java.lang.IllegalArgumentException: Can't parse, most probably the XML is invalid

Caused by: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.

I have read about this error. It means the parser does not recognize the prolog: because of the encoding, it sees characters that are not supposed to be there.

I have tried other libraries that offer a way to tell the class which encoding the source file uses, but JCabi is the only one I got working to some degree, and I could not find a way to tell it that my source file is encoded in UTF-16.

Thanks, any help is appreciated.

  • Whatever program was used to create the UTF-16 file seems to have added a [BOM](https://en.wikipedia.org/wiki/Byte_order_mark) (Byte Order Mark) at the beginning. XML doesn't allow for that, so remove the BOM (first 2 bytes). A text editor that understands UTF-16 should have options for saving the file without the BOM, but might default to saving the file with a BOM. --- Make sure the prolog specifies the correct encoding, e.g. that the XML file starts with ``. – Andreas May 26 '21 at 01:49
  • Hi, I have checked this and the file starts just like that, and it's saved in Notepad as UTF-16. If I take this same file and save it with Notepad as UTF-8, it works. – Gilberto Melo Jr May 26 '21 at 15:12

1 Answer

The jcabi XMLDocument has various constructors including one which takes a string. So one approach is to use:

    Path path = Paths.get("test16_LE_with_bom.xml");
    XML xml = new XMLDocument(Files.readString(path, StandardCharsets.UTF_16LE));
    String xmlString = xml.toString();
    System.out.println(xmlString);

This makes use of java.nio.charset.StandardCharsets and java.nio.file.Files.

In my first test, my XML file was encoded as UTF-16-LE (and with a BOM at the start: FF FE for little-endian). The above approach handled the BOM OK.

My test file's prolog is as follows (with no explicit encoding - maybe that's a bad thing, here?):

    <?xml version="1.0"?>

In my second test I removed the BOM and re-ran with the updated file - which also worked.

I used Notepad++ and a hex editor to verify/select encodings & to edit the test files.
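
If a hex editor is not at hand, the first few bytes can also be checked from Java. The following is only a sketch: the class name `BomSniffer` and the method `describe` are made up for illustration, and it only looks for the UTF-8 and UTF-16 marks (it would misreport a UTF-32LE file, whose BOM also begins with FF FE).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

public class BomSniffer {

    /** Names the byte order mark found at the start of the given bytes, if any. */
    public static String describe(byte[] head) {
        if (head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            return "UTF-8 BOM";
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            return "UTF-16LE BOM";
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            return "UTF-16BE BOM";
        }
        return "no BOM";
    }

    public static void main(String[] args) throws IOException {
        byte[] head;
        if (args.length > 0) {
            // Read the file named on the command line and keep its first bytes.
            byte[] all = Files.readAllBytes(Paths.get(args[0]));
            head = Arrays.copyOf(all, Math.min(all.length, 3));
        } else {
            // No file given: sample bytes as found at the start of a
            // little-endian UTF-16 file with a BOM (FF FE, then '<').
            head = new byte[] {(byte) 0xFF, (byte) 0xFE, 0x3C, 0x00};
        }
        System.out.println(describe(head));
    }
}
```

Running it with a file name as the only argument prints which mark, if any, the file starts with.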

Your file may be different from my test files (BE vs. LE).
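
If it is not known in advance whether the file is big- or little-endian, one option is to decode with `StandardCharsets.UTF_16` instead. That charset uses a leading BOM to pick the byte order (and drops the BOM from the result), and falls back to big-endian when there is no BOM. A minimal sketch, where the class and method names are hypothetical and the sample bytes stand in for a real file:

```java
import java.nio.charset.StandardCharsets;

public class Utf16Decode {

    /** Decodes UTF-16 bytes, letting a leading BOM (if present) choose the byte order. */
    public static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        // "<a/>" as little-endian UTF-16 with a BOM (FF FE).
        byte[] littleEndian = {
            (byte) 0xFF, (byte) 0xFE, 0x3C, 0x00, 0x61, 0x00, 0x2F, 0x00, 0x3E, 0x00
        };
        // "<a/>" as big-endian UTF-16 with a BOM (FE FF).
        byte[] bigEndian = {
            (byte) 0xFE, (byte) 0xFF, 0x00, 0x3C, 0x00, 0x61, 0x00, 0x2F, 0x00, 0x3E
        };
        System.out.println(decode(littleEndian)); // prints <a/>
        System.out.println(decode(bigEndian));    // prints <a/>
    }
}
```

With a real file, the bytes would come from `Files.readAllBytes(path)`, and the decoded string could then be handed to the `XMLDocument` string constructor. A little-endian file without a BOM would still need `UTF_16LE` spelled out.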

andrewJames
  • I tried using this `Files.readString(path, StandardCharsets.UTF_16LE)` but it seems it's only supported in Java 11, I'll upgrade my Java so that I can try that. Thanks – Gilberto Melo Jr May 26 '21 at 19:36
  • If you cannot upgrade, there are alternatives for Java 8 (and even Java 7) [here](https://mkyong.com/java/java-convert-file-to-string/) - as well as 3rd party library options. – andrewJames May 26 '21 at 19:47
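
For reference, one Java 8 way to get the same result as `Files.readString` is to read the raw bytes and decode them explicitly. This is a sketch rather than a drop-in replacement: the class `ReadStringJava8` and its methods are hypothetical names, and the demo writes a small temporary UTF-16LE file so that it can run on its own.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ReadStringJava8 {

    /** Reads the whole file and decodes it with the given charset (works on Java 7 and 8). */
    public static String readString(Path path, Charset charset) throws IOException {
        return new String(Files.readAllBytes(path), charset);
    }

    /** Writes a tiny UTF-16LE file (no BOM), reads it back, and returns the decoded text. */
    public static String demo() {
        try {
            Path tmp = Files.createTempFile("test16", ".xml");
            try {
                String xml = "<?xml version=\"1.0\"?><root/>";
                Files.write(tmp, xml.getBytes(StandardCharsets.UTF_16LE));
                return readString(tmp, StandardCharsets.UTF_16LE);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In the answer's snippet, `readString(path, StandardCharsets.UTF_16LE)` would take the place of the `Files.readString(...)` call.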