I am parsing an XML document in UTF-8 encoding with Java using VTD-XML.

A small excerpt looks like:

<literal></literal>
<literal></literal>
<literal></literal>

I want to iterate through each literal and print it out to the console. However, what I get is:

¢

I am correctly navigating to each element. The way that I get the text value is by calling:

private static String toNormalizedString(String name, int val, final VTDNav vn) throws NavException {
    String strValue = null;
    if (val != -1) {
        strValue = vn.toNormalizedString(val);
    }
    return strValue;
}

I've also tried vn.getXPathStringVal(), but it yields the same result.

I know that each of the literals above isn't just a string of length one. Rather, they seem to be Unicode "characters" composed of two Java chars each. I am able to correctly parse and output kanji characters if their length is just one.
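To make the "length two" observation concrete, here is a minimal sketch of how Java reports such characters. The code points U+6F22 and U+2000B are illustrative choices only, not the literals from my file, and the class and method names are made up for this example.

```java
// Sketch: String.length() counts UTF-16 code units, not Unicode code points.
// The code points used here are illustrative, not taken from the XML file.
class SupplementaryLength {
    // Number of UTF-16 code units (what String.length() reports).
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points (what a reader sees as characters).
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String bmp = new String(Character.toChars(0x6F22));   // kanji inside the BMP
        String supp = new String(Character.toChars(0x2000B)); // kanji outside the BMP

        System.out.println(codeUnits(bmp) + " unit(s), " + codePoints(bmp) + " code point(s)");
        System.out.println(codeUnits(supp) + " unit(s), " + codePoints(supp) + " code point(s)");
        // The first char of a supplementary character is a high surrogate.
        System.out.println(Character.isHighSurrogate(supp.charAt(0)));
    }
}
```

A character outside the Basic Multilingual Plane is stored as a surrogate pair, so it has length two even though it is a single code point.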

My question is - how can I correctly parse and output these characters using VTD-XML? Is there a way to get the underlying bytes of the text between the literal tags so that I can parse the bytes myself?

EDIT

Code to process each line of the XML - converting it to a byte array and then back to a String.

try (BufferedReader br = new BufferedReader(new FileReader("res/sample.xml"))) {
    String line;
    while ((line = br.readLine()) != null) {
        byte[] myBytes = null;

        try {
            myBytes = line.getBytes("UTF-8");
        } catch (UnsupportedEncodingException e) {
            e.printStackTrace();
            System.exit(-1);
        }

        System.out.println(new String(myBytes));
    }
} catch (FileNotFoundException e) {
    e.printStackTrace();
} catch (IOException e) {
    e.printStackTrace();
}
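
One caveat about the round trip above: FileReader and new String(byte[]) both use the JVM's default charset (platform-dependent before Java 18), so a matching result does not by itself prove that the file is valid UTF-8. A stricter check might look like the sketch below, which decodes the raw bytes with a decoder that reports malformed input instead of silently replacing it. The class and method names are hypothetical, and the sample bytes are illustrative.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Sketch: strict UTF-8 validation of raw bytes (names are hypothetical).
class Utf8Check {
    // Returns true only if the whole array decodes as well-formed UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) throws java.io.IOException {
        if (args.length > 0) {
            // Validate a file given on the command line, e.g. res/sample.xml.
            System.out.println(isValidUtf8(Files.readAllBytes(Paths.get(args[0]))));
            return;
        }
        // Otherwise demonstrate on in-memory data: a supplementary character
        // encodes to four bytes; dropping the last byte makes it malformed.
        byte[] good = new String(Character.toChars(0x2000B)).getBytes(StandardCharsets.UTF_8);
        byte[] bad = java.util.Arrays.copyOf(good, good.length - 1);
        System.out.println(isValidUtf8(good));
        System.out.println(isValidUtf8(bad));
    }
}
```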
waylonion
  • Have you checked that the XML is correctly converted to `byte[]` representation of a UTF-8 string before you invoked `vtdGen.setDoc(doc)` (or `vtdGen.setDoc_BR(doc)`)? You can also check which encoding VTDNav has currently set via `vtdNav.getEncoding()` and check the returned int value against the constants defined in VTDNav – Roman Vottner Jul 05 '17 at 16:14
  • @RomanVottner - Yes, I did check that vtdNav.getEncoding() == VTDNav.Format_UTF8 is true. Could you elaborate on how I can check that the XML is correctly converted to a byte[] representation of the UTF-8 string? Thanks! – waylonion Jul 05 '17 at 16:18
  • Maybe [this SO post](https://stackoverflow.com/questions/6622226/check-if-a-string-is-valid-utf-8-encoded-in-java) is helpful. A poor man's solution could be to set a breakpoint right at `vtdGen.setDoc(...)` and then see what you get back when evaluating something along the lines of `new String(bytes);` inside your IDE while debugging. – Roman Vottner Jul 05 '17 at 16:27
  • @RomanVottner I am able to verify that the XML can be correctly converted to a byte[] representation of a UTF-8 string. I used a buffered reader to read the file line by line - converted it to a byte array and then converted it back to a String. The resulting String is the same as the original. – waylonion Jul 05 '17 at 17:17
  • Not sure if I can help any further on this issue. Maybe file a bug ticket (or question) on their issue tracker: https://sourceforge.net/p/vtd-xml/bugs/ – Roman Vottner Jul 05 '17 at 17:25
  • get 2.13_4. It supports supplementary chars. – vtd-xml-author Jul 19 '17 at 07:20
  • @Roman Vottner--get 2.13_4 it fixed the supplementary char issue. – vtd-xml-author Jul 19 '17 at 07:21
  • @vtd-xml-author thanks for the effort, if you could make this fix available via Maven central this would be awesome as version 2.13 dates back to June 2016 – Roman Vottner Jul 19 '17 at 09:20

1 Answer

You are probably trying to get a string containing characters whose code points are 0x10000 or above, i.e. supplementary characters outside the BMP. That bug is known and is in the process of being addressed... I will notify you once the fix is out. This question may be identical to this one... Map supplementary Unicode characters to BMP (if possible)
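
As an illustration of this kind of failure (a sketch only, not VTD-XML's actual code), narrowing a code point to a single 16-bit char keeps just its low 16 bits. The input U+200A2 below is a hypothetical example chosen because its low 16 bits are 0x00A2, the cent sign; the class and method names are made up.

```java
// Sketch: what happens when a supplementary code point is squeezed into one char.
// This is NOT the library's code; U+200A2 is a hypothetical input.
class TruncationDemo {
    // Wrong for code points >= 0x10000: the cast keeps only the low 16 bits.
    static String truncated(int codePoint) {
        return String.valueOf((char) codePoint);
    }

    // Correct: Character.toChars emits a surrogate pair when one is needed.
    static String full(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        int cp = 0x200A2;
        String wrong = truncated(cp);
        String right = full(cp);
        System.out.printf("truncated: U+%04X%n", (int) wrong.charAt(0));
        System.out.printf("full: %d code units, code point U+%X%n",
                right.length(), right.codePointAt(0));
    }
}
```

Running this prints U+00A2 for the truncated value and two code units with code point U+200A2 for the correctly built string.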

vtd-xml-author
  • thanks for the reply! Yes, it looks like the issue is the same. I used the code from the link that you posted and used the string `` and obtained the same results as the author of that thread (only the lower 16 bits are used). The version of vtd-xml that I'm using is 2.13_2 (the latest I believe). A fix will be very helpful! Thanks! – waylonion Jul 05 '17 at 23:08