Hey, I'm trying to use VTD-XML to parse XML given to it as a String, but I can't find how to do it. Any help would be appreciated.
2 Answers
5
It seems the VTD-XML library reads its input from a byte array. In that case, I'd suggest converting the String to bytes using the correct encoding.
If there's an encoding declared at the beginning of the XML string:
<?xml version="1.0" encoding="UTF-8"?>
Then use that:
myString.getBytes("UTF-8")
If there's no encoding declared, add one so that VTD-XML knows how to decode the bytes:
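// Declare the encoding explicitly, then encode the String with that same charset.
String withHeader = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>" + myString;
byte[] bytes = withHeader.getBytes("UTF-8");
VTDGen vg = new VTDGen(); // com.ximpleware.VTDGen
vg.setDoc(bytes);
vg.parse(true); // true enables namespace awareness; throws ParseException on malformed input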
Note that in the latter case you can use any valid encoding, as long as the declared encoding matches the one passed to getBytes, because the string you have in memory is encoding-agnostic (it's held as UTF-16 internally, but it is converted when you ask for the bytes).
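To illustrate why the declared encoding and the one passed to getBytes have to agree, here is a minimal sketch using only the JDK. The class name EncodingDemo and the helper withDeclaration are made up for this example and are not part of VTD-XML; the resulting byte[] is what you would hand to setDoc as shown above.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration only; not part of VTD-XML.
final class EncodingDemo {

    // Prepends an XML declaration naming the charset, then encodes the
    // whole text with that same charset so the header and bytes agree.
    // Assumes xmlBody does not already start with its own declaration.
    static byte[] withDeclaration(String xmlBody, Charset charset) {
        String header = "<?xml version=\"1.0\" encoding=\"" + charset.name() + "\"?>";
        return (header + xmlBody).getBytes(charset);
    }

    public static void main(String[] args) {
        // \u00e9 is a non-ASCII character, so the two encodings differ in size.
        String body = "<name>Jos\u00e9</name>";

        byte[] utf8 = withDeclaration(body, StandardCharsets.UTF_8);
        byte[] latin1 = withDeclaration(body, StandardCharsets.ISO_8859_1);

        System.out.println("UTF-8 bytes:      " + utf8.length);
        System.out.println("ISO-8859-1 bytes: " + latin1.length);

        // Decoding with the charset named in the header recovers the text.
        System.out.println(new String(latin1, StandardCharsets.ISO_8859_1));
    }
}
```

The same in-memory String yields different byte sequences depending on the charset, which is why the header has to name the one actually used.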

Dave Jarvis

helios
-
What method do I then use to load it? setDoc? – Concept Feb 23 '10 at 16:58
-
Got it working. Thanks! Yeah it's a Java String object, it's a really fast parser, and I wasn't happy with the block of if statements that SAX requires. The whole token layout is really handy. – Concept Feb 26 '10 at 14:02
-
I'll add the setDoc method for documentation purposes. – helios Feb 26 '10 at 15:00
2
VTD-XML doesn't accept a String because a String implies UTF-16 encoding in memory, which means it is not really an XML document. As defined by the spec, XML is usually encoded in UTF-8, ASCII, ISO-8859-1, or UTF-16LE/BE. Does my answer make sense?

vtd-xml-author
- 3,319
- 4
- 22
- 30
-
Not really... you define the encoding of the XML file in the <?xml ...?> header. And a String is encoded in memory as UTF-16, but you can transform it to match the required encoding. – helios Feb 24 '10 at 00:06
-
If by string you mean Java's String object, then I stand by my answer... if by string you mean an array of bytes, then you are right about using the <?xml ...?> declaration to decide the encoding... I feel the question is really about Java's String object, but I could be wrong – vtd-xml-author Feb 24 '10 at 03:49
-
Does your answer make sense? No. It's possible that the string may contain a prolog which declares an encoding, as helios's answer suggested. So to convert the string to bytes which are suitable for the parser to use, you would have to extract that encoding first, as helios said. But normally it's the parser's job to determine the encoding. All of the parsers I regularly use can accept a Reader as input, which means the parser can ignore the encoding issues because it already gets chars. So if VTD-XML doesn't have a way of parsing from a Reader then it isn't "advanced and powerful". – Paul Clapham Feb 25 '10 at 20:49
-
@Paul: thanks for the comment. I think we should agree on what a string is first. The prolog is there to tell the parser what the encoding format is, so that the byte-to-char conversion can happen properly. An XML document is an array of bytes; a Reader is just one way to look at it, but not the only one, right? So using a Reader to judge the merit of a parser sounds like a weak argument... – vtd-xml-author Feb 25 '10 at 23:05
-
I don't think there's any debate about what a string is. And I agree with your unstated argument that it's kind of peculiar to declare the encoding of something which isn't encoded, but it does happen and I don't think it's unusual. But I don't think it should be especially hard for an XML parser to deal with a Reader, and I do think that a parser which makes grandiose claims for itself should be able to do that little thing. – Paul Clapham Feb 26 '10 at 04:58
-
To add to the very abstract discussion: an XML document can be the "abstract XML" (without encoding) or its representation encoded in bytes (and including a <?xml ...?> declaration). So a String containing
... is for me a valid enough XML (because it's the abstract ideal). As for why the parser can't parse Strings, I think that for optimization it uses the original representation and offsets. Two mutually exclusive options arise: using the byte[] or using the String. The more basic one is byte[] (because of files), so Strings must be converted first (they could provide a converter anyway). – helios Feb 26 '10 at 14:57
-
but I agree that it could receive a String, and 1) call getBytes() 2) "pretend" that <?xml encoding="UTF-16"?> was read. It's only a convenience method, given that String.getBytes always creates a byte[] you could create yourself. – helios Feb 26 '10 at 14:58
-
@helios: thanks for the suggestion. Our view is that the well-formedness of XML applies to its byte representation, as defined by the XML spec; converting a string into a byte array removes any ambiguity. As for your comment on pretending UTF-16, very interesting idea! I will have to think about it. – vtd-xml-author Feb 26 '10 at 20:33
-
@vtd-xml-author It could have a string constructor that looked for the preamble, extracted the encoding, and converted the string to a byte[] with that encoding. If the preamble was missing or did not specify an encoding or String.getBytes(String encoding) threw an exception, it would throw an exception. Otherwise, it would parse the byte[] as it does currently. – David Conrad Mar 08 '13 at 20:27
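The approach described in the last comment could be sketched roughly as follows, using only the JDK. This is an illustration under assumptions, not anything VTD-XML ships: the class name PrologEncoding, the method toBytes, and the simplified regular expression are all invented here, and the pattern only handles a declaration at the very start of the string with version followed by encoding.

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of VTD-XML.
final class PrologEncoding {

    // Simplified pattern: an XML declaration at the very start of the text,
    // with a version attribute followed by an encoding attribute.
    // Group 1 is the quote character, group 2 is the encoding name.
    private static final Pattern DECLARATION = Pattern.compile(
        "^<\\?xml\\s+version\\s*=\\s*(?:\"[^\"]*\"|'[^']*')"
        + "\\s+encoding\\s*=\\s*([\"'])([A-Za-z][A-Za-z0-9._-]*)\\1");

    // Extracts the declared encoding and encodes the String with it.
    // Throws IllegalArgumentException if no encoding is declared; an
    // illegal or unsupported charset name also surfaces as a subclass
    // of IllegalArgumentException thrown by Charset.forName.
    static byte[] toBytes(String xml) {
        Matcher m = DECLARATION.matcher(xml);
        if (!m.find()) {
            throw new IllegalArgumentException("No XML declaration with an encoding found");
        }
        Charset charset = Charset.forName(m.group(2));
        return xml.getBytes(charset);
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><name>Jos\u00e9</name>";
        byte[] bytes = toBytes(xml);
        System.out.println(bytes.length + " bytes for " + xml.length() + " chars");
    }
}
```

The returned byte[] could then be passed to setDoc as in the accepted answer. A production version would need to follow the XML spec's grammar for the declaration more closely than this regular expression does.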