Can I use libxml2 with Unicode?
I want to read and write XML files written in Unicode. Is that possible using libxml2 with C++?
2
-
Do you want to know whether libxml2 can process wchar_t*? Or do you want to know whether it supports encodings other than 7-bit ASCII? – Sylvain Defresne Mar 14 '11 at 15:25
2 Answers
3
It would appear that the answer is yes.
Use this XML declaration for UTF-8 content*:
<?xml version="1.0" encoding="UTF-8"?>
*which is what I assume you mean by "unicode"; strictly speaking, Unicode is a character set and UTF-8 is one of its encodings, so the two are not the same thing.
-
Thanks. I read in the link you gave: "xmlChar, the libxml2 data type is a byte, those bytes must be assembled as UTF-8 valid strings." What does that mean? What are bytes assembled as UTF-8? – lital maatuk Mar 20 '11 at 18:16
3
libxml2 uses UTF-8 internally to store values, and converts input from the encoding named in the XML encoding declaration to UTF-8 using iconv. So yes, libxml2 can work with Unicode in that sense.
But if your real question is whether libxml2 accepts wchar_t* as input, then the answer is no. You'll have to convert it to an 8-bit encoding first (UTF-8 is probably the safest bet, since it can encode every Unicode code point).
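
To make the conversion step concrete, here is a minimal sketch of turning a wide string into UTF-8 bytes using only the standard library. The helper names `append_utf8` and `wide_to_utf8` are made up for this illustration and are not part of libxml2. The sketch assumes that a 16-bit `wchar_t` holds UTF-16 (as on Windows) and that a wider `wchar_t` holds one code point per element; check that assumption for your platform.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper (not a libxml2 function): append one Unicode
// code point to `out` as one to four UTF-8 bytes.
void append_utf8(std::string& out, char32_t cp) {
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        throw std::invalid_argument("not a valid Unicode scalar value");
    if (cp < 0x80) {                        // 1 byte: plain ASCII
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {                // 2 bytes
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {              // 3 bytes
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                                // 4 bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: convert a wide string to UTF-8.
// Assumption: 16-bit wchar_t means UTF-16, so surrogate pairs are
// combined; otherwise each wchar_t is treated as one code point.
std::string wide_to_utf8(const std::wstring& wide) {
    std::string out;
    for (std::size_t i = 0; i < wide.size(); ++i) {
        char32_t cp = static_cast<char32_t>(wide[i]);
        if (sizeof(wchar_t) == 2 && cp >= 0xD800 && cp <= 0xDBFF &&
            i + 1 < wide.size()) {
            char32_t low = static_cast<char32_t>(wide[i + 1]);
            if (low >= 0xDC00 && low <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
                ++i;  // the low surrogate has been consumed
            }
        }
        append_utf8(out, cp);  // throws on an unpaired surrogate
    }
    return out;
}
```

The resulting `std::string` holds the kind of bytes libxml2 expects in an `xmlChar*`; since `xmlChar` is an `unsigned char`, passing it on typically needs a cast such as `reinterpret_cast<const xmlChar*>(s.c_str())`.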

Sylvain Defresne
-
I didn't understand what you mean by "libxml2 uses utf8 encoding internally". What is this internal use? – lital maatuk Mar 14 '11 at 16:01
-
There are multiple ways a string containing extended characters can be encoded (iso-8859-1, ascii, shift-jis, utf-8, utf-16, ...). Some of them cover only part of the Unicode character set, others cover it completely. In XML, a document can state which encoding it uses (with the `encoding` attribute of the `<?xml ...?>` declaration). When parsing a document, `libxml2` will convert it to `utf-8` before processing if it is not already in that encoding, and will give you the `utf-8` content. – Sylvain Defresne Mar 14 '11 at 16:06
-
Thanks. So what is the meaning of wchar_t*? Where would it come from if not from the XML file? – lital maatuk Mar 14 '11 at 16:08
-
`wchar_t` is a type defined in the `C` standard (and inherited by `C++`) that represents a wide character. The size and encoding of wide characters are implementation-dependent, but they are frequently used on Windows (the macro `TCHAR` expands to `wchar_t` when compiled in so-called "unicode" mode). I would advise you not to use them. I was asking because lots of Windows programmers assume "unicode == wchar_t" (which is incorrect). So for your usage, I'd say that `libxml2` does support Unicode. – Sylvain Defresne Mar 14 '11 at 16:13