wxWidgets and converting to and from unicode code points

Question

I would like to use \u escape sequences in text, but the conversion seems confusing right now.

As far as I understand, \u uses the notation \uXXXX, where each X is a hex digit, and describes a codepoint in utf8? plane? But UTF-8 is a variable-length encoding, so it's not necessarily 4 digits long?

So how does one go about converting wxString[0] -> a '\uXXXX' sequence? Do I use mb_str(wxConvUTF8) or what? All this Unicode conversion stuff seems really confusing to me right now.

And what about the opposite conversion? If I receive input containing '\uXXXX' sequences, what is the correct way to find them inline and convert them to Unicode characters for output?

There is no such thing as a 'codepoint in utf8 plane'. Please describe more simply what you are trying to do. Also specify which version of wxWidgets ( 2.8 or 2.9 ) you are using - 2.9 is a lot easier for this stuff. — ravenspoint, Mar 28 '12 at 11:47
Yes, this stuff is confusing. I agree with ravenspoint, the question would be better if you'd better describe what you're trying to do. The `\u` notation looks like it's a C++11 feature, use `\x` instead. http://stackoverflow.com/questions/6796157/unicode-encoding-for-string-literals-in-c11 — Mark Ransom, Mar 28 '12 at 16:33

ravenspoint · Answer 1 · 2012-03-28T19:23:41.983

So how does one go about converting wxString[0] -> '\uXXXX'

You could do this in wxWidgets v2.9.x:

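wxString x = L"\x014C";
// View the wchar_t buffer as raw bytes; unsigned avoids sign extension with %02X.
const unsigned char* xbuf = (const unsigned char*)x.wc_str();
// Assumes little-endian storage: xbuf[1] is the high byte of the first code unit.
wxString y = wxString::Format("%s = \\u%02X%02X", x, (unsigned)xbuf[1], (unsigned)xbuf[0]);
wxMessageBox(y, "Unicode test");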

Which produces a message box reading "Ō = \u014C".

Notice the order in which the bytes are accessed in xbuf. This is not cross-platform! It depends on the byte order (endianness) of your machine, and the code above assumes the low byte is stored first. This is why UTF8 is often used instead of UTF16.
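If the byte-order dependence is a concern, one way around it is to work with the numeric values of the code points rather than with raw bytes. What follows is only a minimal sketch in standard C++, not wxWidgets code: the names escapeCodePoints and unescapeCodePoints are made up for illustration, the text is assumed to be available as a std::u32string of code points, and any input outside the escapes is assumed to be plain ASCII. Code points above U+FFFF are written as two escapes (a surrogate pair), which is the convention Java and JSON use; whether that suits your format is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// Append one 16-bit code unit as a backslash-u escape with four hex digits.
// Works on the numeric value, so the machine's byte order never matters.
static void appendEscape(std::string& out, std::uint16_t unit) {
    char buf[8];
    std::snprintf(buf, sizeof buf, "\\u%04X", static_cast<unsigned>(unit));
    out += buf;
}

// Hypothetical helper: escape every code point in `text`.
// Code points above U+FFFF become a surrogate pair (two escapes).
std::string escapeCodePoints(const std::u32string& text) {
    std::string out;
    for (char32_t cp : text) {
        assert(cp <= 0x10FFFF && "not a Unicode code point");
        if (cp < 0x10000) {
            appendEscape(out, static_cast<std::uint16_t>(cp));
        } else {
            const std::uint32_t v = static_cast<std::uint32_t>(cp) - 0x10000;
            appendEscape(out, static_cast<std::uint16_t>(0xD800 + (v >> 10)));
            appendEscape(out, static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF)));
        }
    }
    return out;
}

// Value of a hex digit, or -1 if `c` is not one.
static int hexValue(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Try to read a six-character escape starting at s[i]; true on success.
static bool readEscape(const std::string& s, std::size_t i, std::uint32_t& value) {
    if (i + 6 > s.size() || s[i] != '\\' || s[i + 1] != 'u') return false;
    value = 0;
    for (std::size_t k = i + 2; k < i + 6; ++k) {
        const int h = hexValue(s[k]);
        if (h < 0) return false;
        value = value * 16 + static_cast<std::uint32_t>(h);
    }
    return true;
}

// Hypothetical helper: find escapes inline and turn them back into code points.
// Everything else is copied through byte by byte (assumes ASCII input).
std::u32string unescapeCodePoints(const std::string& s) {
    std::u32string out;
    std::size_t i = 0;
    while (i < s.size()) {
        std::uint32_t hi = 0, lo = 0;
        if (readEscape(s, i, hi)) {
            i += 6;
            // A high surrogate followed by a low surrogate forms one code point.
            if (hi >= 0xD800 && hi <= 0xDBFF && readEscape(s, i, lo) &&
                lo >= 0xDC00 && lo <= 0xDFFF) {
                i += 6;
                out += static_cast<char32_t>(0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00));
            } else {
                out += static_cast<char32_t>(hi);
            }
        } else {
            out += static_cast<char32_t>(static_cast<unsigned char>(s[i]));
            ++i;
        }
    }
    return out;
}
```

Getting the code points out of a wxString and back in again is left out here on purpose, since the right call depends on your wxWidgets version and build.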

edited Mar 28 '12 at 19:23

answered Mar 28 '12 at 13:00

ravenspoint

What a strange use of fn_str() for something that doesn't look at all like a filename. – VZ. Mar 28 '12 at 18:12
"This is why UTF8 is often used instead of UTF16." The reason UTF-8 is "often used" is because it requires no actual work to support for many C or C++ APIs. They just take a `char*` like always; they don't have to take a new string type. – Nicol Bolas Mar 28 '12 at 19:28
@NicolBolas An interesting idea. It isn't true for the Windows API or wxWidgets v2.9.x, both use UTF16 and convert any ASCII or UTF8 string passed to them before doing anything else. For other libraries and OSes, I will leave it for experts to say. – ravenspoint Mar 28 '12 at 20:09
