-1

I would like to use \u escape sequences in text, but the conversion seems confusing right now.

As far as I understand, \u uses the notation \uXXXX, where each X is a hex digit, and describes a code point in UTF-8? A plane? But UTF-8 is a variable-length encoding, so it's not necessarily 4 digits long?

So how does one go about converting wxString[0] into a '\uXXXX' sequence? Do I use mb_str(wxConvUTF8) or what? All this Unicode conversion stuff seems really confusing to me right now.

And what about the opposite conversion? If I receive input containing '\uXXXX' sequences, what is the correct way to find them inline and convert them to Unicode characters for output?

Hossein
Coder
  • 1
    There is no such thing as a 'codepoint in utf8 plane'. Please describe more simply what you are trying to do. Also specify which version of wxWidgets ( 2.8 or 2.9 ) you are using - 2.9 is a lot easier for this stuff. – ravenspoint Mar 28 '12 at 11:47
  • Yes, this stuff is confusing. I agree with ravenspoint, the question would be better if you'd better describe what you're trying to do. The `\u` notation looks like it's a C++11 feature, use `\x` instead. http://stackoverflow.com/questions/6796157/unicode-encoding-for-string-literals-in-c11 – Mark Ransom Mar 28 '12 at 16:33

1 Answer

1

So how one goes in converting wxString[0] -> '\uXXXX'

You could do this in wxWidgets v2.9.x:

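wxString x = L"\x014C";  // U+014C, LATIN CAPITAL LETTER O WITH MACRON
// View the wide-character buffer as raw bytes; unsigned avoids sign extension in %02X.
const unsigned char* xbuf = (const unsigned char*)x.wc_str();
// High byte first, then low byte: this indexing assumes a little-endian machine.
wxString y = wxString::Format("%s = \\u%02X%02X",x,xbuf[1],xbuf[0]);
wxMessageBox(y,"Unicode test");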

This displays a message box reading "Ō = \u014C".

Notice the order in which the bytes are accessed in xbuf. This is not cross-platform! It depends on the byte order (endianness) of your machine, that is, on how the bytes of a word are laid out in memory. This byte-order dependence is one reason UTF-8 is often used instead of UTF-16.
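To avoid depending on byte order, you can work with code point values instead of raw bytes. The following is a minimal sketch in standard C++ (no wxWidgets calls), covering both directions. The names `HexValue`, `EscapeNonAscii` and `UnescapeUnicode` are hypothetical helpers invented for this example. It assumes the text is already available as a sequence of code points, that literal characters in escaped input are plain ASCII, and that code points above U+FFFF are written in the 8-digit `\UXXXXXXXX` form; other conventions (such as surrogate pairs) would need different handling.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: value of one hex digit, or -1 if c is not a hex digit.
int HexValue(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Hypothetical helper: keep ASCII as-is, write other code points as
// \uXXXX (up to U+FFFF) or \UXXXXXXXX (above U+FFFF).
std::string EscapeNonAscii(const std::u32string& text)
{
    std::string out;
    for (char32_t cp : text) {
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else {
            char buf[12];
            const char* fmt = (cp <= 0xFFFF) ? "\\u%04X" : "\\U%08X";
            std::snprintf(buf, sizeof buf, fmt, static_cast<unsigned>(cp));
            out += buf;
        }
    }
    return out;
}

// Hypothetical helper: scan for \uXXXX and \UXXXXXXXX and turn each into one
// code point. Malformed escapes are copied through unchanged.
std::u32string UnescapeUnicode(const std::string& text)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < text.size()) {
        if (text[i] == '\\' && i + 1 < text.size() &&
            (text[i + 1] == 'u' || text[i + 1] == 'U')) {
            const std::size_t digits = (text[i + 1] == 'u') ? 4 : 8;
            if (i + 2 + digits <= text.size()) {
                char32_t cp = 0;
                bool ok = true;
                for (std::size_t k = 0; k < digits; ++k) {
                    const int v = HexValue(text[i + 2 + k]);
                    if (v < 0) { ok = false; break; }
                    cp = cp * 16 + static_cast<char32_t>(v);
                }
                if (ok) {
                    out += cp;
                    i += 2 + digits;
                    continue;
                }
            }
        }
        // Assumption: anything outside an escape is a single-byte ASCII character.
        out += static_cast<char32_t>(static_cast<unsigned char>(text[i]));
        ++i;
    }
    return out;
}
```

Because the arithmetic is done on the numeric value of each code point, the result does not change with the machine's byte order. In wxWidgets 2.9 the code point of a character should be obtainable with `wxUniChar::GetValue()` (for example `x[0].GetValue()`), which could feed a routine like `EscapeNonAscii`; check how your build stores strings before relying on this for characters above U+FFFF.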

ravenspoint
  • 1
    What a strange use of fn_str() for something that doesn't look at all like a filename. – VZ. Mar 28 '12 at 18:12
  • "This is why UTF8 is often used instead of UTF16." The reason UTF-8 is "often used" is because it requires no actual work to support for many C or C++ APIs. They just take a `char*` like always; they don't have to take a new string type. – Nicol Bolas Mar 28 '12 at 19:28
  • @NicolBolas An interesting idea. It isn't true for the Windows API or wxWidgets v2.9.x, both use UTF16 and convert any ASCII or UTF8 string passed to them before doing anything else. For other libraries and OS's, I will leave it for experts to say. – ravenspoint Mar 28 '12 at 20:09