
I'm experiencing a very strange problem... The following trivial test code works as it should when placed in a standalone Cocoa application, but when I use it in one of my frameworks, I get completely unexpected results...

#include <stdlib.h>
#include <wchar.h>

wchar_t Buf[2048];
wcscpy(Buf, L"/zbxbxklbvasyfiogkhgfdbxbx/bxkfiorjhsdfohdf/xbxasdoipppwejngfd/gjfdhjgfdfdjkg.sdfsdsrtlrt.ljlg/fghlfg");
int len1 = wcslen(L"/zbxbxklbvasyfiogkhgfdbxbx/bxkfiorjhsdfohdf/xbxasdoipppwejngfd/gjfdhjgfdfdjkg.sdfsdsrtlrt.ljlg/fghlfg");
int len2 = wcslen(Buf);

char Buf2[2048];
Buf2[0] = 0;
wcstombs(Buf2, Buf, 2048);

// ??? Buf2 == ""
// ??? len1 == len2 == 57, but should be 101


How can this be, have I gone mad? Even if there were memory corruption, it couldn't possibly corrupt all of these stack-allocated values... Why doesn't even wcslen(L"MyWideString") work? Changing the test string changes its length, but it is always wrong, and wcstombs returns -1...

setlocale() is not used anywhere, and the test string contains only ASCII characters. To ease porting I use the -fshort-wchar compiler option, but everything works fine in the test Cocoa application...

Please help!

Ryan
  • I could understand it if the length of the wide string was more than 101 bytes, but how can it be less??? – Ryan Jun 14 '11 at 10:42
  • Ok, by the looks of things the problem was caused by -fshort-wchar, which according to Google breaks the wide-string routines... However, I still do not understand why wcscpy() and wcslen() work perfectly fine in a separate test application... – Ryan Jun 14 '11 at 12:24
  • This doesn't compile; what is `WBUF`? – Kerrek SB Jun 16 '11 at 01:11
  • You should always call `setlocale(LC_CTYPE, "");` or equivalent before using `wcstombs` and `mbstowcs`. I just ran your code on Linux and get 101 for len1, len2, the result of the `wcstombs` call and for `strlen(Buf2);` (see the sketch after these comments). – Kerrek SB Jun 16 '11 at 01:14
  • WBUF was a mistake, fixed. Kerrek, thanks for the valuable point. Can you please tell me whether it is all right to call setlocale() in the application's entry point? Should I restore the previous locale when the application terminates? – Ryan Jun 16 '11 at 05:55
  • By the way, should I really use LC_CTYPE instead of LC_ALL? I don't really get it... – Ryan Jun 16 '11 at 06:56
  • @Ryan: Oh, CTYPE is just the minimum amount of locale info needed for the multibyte conversion. If you need the other locale stuff too, by all means use ALL. Yes, definitely call `setlocale` before you do anything string-related. You don't need to "change it back" because you're not changing anything outside your program (though you may want to use the "C" locale at some point for whatever reason). If you need lots of concurrent locales, use C++ and `std::locale`. – Kerrek SB Jun 16 '11 at 10:28
  • @Ryan: Let me clarify the last statement: If you only change CTYPE, you will probably never need to change that to anything else again, because the only purpose of that is to allow you to interpret the argument (and environment) bytestrings correctly via `mbstowcs`. If you change other things, e.g. number formatting, you may want to change that back and forth -- e.g. print French to the user but "C" to the log file... you cannot really "save" the old locale; you start with "C" and then it's up to you to remember what you changed. – Kerrek SB Jun 16 '11 at 10:34
  • @Ryan: Perhaps you'd like to check out [my other post](http://stackoverflow.com/questions/6300804/wchars-encodings-standards-and-portability)? – Kerrek SB Jun 16 '11 at 10:35
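
To illustrate the setlocale() advice from the comments above, here is a minimal, self-contained sketch (the path string is my own placeholder; build it without -fshort-wchar):

#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

int main(void)
{
    /* Pick up the environment's locale so that wcstombs/mbstowcs
       know which multibyte encoding to convert to and from. */
    setlocale(LC_CTYPE, "");

    const wchar_t *wide = L"/just/an/ascii/test/path";
    char narrow[256];

    size_t n = wcstombs(narrow, wide, sizeof narrow);
    if (n == (size_t)-1) {
        fprintf(stderr, "wcstombs failed\n");
        return EXIT_FAILURE;
    }
    printf("converted %zu bytes: %s\n", n, narrow);
    return EXIT_SUCCESS;
}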

3 Answers


The wide-character implementation in C/C++ can be anything, including 1 byte, 2 bytes or 4 bytes. It depends on the compiler and the platform you are compiling for.

Wikipedia is probably not the best place to quote from, but in this case http://en.wikipedia.org/wiki/Wide_character states that

... width of wchar_t is compiler-specific and can be as small as 8 bits.

and

... wide characters should be 16-bit values under C90 due to historical compatibility reasons. C and C++ compilers that comply with the 10646-1:2000 Unicode standard generally assume 32-bit values....

So, do not assume; use sizeof(wchar_t).
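
For example, a quick sanity check like the following (a minimal sketch, nothing platform-specific assumed) prints what your toolchain actually uses:

#include <stdio.h>
#include <wchar.h>

int main(void)
{
    /* wchar_t width is implementation-defined, so query it instead
       of hard-coding an assumption. */
    printf("sizeof(wchar_t) = %zu\n", sizeof(wchar_t));
    /* 3 characters plus the terminating L'\0' = 4 units. */
    printf("sizeof(L\"abc\") = %zu\n", sizeof(L"abc"));
    return 0;
}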

sorin
  • Well, in my case the problem is not related to sizeof(wchar_t), since it is always 2; I use the same compiler and force 32-bit mode. – Ryan Jun 14 '11 at 11:36
  • Anyway, I would not use anything that converts to MBS, just because it is OS-configuration dependent and especially because it can fail. I would even say that MBS is *obsolete*. – sorin Jun 14 '11 at 11:42

I've just tested this again with GCC 4.6. With the default settings this works as expected, giving 101 for all the lengths. However, with your option -fshort-wchar I also get unexpected results (51 in my case, and 251 for the final conversion after using setlocale()).

So I looked up the man entry for the option:

Warning: the -fshort-wchar switch causes GCC to generate code that is not binary compatible with code generated without that switch. Use it to conform to a non-default application binary interface.

I think that explains it: When you're linking to the standard library, you are expected to use the correct ABI and type conventions, which you are overriding with that option.
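
To make the mismatch concrete, here is a toy model (my own illustration, not glibc's actual code): a wcslen built for 4-byte wchar_t scanning a buffer laid out in the 2-byte units that -fshort-wchar produces for wide string literals:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Toy model of a wcslen from a C library built for 4-byte wchar_t:
   it scans the buffer in 32-bit steps and stops at a 32-bit zero. */
static size_t wcslen32(const void *p)
{
    const unsigned char *b = p;
    size_t n = 0;
    for (;;) {
        uint32_t u;
        memcpy(&u, b + 4 * n, sizeof u);
        if (u == 0)
            return n;
        n++;
    }
}

int main(void)
{
    /* What -fshort-wchar emits for L"abcde": 16-bit units plus a
       16-bit terminator (padded here so the 32-bit scan ends safely). */
    uint16_t shortwide[8] = { 'a', 'b', 'c', 'd', 'e', 0, 0, 0 };

    /* The 32-bit scan sees ('a','b'), ('c','d'), ('e',0) as nonzero
       units and stops at (0,0): it reports 3 instead of 5. */
    printf("%zu\n", wcslen32(shortwide));
    return 0;
}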

Kerrek SB

-fshort-wchar changes the compiler's ABI, so you need to recompile glibc, libgcc and every other library that uses wchar_t. Otherwise, wcslen and the other wide-string functions in glibc still assume that wchar_t is 4 bytes.

see: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42092
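
If rebuilding everything is not an option, you can at least catch the mismatch at build time with a static assertion (a minimal sketch, assuming C11 and a C library built for the default 4-byte wchar_t):

#include <wchar.h>

/* Fail compilation if -fshort-wchar shrinks wchar_t while we still
   link against a C library built for 4-byte wchar_t. */
_Static_assert(sizeof(wchar_t) == 4,
               "wchar_t ABI mismatch: rebuild the libraries or drop -fshort-wchar");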

ASBai