I'm having a hard time using mbtowc, which keeps returning wrong results. It also puzzles me why the function uses the locale at all: the mapping from multibyte UTF-8 sequences to Unicode code points is locale-independent. I implemented a custom conversion function that converts correctly; see the code below.
I use GCC 4.8.1 on Windows (where sizeof(wchar_t) is 2) with the Czech locale (cs_CZ). The system ANSI code page is windows-1250; the console uses CP852 (the OEM code page) by default. These are my results so far:
#include <stdio.h>
#include <stdlib.h>

// my custom conversion function
int u8toint(const char* str) {
    if(!(*str&128)) return *str;
    unsigned char c = *str, bytes = 0;
    while((c<<=1)&128) ++bytes;
    int result = 0;
    for(int i=bytes; i>0; --i) result|= (*(str+i)&127)<<(6*(bytes-i));
    int mask = 1;
    for(int i=bytes; i<6; ++i) mask<<= 1, mask|= 1;
    result|= (*str&mask)<<(6*bytes);
    return result;
}

// data inspecting type for the tests in main()
union data {
    wchar_t w;
    struct {
        unsigned char b1, b2;
    } bytes;
} a,b,c;

int main() {
    // I tried setlocale here
    mbtowc(NULL, 0, 0); // reset internal mb_state
    mbtowc(&(a.w),"ř",6); // apply mbtowc
    b.w = u8toint("ř"); // apply custom function
    c.w = L'ř'; // compare to wchar
    printf("\na = %hhx%hhx", a.bytes.b2, a.bytes.b1); // a = 0c5 wrong
    printf("\nb = %hhx%hhx", b.bytes.b2, b.bytes.b1); // b = 159 right
    printf("\nc = %hhx%hhx", c.bytes.b2, c.bytes.b1); // c = 159 right
    getchar();
}
Here are the setlocale settings I tried and the results for a:
setlocale(LC_CTYPE,"Czech_Czech Republic.1250"); // a = 139 wrong
setlocale(LC_CTYPE,"Czech_Czech Republic.852"); // a = 253c wrong
Why doesn't mbtowc give 0x159, the Unicode code point of ř?