
I am having a hard time using mbtowc, which keeps returning wrong results. It also puzzles me why the function uses the locale at all: Unicode code points are locale-independent. I implemented a custom conversion function that converts correctly; see the code below.

I use GCC 4.8.1 on Windows (where sizeof(wchar_t) is 2) with the Czech locale (cs_CZ). The ANSI code page is windows-1250, and the console uses the OEM code page CP852 by default. These are my results so far:

#include <stdio.h>
#include <stdlib.h>

// my custom conversion function
int u8toint(const char* str) {
  if(!(*str&128)) return *str;
  unsigned char c = *str, bytes = 0;
  while((c<<=1)&128) ++bytes;
  int result = 0;
  for(int i=bytes; i>0; --i) result|= (*(str+i)&127)<<(6*(bytes-i));
  int mask = 1;
  for(int i=bytes; i<6; ++i) mask<<= 1, mask|= 1;
  result|= (*str&mask)<<(6*bytes);
  return result;
}

// data inspecting type for the tests in main()
union data {
  wchar_t w;
  struct {
    unsigned char b1, b2;
  } bytes;
} a,b,c;

int main() {
  // I tried setlocale here
  mbtowc(NULL, 0, 0); // reset internal mb_state
  mbtowc(&(a.w),"ř",6); // apply mbtowc
  b.w = u8toint("ř");   // apply custom function
  c.w = L'ř';           // compare to wchar

  printf("\na = %hhx%hhx", a.bytes.b2, a.bytes.b1); // a = 0c5 wrong
  printf("\nb = %hhx%hhx", b.bytes.b2, b.bytes.b1); // b = 159 right
  printf("\nc = %hhx%hhx", c.bytes.b2, c.bytes.b1); // c = 159 right
  getchar();
}

Here are the setlocale settings I tried and the resulting values of a:

setlocale(LC_CTYPE,"Czech_Czech Republic.1250"); // a = 139 wrong
setlocale(LC_CTYPE,"Czech_Czech Republic.852"); //  a = 253c wrong

Why doesn't mbtowc give 0x159, the Unicode code point of ř?
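
For reference, the expected value 0x159 can be worked out by hand from the UTF-8 bytes. This is only a minimal sketch, assuming the source file is saved as UTF-8 so that "ř" is stored as the two bytes 0xC5 0x99; the helper name `decode_2byte_utf8` is made up for illustration and handles only two-byte sequences.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: decode one 2-byte UTF-8 sequence (110xxxxx 10xxxxxx).
   Returns the code point, or -1 if the bytes do not form such a sequence. */
long decode_2byte_utf8(const unsigned char *s) {
    assert(s != NULL);
    if ((s[0] & 0xE0) != 0xC0) return -1;   /* lead byte must be 110xxxxx */
    if ((s[1] & 0xC0) != 0x80) return -1;   /* continuation must be 10xxxxxx */

    /* 5 payload bits from the lead byte, 6 from the continuation byte:
       0xC5 -> 0x05, 0x99 -> 0x19, and (0x05 << 6) | 0x19 = 0x159. */
    long cp = ((long)(s[0] & 0x1F) << 6) | (long)(s[1] & 0x3F);

    if (cp < 0x80) return -1;               /* reject overlong encodings */
    return cp;
}
```

Calling `decode_2byte_utf8((const unsigned char *)"\xc5\x99")` gives 0x159. A conversion that does not treat the input as UTF-8 sees the lead byte 0xC5 as a character of its own, which is consistent with the values of a listed above.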

Jan Turoň
  • Putting non-ASCII glyphs in source code is risky, now it matters what encoding your text editor used and what the compiler guessed at. Looks like it was utf8, turns "ř" into 0xc5 0x99. Which is not a multibyte code. But 0xc5 properly produces U+0139 in code page 1250 for the Ĺ glyph and U+253C in code page 852 for the box drawing glyph. Not that clear what u8toint() does but it also seems to assume utf8. Consider switching the console to utf8 as well to stop the bleeding. – Hans Passant Feb 12 '17 at 20:28
  • `mbtowc` converts from a multibyte character *encoded in the current locale* to a wchar_t. `ř` in CP1250 is `"\xf8"` and in CP852 is `"\xfd"`. Try converting those to wide using the corresponding encoding and you'll get the right answer. – Mark Tolonen Feb 12 '17 at 20:54
  • @HansPassant The source is in UTF-8. Although `chcp 65001` switches the console to UTF-8 output, [the locale can't be switched to UTF-8 on Windows](http://stackoverflow.com/q/4324542/343721). Interestingly, `mbstowcs` converts the bytes in string well using only the default "C" locale. – Jan Turoň Feb 12 '17 at 21:16
  • @MarkTolonen that answers it well. The solution could be to store the source in the same CP as the console locale, which can not be UTF-8 on Windows. Looks like `mbtowc` is doomed to fail on Windows with UTF-8. – Jan Turoň Feb 12 '17 at 21:23
  • The Win32 API [MultiByteToWideChar](https://msdn.microsoft.com/en-us/library/windows/desktop/dd319072(v=vs.85).aspx) handles UTF-8. – Mark Tolonen Feb 13 '17 at 00:26
  • @HansPassant, setting the console to codepage 65001 is buggy. Even the WSL developers can't make it work. I hope they're at least aware of the problem. It's much worse in Windows 7, but there's still a major flaw with this in Windows 10. The console doesn't allow reading non-ASCII from the input buffer in codepage 65001 due to hard-coded ANSI assumptions about the buffer size passed to `WideCharToMultiByte` in conhost.exe. It assumes 1 byte per character (or 2 bytes for DBCS), so it reads 0 bytes (EOF) if even a single non-ASCII character is entered, since that's 2-4 bytes in UTF-8. – Eryk Sun Feb 14 '17 at 15:21
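
Mark Tolonen's point above, that `mbtowc` expects its input bytes in the encoding of the current locale, can be sketched roughly as follows. The helper name `mbtowc_in_locale` and its -2 return code are assumptions made for this illustration, and locale names are platform-specific, so this is a sketch rather than a tested fix for the setup in the question.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert the first multibyte character of `bytes`
   while LC_CTYPE is temporarily set to `locale_name`.
   Returns what mbtowc returns (bytes consumed, 0 for a null character,
   -1 for an invalid sequence), or -2 if the locale is unavailable. */
int mbtowc_in_locale(const char *locale_name, const char *bytes, wchar_t *out) {
    assert(locale_name != NULL && bytes != NULL && out != NULL);

    /* Copy the current locale name so it can be restored afterwards;
       the string returned by setlocale may be overwritten by later calls. */
    char saved[256];
    const char *current = setlocale(LC_CTYPE, NULL);
    strncpy(saved, current ? current : "C", sizeof saved - 1);
    saved[sizeof saved - 1] = '\0';

    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return -2;                      /* locale name not recognised */

    mbtowc(NULL, NULL, 0);              /* reset any shift state */
    int n = mbtowc(out, bytes, strlen(bytes) + 1);

    setlocale(LC_CTYPE, saved);         /* restore the previous locale */
    return n;
}
```

On the Windows setup from the question, the idea would be to call it as `mbtowc_in_locale("Czech_Czech Republic.1250", "\xf8", &w)` or `mbtowc_in_locale("Czech_Czech Republic.852", "\xfd", &w)`. Going by the comment above, both should store 0x159 in `w`, though that has not been verified here.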
