I want to read a short line from a UTF-8 file and display it in the Windows console.

I succeeded with the MultiByteToWideChar WinAPI function:

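#include <locale.h>
#include <stdio.h>
#include <windows.h>

/* The caller must supply an output buffer large enough for the result. */
void mbtowchar(const char* input, WCHAR* output) {
  /* First call: ask for the required length, including the terminator. */
  int len = MultiByteToWideChar(CP_UTF8, 0, input, -1, NULL, 0);
  MultiByteToWideChar(CP_UTF8, 0, input, -1, output, len);
}

int main(void) {
  setlocale(LC_ALL,"");
  char in[256];
  WCHAR out[256];

  FILE* file = fopen("data.txt", "r");
  fgets(in, 255, file);
  fclose(file);

  mbtowchar(in, out);
  printf("%ls",out);
  return 0;
}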

...but I failed with the ISO mbsrtowcs function (non-ASCII characters come out garbled):

#include <locale.h>
#include <stdio.h>
#include <wchar.h>

int main(void) {
  setlocale(LC_ALL,"");
  char in[256];
  wchar_t out[256];

  FILE* file = fopen("data.txt", "r");
  fgets(in, 255, file);
  fclose(file);

  const char* p = in;
  mbstate_t mbs = {0};  /* initial conversion state */
  mbsrtowcs(out, &p, 255, &mbs);

  printf("%ls",out);
  return 0;
}

Am I doing something wrong with mbsrtowcs, or is there some important difference between these two functions? Is it possible to reliably print UTF-8 in the Windows console using ISO functions? (Assuming a matching console font is installed.)

Notes: I use the MinGW gcc compiler. C++ is a last resort for me; I'd like to stay with C.

Jan Turoň
  • Are you sure that `data.txt` is UTF-8 encoded? I'm not sure whether `printf` supports Unicode. There was a `%S` specifier, if I remember correctly, but I'm not sure whether it belongs to standard `printf` or to Win32 `wsprintf`. – i486 Aug 24 '15 at 12:41
  • Yes. Both the source file and the data file are in UTF-8. – Jan Turoň Aug 24 '15 at 12:43
  • I hope your project is compiled with the UNICODE definition. You may try `wprintf`, the Unicode equivalent of `printf`. See MSDN for details. – i486 Aug 24 '15 at 12:46
  • @i486 `printf` is there only to display the Unicode characters. The OP's question is about recoding UTF-8 characters (coming from outside) to UTF-16 using standard C multi-byte/wide functions. – user4815162342 Aug 26 '15 at 11:16

1 Answer

What's "wrong" with mbsrtowcs is that it converts from a system-defined variable-width encoding of 8-bit characters (char) to a fixed-width array of "wide" characters (wchar_t). Wide characters are today understood as Unicode code points, but "multi-byte" does not necessarily imply UTF-8. On Windows it in fact refers to various pre-Unicode encodings of Asian scripts. Frustratingly, Windows doesn't support UTF-8 as a native "multi-byte" encoding at all, and apparently never will.

Thus attempts to use mbsrtowcs to interpret UTF-8 are doomed to fail on Win32. You will have to use MultiByteToWideChar, as your first snippet does, or switch to some other means of converting UTF-8 to UTF-16. (Since UTF-8 and UTF-16 both encode UCS code points, you could even write a simple routine of your own to do that, if your goal is to avoid depending on proprietary extensions.)

user4815162342
  • Why is it doomed? I can't find any link where the details are explained. I mean: if multi-byte was originally meant to support Asian languages, why does that prevent UTF-8 support? Isn't it the same principle? – Jan Turoň Aug 24 '15 at 12:38
  • @JanTuroň One of the answers to the [SO question I linked to](http://stackoverflow.com/questions/2995111/why-isnt-utf-8-allowed-as-the-ansi-code-page) points to a [blog post](http://www.siao2.com/2006/10/11/816996.aspx) by Michael Kaplan that explains it in some detail. In short, supporting UTF-8 would invalidate some assumptions made by the code that implements and works with multi-byte "ANSI" code pages. Since they also consider "ANSI" and multi-byte support obsolete and want to get rid of it, they see no point in investing a huge amount of money and time to extend it to support UTF-8. – user4815162342 Aug 24 '15 at 12:45
  • By *a simple routine*, do you mean a mapper for a particular target charset, or is there a universal solution? – Jan Turoň Aug 24 '15 at 12:46
  • @JanTuroň A universal solution: you can decode UTF-8 to individual code points and encode those to UTF-16 without having any domain knowledge of the properties of particular code points. See, for example, [this answer](http://stackoverflow.com/a/7154226/1600898) (again C++, but you can use it for the algorithm). – user4815162342 Aug 24 '15 at 12:51