36

Why does the following program

#include <stdio.h>
#include <wchar.h>

int main() {
  wprintf(L"Привет, мир!");
}

print "Privet, mir!" on Linux? Specifically, why does it transliterate Russian text in Unicode into Latin as opposed to transcoding it into UTF-8 or using replacement characters?

Demonstration of this behavior on Godbolt: https://godbolt.org/z/36zEcG

The non-wide version printf("Привет, мир!") prints this text as expected ("Привет, мир!").

vitaut
  • 1
    Out of curiosity, why even use `wchar` on Linux? – mcarton Dec 30 '20 at 15:51
  • There is no reason to use `wchar_t` since it's non-portable. I just came across this "interesting" behavior when answering another SO question: https://stackoverflow.com/a/65480111/471164 – vitaut Dec 30 '20 at 16:05
  • In my system, it just prints `??????, ???!`. Could you check `/usr/share/i18n/locales/C` and see if there are any rules starting with `translit` in there? – Heinzi Dec 31 '20 at 11:55
  • @Heinzi, you can check locales on godbolt if interested - there is a link in the question. – vitaut Dec 31 '20 at 15:47

2 Answers

33

Because wide characters are converted to bytes according to the currently set locale. A C program always starts in the "C" locale, which only guarantees support for the basic (ASCII) character set.

You have to switch to any Russian or UTF-8 locale first:

setlocale(LC_ALL, "ru_RU.utf8"); // Russian Unicode
setlocale(LC_ALL, "en_US.utf8"); // English US Unicode

Or to the current system locale (which is likely what you need):

setlocale(LC_ALL, "");

The full program will be:

#include <stdio.h>
#include <wchar.h>
#include <locale.h>

int main() {
  setlocale(LC_ALL, "ru_RU.utf8");
  wprintf(L"Привет, мир!\n");
}

As for your code working as-is on other machines: this is due to how libc operates there. Some implementations (like musl) do not support non-Unicode locales and thus can unconditionally translate wide characters to a UTF-8 sequence.

Laurel
  • 2
    It prints `Privet, mir!` verbatim when I run it on godbolt with or without `setlocale(LC_ALL, "ru_RU.utf8")` or `setlocale(LC_ALL, "")`. – Jabberwocky Dec 29 '20 at 15:27
  • 3
    But why transliteration? Is it documented somewhere? – vitaut Dec 29 '20 at 15:27
  • 7
    @Jabberwocky Do you have the "ru_RU.utf8" locale installed on your computer? If not, then setting it will fail. Use `""` (default locale), which is likely a UTF-8 one. Any Unicode locale will do. –  Dec 29 '20 at 15:27
  • @vitaut I am not sure tbh, but I think it is just illegal to output those characters without locale and libc probably can do whatever it wants. Transliteration is a nice way to produce valid and still readable output. –  Dec 29 '20 at 15:28
  • 3
    @Jabberwocky what locale are you using then? Try "en_US.utf8" if you are in US. –  Dec 29 '20 at 15:29
  • After generating `ru_RU.UTF-8` locale, the program works for me. Note that the _first call_ to any `stdout` functions has to be done _after_ `setlocale`. – KamilCuk Dec 29 '20 at 15:31
  • @KamilCuk it should work with default locale too (if it is set to a unicode locale) - it is important to have this program generate expected output on any machine with unicode support and not tie it to a specific language. –  Dec 29 '20 at 15:33
  • One thing to make sure is that you have a utf compatible locale installed using `locale -a` in a terminal. Then select one from the list that command provides. – Jfm Meyers Dec 30 '20 at 16:38
10

why does it transliterate Russian text in Unicode into Latin as opposed to transcoding it into UTF-8 or using replacement characters?

Because the starting locale of your program is the default one, the C locale, so the wide string is converted to the C locale's encoding. The C locale doesn't handle UTF-8 or any other Unicode encoding, so your standard library does its best to translate wide characters into the basic character set used in the C locale.

You may change the locale to any UTF-8 locale and the program should output a UTF-8 string.

Note: in the implementations I know of, the encoding of the FILE stream is determined and saved at the time the stream orientation (wide vs normal) is chosen. Remember to set the locale before doing anything with stdout (i.e. this vs this).

KamilCuk