6

I am trying to call printf() to output a Unicode character/string using %s, but it doesn't print anything.

If I call printf() like this:

 printf("\xE2\x98\xA0")

I get a ☠.

But, if I use %ls like this:

printf("%ls", "☠")  /* or */
printf("%ls", L"☠") /* or */
printf("%ls", L"\xE2\x98\xA0")

I get nothing printed.

Also, how can I declare a wchar_t string with Unicode characters inside of it? wchar_t wstro[50] = L"☠" doesn't work.

Do I need to malloc() a wchar_t then put Unicode data in it?

Remy Lebeau
  • Code examples use lower case `s`. Suggest editing your text/title to reflect `s` rather than `S` - if that is really what you are doing. Case matters in C. – chux - Reinstate Monica Apr 27 '18 at 17:25
  • Ref: [Unicode Character 'SKULL AND CROSSBONES'](https://www.fileformat.info/info/unicode/char/2620/index.htm) – chux - Reinstate Monica Apr 27 '18 at 17:35
  • What does `printf("<%s>\n", u8"\xE2\x98\xA0");` (this needs a C11 compiler) or `printf("<%s>\n", "\xE2\x98\xA0");` print for you? – chux - Reinstate Monica Apr 27 '18 at 17:49
  • You need to add a `\n` at the end of printf – Nark Apr 27 '18 at 19:58
  • If `printf("\xE2\x98\xA0")` works but `printf("%ls", L"☠")` doesn't, `printf` is not encoding the Unicode string as UTF-8, and the console isn't handling whatever `printf` is actually encoding to. `E2 98 A0` is the UTF-8 encoded form of `☠`, but `"\xE2\x98\xA0"` is not the same as `L"\xE2\x98\xA0"`. You can always encode Unicode strings to UTF-8 manually, such as with platform APIs or 3rd party libraries (although UTF-8 is not hard to implement from scratch, either). Also, `wchar_t wstro[50] = L"☠"` works fine, so if it is not working for you, you likely have a buggy compiler. – Remy Lebeau Apr 27 '18 at 20:08
  • `wchar_t` and `L` prefix is for Windows, but it sounds like you are on Mac or Linux – Barmak Shemirani Apr 27 '18 at 20:25
  • For what it's worth, `printf("%ls", "☠")` works for me, on Ubuntu with gcc. The others don't print anything at all. – Arndt Jonasson Apr 27 '18 at 20:36
  • What OS are you using? What encoding is the source file saved as? These details matter when dealing with console I/O. – Mark Tolonen Apr 28 '18 at 00:41
  • @BarmakShemirani `wchar_t` and the `L` prefix are NOT limited to just Windows. They are part of the language standard and are implemented on every platform. Just not the same way on every platform. – Remy Lebeau Apr 29 '18 at 06:39
  • @ArndtJonasson Are you sure you succeeded with `printf("%ls", "☠")`? The format `%ls` is for `wchar_t*`, maybe you tried `printf("%s", "☠")` – Barmak Shemirani Apr 30 '18 at 02:41
  • @RemyLebeau I can't succeed with `wchar_t` on [ideone.com](http://ideone.com/7eLCGb) - it's expecting UTF8. The compiler is expected to understand `wchar_t*` strings, but non-Windows system may not know what to do with it. – Barmak Shemirani Apr 30 '18 at 03:02
  • Yeah, still the same problem: when I printf("%ls\n", wstro) with wstro[2] = "0xC9", I get an error message on stdout, "printf: Invalid or incomplete multibyte or wide character". And I am on Windows 10, using CLion. At school I am on a Mac; it's the same result. –  Apr 30 '18 at 03:16
  • @BarmakShemirani No, you are right. I meant to write that `printf("%s", "☠")` works for me. – Arndt Jonasson Apr 30 '18 at 03:39
  • Warning: Microsoft does not follow the ISO C standard for `%s` and `%ls` in format strings. It would be useful to say whether you are using a Microsoft implementation or not – M.M Apr 30 '18 at 03:45

3 Answers

5

You are confusing Unicode with UTF-8, and both with wchar_t.

Unicode is something abstract: code points, combining characters and other properties.

UTF-8 is a common way to encode Unicode. It is compatible with ASCII (for ASCII-only strings) and with C strings (zero-terminated, with no other 0 bytes inside the string). \xE2\x98\xA0 is the UTF-8 representation of ☠ (U+2620).

The ☠ character in your source file is probably also encoded in UTF-8. This depends on your editor, but often editors do not use wchar_t.

So: with UTF-8 you should just use %s and not %ls. Your three %ls attempts are therefore wrong.

In general, use UTF-8, and therefore char* and the normal string functions (just do not break strings at a random byte; this also means not breaking a string after a UTF-8 code point that is followed by combining code points).

You may use wchar_t, but usually only with APIs or protocols that require it, and even then take extra care: the size of wchar_t may not match the character size of the encoding you expect (e.g. wchar_t may be just 2 bytes on your system, so you can store UCS-2 but not UTF-32, or the contrary if the system defines wchar_t as 4 bytes).

So keep things simple: just use UTF-8, and handle it as normal C strings.
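
A minimal sketch of this approach (assuming your terminal expects UTF-8, which is typical on Linux and macOS):

#include <stdio.h>

int main(void)
{
    const char *skull = "\xE2\x98\xA0";   /* UTF-8 bytes for U+2620 ☠ */
    printf("%s\n", skull);
    printf("%s\n", "☠");                  /* same bytes, if the source file is saved as UTF-8 */
    return 0;
}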

Giacomo Catenazzi
  • Ok, now I understand much more what the differences between the 3 are. I succeeded in printing a multibyte character with %ls, but what I don't understand now is why %s can also print a multibyte character. For example, wchar_t x[] = L"a\xEF\xB7\xB0z"; printf("%ls\n", x); prints the same as printf("%s\n", "a\xEF\xB7\xB0z"); –  Apr 28 '18 at 05:56
  • But the character printed is encoded in 3 bytes: ﷰ = 11101111 10110111 10110000, so why can %s print it? –  Apr 28 '18 at 06:02
  • Because C sees them as normal strings. You should not use the string length as a count of characters, but many properties are similar. A C string is an array of chars (equivalent to bytes in the current epoch), terminated with `\0`. UTF-8 is constructed (by design) to be compatible with C strings, so `\0` is found only at the end of a UTF-8 string. In UTF-16 the terminator is `\0\0`, and a `\0` byte appears e.g. in every ASCII and Latin-1 character, so the two are not compatible; that is why there is `wchar_t`. Note that already with ASCII (in the control characters) there were sequences of characters treated as a single entity. – Giacomo Catenazzi Apr 28 '18 at 17:02
  • *often editors do not use wchar_t* It doesn't have anything to do with the editor. If it's a new editor/compiler on a Unix based system, it most likely understands `wchar_t buf[] = L"☠"`, but it doesn't support printing it. – Barmak Shemirani Apr 30 '18 at 03:07
  • In those snippets you are outputting 3 bytes, and your terminal or display environment is interpreting them as UTF-8. – M.M Apr 30 '18 at 03:48
  • I am sorry, but I still don't understand how I can use the %ls of printf. wchar_t *wstr = L"é"; printf("%ls\n", wstr); doesn't work. The error message on stderr is "printf: Invalid or incomplete multibyte or wide character". I tried to change the file encoding in CLion from UTF-8 to UTF-16 but it's worse. –  Apr 30 '18 at 04:02
  • Do not use the `l` length modifier with `%s`. UTF-8 strings are strings of chars, not strings of wchar_t. – Giacomo Catenazzi Apr 30 '18 at 05:59
5

This answer assumes you are working in MS Windows


It's pretty sad that we're in 2018 and this stuff still doesn't work properly. But here is the state of things:

printf("\xE2\x98\xA0"); (which is the same as printf("%s", "\xE2\x98\xA0");) works because you are just outputting 3 characters to the output stream. There is no Unicode or special character processing occuring in the C language. It is your terminal environment which looks for UTF-8 strings in the output and chooses display glyphs accordingly.

Similarly, if you wrote the output to a file (using fprintf, or stream redirection) you would see the file contains 0xE2, 0x98, 0xA0 and then you may choose to use a text file viewer that converts UTF-8 to display glyphs.

This part is all fine, and you can (and probably should) write your program to only ever write UTF-8 encoded characters to FILE streams.
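
A minimal sketch of that (the file name a.txt is just an example); a hex dump of the file should show exactly E2 98 A0, and a UTF-8 aware viewer shows ☠:

#include <stdio.h>

int main(void)
{
    FILE *fp = fopen("a.txt", "w");
    if (fp == NULL)
        return 1;
    fprintf(fp, "%s\n", "\xE2\x98\xA0");   /* writes the raw UTF-8 bytes, no translation */
    fclose(fp);
    return 0;
}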


The problem starts when we want to output wchar_t characters. In theory this should work:

printf("%ls", L"\u2620");   

What is supposed to happen is that wcstombs is called to convert the Unicode code point sequence into a multi-byte sequence. But which multi-byte format should be used? UTF-8 has become ubiquitous now, but in the past there were also other formats like Shift-JIS, Big-5 etc.

You have to specify the multibyte format by using setlocale. And the details of locales are implementation-defined.
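
For reference, a minimal sketch of that "in theory" path; it is expected to work on a C runtime whose locale can be UTF-8 (e.g. glibc with an en_US.UTF-8 locale), which, as described next, is exactly what the classic Windows CRT does not offer:

#include <locale.h>
#include <stdio.h>
#include <wchar.h>

int main(void)
{
    setlocale(LC_ALL, "");        /* pick up the environment's locale and encoding */
    printf("%ls\n", L"\u2620");   /* the wide string is converted to multibyte using that locale */
    return 0;
}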

Here's the kicker. There is no C locale supported by Windows for general UTF-8 output. If you try setlocale(LC_CTYPE, ".65001"); it just doesn't work.

You can output certain subsets of Unicode by using a supported locale. For example, the MSDN example using Japanese_Japan.932 works, outputting the Unicode input as Shift-JIS (not UTF-8).

What's worse is that the Windows API function WideCharToMultiByte does accept CP_UTF8 as its code page. You can use this function to convert L"\u2620" to a char buffer and printf that, producing UTF-8 output.

But of course you cannot "plug this in" to the FILE stream processing, which only calls wcstombs and not WideCharToMultiByte.
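
A hedged sketch of that manual route (Windows only; the buffer size is chosen generously for this short string):

#include <stdio.h>
#include <windows.h>

int main(void)
{
    char utf8[16];
    /* CP_UTF8 conversion; -1 means the input wide string is null-terminated */
    int n = WideCharToMultiByte(CP_UTF8, 0, L"\u2620", -1, utf8, sizeof utf8, NULL, NULL);
    if (n > 0)
        printf("%s\n", utf8);   /* prints the UTF-8 bytes; the terminal must still be UTF-8 aware */
    return 0;
}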

Why didn't they allow ".UTF-8" as a locale for wcstombs? Malicious behaviour? Who knows.


The next thing that should work in theory is:

FILE *fp = fopen("a.txt", "w");
fwide(fp, 1);
fwprintf(fp, L"\u2620");

However, in actuality the MS runtime doesn't do anything with fwide; it doesn't support wide-oriented streams. The Microsoft implementations of the wprintf family actually just output narrow characters, not wide characters, and they use the same wcstombs method that the narrow printf family does.

So, that code doesn't work, and the code from the Japanese wcstombs example, fwprintf(fp, L"\u3603"); (with the .932 CP set) outputs the multibyte sequence instead of the raw wide character.

To write a UTF-16 file via the stdio.h API you actually have no choice but to use narrow characters and treat it like a binary file.
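
A hedged sketch of that binary-file approach, assuming a Windows toolchain where wchar_t is 2 bytes (so the in-memory wide string is already UTF-16LE):

#include <stdio.h>
#include <wchar.h>

int main(void)
{
    FILE *fp = fopen("utf16.txt", "wb");   /* binary mode: no newline translation */
    if (fp == NULL)
        return 1;
    const wchar_t text[] = L"\u2620\r\n";
    fputc(0xFF, fp);                        /* UTF-16LE byte order mark */
    fputc(0xFE, fp);
    fwrite(text, sizeof(wchar_t), wcslen(text), fp);
    fclose(fp);
    return 0;
}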

M.M
  • You can use `wprintf(L"%s", L"☠")`, but only in Visual Studio, where you call [_setmode](https://stackoverflow.com/questions/2492077) first (a minimal sketch follows these comments). In other compilers you have to use `WriteConsoleW`. Characters like `'☠'` are not supported with the default console font, so you have to change the font too. That's doable in Visual Studio but it needs even more fiddling in other compilers. `fwprintf` will work if you open the file in binary mode (a BOM would be very useful). Though I prefer to save the file in UTF8. With Linux `printf("%s", "☠")` or `printf("%s", u8"☠")` will do, probably same with Mac. – Barmak Shemirani Apr 30 '18 at 05:40
  • [There's a UTF-8 locale in Windows since 2018](https://stackoverflow.com/a/63454192/995714) – phuclv Jun 09 '21 at 03:39
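
A minimal sketch of the _setmode approach mentioned in the comment above, assuming the Microsoft CRT (Visual Studio); after switching the stream to UTF-16 mode, only the wide printf family may be used on it:

#include <fcntl.h>
#include <io.h>
#include <stdio.h>
#include <wchar.h>

int main(void)
{
    _setmode(_fileno(stdout), _O_U16TEXT);   /* stdout now expects UTF-16 text */
    wprintf(L"%ls\n", L"\u2620");            /* the console font must contain the glyph */
    return 0;
}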
0

Code that works for Windows

(I use it)

Environment: W7/64 w/ ConEmu console, W10 Terminal or ConEmu, CP always set to 65001.

Compiler : gcc version 11.2.0 (MinGW-W64 x86_64-posix-seh, built by Brecht Sanders)

DOES NOT WORK in W7 default Windows console even w/ CP65001.

#include <stdio.h>
#include <stdlib.h>                     /* malloc */
#include <string.h>                     /* strlen... */
#include <locale.h>
#include <wchar.h>
//
int main(int argc, char *argv[])
{
  // .... code here
  printf("%s", "\u25BA");  /* right triangle */
  // ....
}

Result:

C:\Users\gm\C>gets John MARTHA william
►martha
C:\Users\gm\C>

As previously stated by others, locale does not seem to provide any help in making this work on any consoles I tried.

Other syntaxes (L"\u25BA", printf("%ls",...), ...) did NOT bring me the expected result with this gcc compiler under W7 or W10 on any console I tried.