45

I'm writing a cross-platform application in C++. All strings are UTF-8-encoded internally. Consider the following simplified code:

#include <string>
#include <iostream>

int main() {
    std::string test = u8"Greek: αβγδ; German: Übergrößenträger";
    std::cout << test;

    return 0;
}

On Unix systems, std::cout expects 8-bit strings to be UTF-8-encoded, so this code works fine.

On Windows, however, std::cout expects 8-bit strings to be in Latin-1 or a similar non-Unicode format (depending on the codepage). This leads to the following output:

Greek: ╬▒╬▓╬│╬┤; German: ├£bergr├Â├ƒentr├ñger

What can I do to make std::cout interpret 8-bit strings as UTF-8 on Windows?

This is what I tried:

#include <string>
#include <iostream>
#include <io.h>
#include <fcntl.h>

int main() {
    _setmode(_fileno(stdout), _O_U8TEXT);
    std::string test = u8"Greek: αβγδ; German: Übergrößenträger";
    std::cout << test;

    return 0;
}

I was hoping that _setmode would do the trick. However, this results in the following assertion error in the line that calls operator<<:

Microsoft Visual C++ Runtime Library

Debug Assertion Failed!

Program: d:\visual studio 2015\Projects\utf8test\Debug\utf8test.exe
File: minkernel\crts\ucrt\src\appcrt\stdio\fputc.cpp
Line: 47

Expression: ( (_Stream.is_string_backed()) || (fn = _fileno(_Stream.public_stream()), ((_textmode_safe(fn) == __crt_lowio_text_mode::ansi) && !_tm_unicode_safe(fn))))

For information on how your program can cause an assertion failure, see the Visual C++ documentation on asserts.

Daniel Wolf
  • 12,855
  • 13
  • 54
  • 80
  • Have you tried [`std::setlocale`](http://en.cppreference.com/w/cpp/locale/setlocale)? – txtechhelp Aug 08 '17 at 19:02
  • @txtechhelp: I just tried `std::setlocale(LC_ALL, "en_US.UTF-8");`. It had no effect whatsoever. – Daniel Wolf Aug 08 '17 at 19:09
  • 1
    Did you check that the source is compiled in the expected encoding? Generally, the safest way to write multinational source is to use UTF hex codes. A default Visual Studio project on Polish Windows assumes code page 1250 sources by default – Jacek Cz Aug 08 '17 at 19:50
  • You are best off forgetting UTF-8 on Windows, most of its APIs simply do not support UTF-8. Convert your UTF-8 `std::string` to a UTF-16 `std::wstring` (such as with `std::wstring_convert`) and use `std::wcout` instead. And make sure you are using a Unicode font in the console. – Remy Lebeau Aug 08 '17 at 21:24
  • @JacekCz I double checked, the output shown is consistent with UTF-8 bytes being displayed in Code Page 850. – Mark Ransom Aug 08 '17 at 21:53
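For reference, a minimal sketch of what Remy Lebeau suggests in the comment above (std::wstring_convert is deprecated since C++17, but the approach still compiles with MSVC):

#include <codecvt>
#include <fcntl.h>
#include <io.h>
#include <iostream>
#include <locale>
#include <string>

int main() {
    // Switch stdout to UTF-16 mode; the CRT then writes to the console via WriteConsoleW.
    _setmode(_fileno(stdout), _O_U16TEXT);
    std::string test = u8"Greek: αβγδ; German: Übergrößenträger";
    // Convert the UTF-8 std::string to a UTF-16 std::wstring and print it with wcout.
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> converter;
    std::wcout << converter.from_bytes(test);
}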

8 Answers

35

At last, I've got it working. This answer combines input from Miles Budnek, Paul, and mkluwe with some research of my own. First, let me start with code that will work on Windows 10. After that, I'll walk you through the code and explain why it won't work out of the box on Windows 7.

#include <string>
#include <iostream>
#include <Windows.h>
#include <cstdio>

int main() {
    // Set console code page to UTF-8 so the console knows how to interpret string data
    SetConsoleOutputCP(CP_UTF8);

    // Enable buffering to prevent VS from chopping up UTF-8 byte sequences
    setvbuf(stdout, nullptr, _IOFBF, 1000);

    std::string test = u8"Greek: αβγδ; German: Übergrößenträger";
    std::cout << test << std::endl;
}

The code starts by setting the code page, as suggested by Miles Budnek. This will tell the console to interpret the byte stream it receives as UTF-8, not as some variation of ANSI.

Next, there is a problem in the STL code that comes with Visual Studio. std::cout prints its data to a stream buffer of type std::basic_filebuf. When that buffer receives a string (via std::basic_streambuf::sputn()), it won't pass it on to the underlying file as a whole. Instead, it will pass each byte separately. As explained by mkluwe, if the console receives a UTF-8 byte sequence as individual bytes, it won't interpret them as a single code point. Instead, it will treat them as multiple characters. Each byte within a UTF-8 byte sequence is an invalid code point on its own, so you'll see �'s instead. There is a related bug report for Visual Studio, but it was closed as By Design. The workaround is to enable buffering for the stream. As an added bonus, that will give you better performance. However, you may now need to regularly flush the stream as I do with std::endl, or your output may not show.

Lastly, the Windows console supports both raster fonts and TrueType fonts. As pointed out by Paul, raster fonts will simply ignore the console's code page. So non-ASCII Unicode characters will only work if the console is set to a TrueType Font. Up until Windows 7, the default is a raster font, so the user will have to change it manually. Luckily, Windows 10 changes the default font to Consolas, so this part of the problem should solve itself with time.
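If you don't want to rely on your users switching the font by hand, you can try to select a TrueType font programmatically. This is only a sketch (it assumes a reasonably recent Windows SDK and that Consolas is installed); I haven't needed it myself:

#include <cwchar>
#include <Windows.h>

// Sketch: select a TrueType console font so the UTF-8 code page actually
// affects how extended characters are rendered (raster fonts ignore it).
void useTrueTypeConsoleFont() {
    CONSOLE_FONT_INFOEX info = {};
    info.cbSize = sizeof(info);
    info.dwFontSize.Y = 16;                              // character height; width 0 lets Windows choose
    info.FontFamily = FF_DONTCARE;
    info.FontWeight = FW_NORMAL;
    wcscpy_s(info.FaceName, LF_FACESIZE, L"Consolas");   // assumes Consolas is available
    SetCurrentConsoleFontEx(GetStdHandle(STD_OUTPUT_HANDLE), FALSE, &info);
}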

Daniel Wolf
  • 12,855
  • 13
  • 54
  • 80
  • Output buffering of `stdout` is not a solution. Once the buffer gets full, it is not guaranteed that it ends with a complete UTF-8 byte sequence and your output will still appear broken. – mkluwe Aug 11 '17 at 09:01
  • I know. A better solution would be to create a new class that acts as stream buffer. That shouldn't be hard, but I'm afraid to get some obscure detail wrong (synchronization etc.). With output buffering, I have to flush every once in a while to prevent the buffer running full. – Daniel Wolf Aug 11 '17 at 10:54
  • I added a stringbuf based example to my answer. – mkluwe Aug 11 '17 at 11:52
17

The problem is not std::cout but the Windows console. Using C stdio, you will get the ü with fputs("\xc3\xbc", stdout); after setting the UTF-8 code page (either with SetConsoleOutputCP or chcp) and selecting a font that supports Unicode in cmd's settings (Consolas should cover over 2000 characters, and there are registry hacks to add more capable fonts to cmd).
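A complete minimal example of this fputs variant might look like this (assuming the font has been switched as described):

#include <cstdio>
#include <Windows.h>

int main() {
    SetConsoleOutputCP(CP_UTF8);
    // The whole two-byte UTF-8 sequence reaches the console in a single call,
    // so it is decoded as one "ü".
    fputs("u with diaeresis: \xc3\xbc\n", stdout);
}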

If you output one byte after the other with putc('\xc3', stdout); putc('\xbc', stdout); you will get the double tofu, as the console interprets them separately as illegal characters. This is probably what the C++ streams do.

See UTF-8 output on Windows console for a lengthy discussion.

For my own project, I finally implemented a std::stringbuf doing the conversion to Windows-1252. If you really need full Unicode output, this will not really help you, however.
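Roughly, such a conversion buffer could look like the following sketch (not the exact code from my project; characters that don't exist in Windows-1252 simply end up as ?):

#include <cstdio>
#include <sstream>
#include <string>
#include <vector>
#include <Windows.h>

// Sketch: a stringbuf that converts UTF-8 to Windows-1252 on every flush.
class Cp1252Buf : public std::stringbuf {
public:
    int sync() override {
        std::string utf8 = str();
        // UTF-8 -> UTF-16
        int wideLen = MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), -1, nullptr, 0);
        std::vector<wchar_t> wide(wideLen);
        MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), -1, wide.data(), wideLen);
        // UTF-16 -> Windows-1252; unmappable characters become the default char ('?')
        int narrowLen = WideCharToMultiByte(1252, 0, wide.data(), -1, nullptr, 0, nullptr, nullptr);
        std::vector<char> narrow(narrowLen);
        WideCharToMultiByte(1252, 0, wide.data(), -1, narrow.data(), narrowLen, nullptr, nullptr);
        fputs(narrow.data(), stdout);
        str("");
        return 0;
    }
};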

An alternative approach would be replacing cout's streambuf, using fputs for the actual output:

#include <iostream>
#include <sstream>
#include <cstdio>

#include <Windows.h>

class MBuf: public std::stringbuf {
public:
    int sync() {
        fputs( str().c_str(), stdout );
        str( "" );
        return 0;
    }
};

int main() {
    SetConsoleOutputCP( CP_UTF8 );
    setvbuf( stdout, nullptr, _IONBF, 0 );
    MBuf buf;
    std::cout.rdbuf( &buf );
    std::cout << u8"Greek: αβγδ\n" << std::flush;
}

I turned off output buffering here to prevent it from interfering with unfinished UTF-8 byte sequences.

mkluwe
  • 3,823
  • 2
  • 28
  • 45
  • This seems to be part of the problem indeed. If I use `SetConsoleOutputCP(CP_UTF8);` as suggested by Miles **and** switch to a non-raster font as suggested by Paul **and** use `fputs` instead of `std::cout`, it works! -- Now I need to find out whether there's a way to get `std::cout` to behave correctly. – Daniel Wolf Aug 09 '17 at 15:35
  • I don't think there is a way. And `fputs` is not guaranteed to work either, see my double `putc` example. You _could_ try to change `cout`'s `streambuf` (see `rdbuf()`) with one understanding UTF-8 (keeping the characters together) and using `fputs`. – mkluwe Aug 09 '17 at 16:14
  • I found that this behavior can be fixed by enabling buffering; see [my answer](https://stackoverflow.com/a/45622802/52041). Thanks for pointing me in the right direction! – Daniel Wolf Aug 10 '17 at 20:25
  • Regarding your edit: I'm afraid it's not working for me. I'm getting "Greek: ╬▒╬▓╬│╬┤". – Daniel Wolf Aug 13 '17 at 18:42
  • Are you testing from within Visual Studio? I noticed that this only works starting the program directly from a cmd instance. – mkluwe Aug 13 '17 at 18:46
  • For me, this works neither from VS nor from the console. – Daniel Wolf Aug 14 '17 at 06:55
  • I'm sorry, I had to change `SetConsoleCP` to `SetConsoleOutputCP` to change the *output* codepage. Auto-complete induced typo. – mkluwe Aug 14 '17 at 08:40
  • It's working fine now. I'd like to accept your answer, but I feel that two things are missing from it. If you could mention the necessity for a TrueType console font and prefix your string with u8 (to make sure it works with default project settings), I'd accept your answer as complete. – Daniel Wolf Aug 15 '17 at 18:26
  • Done, including the C++11 touch with `u8`. – mkluwe Aug 15 '17 at 19:01
  • Thank you for your help! – Daniel Wolf Aug 15 '17 at 19:21
14

std::cout is doing exactly what it should: it's sending your UTF-8 encoded text along to the console, but your console will interpret those bytes using its current code page. You need to set your program's console to the UTF-8 code page:

#include <string>
#include <iostream>
#include <Windows.h>

int main() {
    std::string test = u8"Greek: αβγδ; German: Übergrößenträger";
    SetConsoleOutputCP(CP_UTF8);
    std::cout << test;
}

It would be great if Windows switched the default code page to UTF-8, but they likely can't due to backwards-compatibility concerns.

Miles Budnek
  • 28,216
  • 2
  • 35
  • 52
  • 2
    That doesn't seem to work for me. I built your code using Visual Studio and ran it from `cmd`. The output is exactly the same as without the call to `SetConsoleOutputCP`. – Daniel Wolf Aug 08 '17 at 19:42
  • 1
    Interesting, what version of Windows are you on, and what does the console say your current code page is after you run the program (click top left icon->properties->options)? It [prints just fine](https://i.stack.imgur.com/Q8eaT.png) for me on Windows 10. – Miles Budnek Aug 08 '17 at 19:56
  • @DanielWolf 1. What is your current console code page? Run `chcp` to find out. 2. Will it help if you run `chcp 65001` before running your program? 3. Check whether `SetConsoleOutputCP` returns nonzero and if it returns 0, call `GetLastError()` to get the error code (or add `@err,hr` to watches to see it in the debugger). – Paul Aug 08 '17 at 20:02
  • I'm on Windows 7 Ultimate SP1. The Options tab of the Properties dialog shows no code page information. `chcp` tells me I'm on code page 850 (DOS-Latin-1), both before and after running the program. `SetConsoleOutputCP` returns 1, i.e. success. – Daniel Wolf Aug 08 '17 at 20:12
  • 2
    Another thought: Remarks section for `SetConsoleOutputCP` says *However, if the current font is a raster font, SetConsoleOutputCP does not affect how extended characters are displayed.* What is your current console font? What if you try changing it to Lucida Console or Consolas? Also, have you tried running `chcp 65001` in before starting your program? I am not suggesting it as a solution to your problem, just wondering whether it changes anything. – Paul Aug 08 '17 at 20:23
  • Running `chcp 65001` before starting the program has no effect. I'm using a raster font, which is the default. When I change it to Consolas, all non-ASCII bytes are printed as question marks: "Greek: ��������; German: ��bergr����entr��ger". In this case, it doesn't matter whether I execute `chcp 65001` before or not. – Daniel Wolf Aug 08 '17 at 20:39
  • 1
    @MilesBudnek: I just tried it on my work PC (Windows 10 Pro, font: Consolas). It's not working here, either. I'm getting �'s regardless of whether I manually set the code page first. So what's different on your machine that makes it work? -- I'm using the standard `cmd.exe` console. Do you, too? Have you made any customizations to it? – Daniel Wolf Aug 09 '17 at 07:28
3

Set the console output encoding to UTF-8 using the following Windows API call:

SetConsoleOutputCP(65001);

Documentation for that function is available on Windows Dev Center.
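For example, a minimal self-contained program might look like this (the return-value check is optional, but the call can fail):

#include <iostream>
#include <string>
#include <Windows.h>

int main() {
    // 65001 is the UTF-8 code page (the CP_UTF8 constant).
    if (SetConsoleOutputCP(65001) == 0) {
        std::cerr << "SetConsoleOutputCP failed, error " << GetLastError() << "\n";
        return 1;
    }
    std::string test = u8"Greek: αβγδ; German: Übergrößenträger";
    std::cout << test << "\n";
}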

jfroy
  • 189
  • 1
  • 7
  • Seems like Miles Budnek beat you by a few minutes. :-) Unfortunately, as I commented below his answer, calling `SetConsoleOutputCP` doesn't seem to work. – Daniel Wolf Aug 08 '17 at 19:52
  • Ah, never mind. Could using the s suffix on the string literal be of any help? http://en.cppreference.com/w/cpp/string/basic_string/operator%22%22s – jfroy Aug 08 '17 at 20:57
  • I don't think so. At the end of the day, `u8"foo"s` is just syntactic sugar for `string(u8"foo")`. – Daniel Wolf Aug 08 '17 at 21:15
  • It does work fine for me, @DanielWolf, on Win7/64, in a plain CMD console. – Sz. Jul 10 '19 at 22:55
3

Forget everything you know about the Windows console and its Unicode/UTF-8 support (or rather lack of support). This is 2020 and it's a new world. This is not a direct answer to the question above, but rather an alternative that makes much more sense now, a new way that was not possible before.

Everybody's right, the root problem is the Windows console. But there's a new player in town, and it's Windows Terminal. Install and launch Windows Terminal. Use this program:

#include <iostream>
#include <windows.h>

int main()
{
    SetConsoleOutputCP(CP_UTF8); 
    // or have your user set the console codepage: `chcp 65001`
    
    std::cout << "\"u\" with two dots on top: \xc3\xbc\n";
    std::cout << "chinese glyph for \"world\": \xe5\x80\xbc\n";
    std::cout << "smiling emoji: \xf0\x9f\x98\x80\n";
    return 0;
}

This program sends UTF-8 through a plain cout.

The output:

(Screenshot: Unicode output in Windows Terminal)

The command chcp 65001 or SetConsoleOutputCP(CP_UTF8) is required for a cmd tab in Windows Terminal, but it looks like it is not needed in a PowerShell tab. Maybe PowerShell is UTF-8 by default?

Rooting out the core issue (the legacy cmd console) is now the best option in my opinion. Spread the word.

philtherobot
  • 101
  • 4
1

Since I started using the {fmt} library, all my encoding problems are gone.

A simple example of use:

#include <fmt/core.h>

int main() {
  fmt::print("Greek: αβγδ; German: Übergrößenträger\n");
}
woocom
  • 29
  • 1
  • 2
0

Some Unicode characters can't be displayed properly in a console window even if you've changed the code page, because your font does not support them. For example, you need to install a font that supports Arabic if you want to show Arabic characters.

This Stack Overflow page should be helpful.

By the way, the Unicode version of console APIs (such as WriteConsoleW) won't come to the rescue, because they internally call their corresponding Windows code page version APIs (such as WriteConsoleA). Neither will std::wcout help, because it will convert wchar_t string to char string internally.

It seems that the Windows console window doesn't support Unicode well; I suggest you use MessageBox instead.
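For example (a sketch; the helper name show_utf8_message is mine), a UTF-8 string can be converted to UTF-16 with MultiByteToWideChar and displayed with MessageBoxW:

#include <string>
#include <vector>
#include <Windows.h>

// Sketch: display a UTF-8 string in a message box by converting it to UTF-16 first.
void show_utf8_message(const std::string& utf8) {
    int wideLen = MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), -1, nullptr, 0);
    std::vector<wchar_t> wide(wideLen);
    MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), -1, wide.data(), wideLen);
    MessageBoxW(nullptr, wide.data(), L"UTF-8 test", MB_OK);
}

int main() {
    show_utf8_message(u8"Greek: αβγδ; German: Übergrößenträger");
}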

liuqx
  • 76
  • 5
0

I had the same problem and wrote a very small library called libpu8 for this: https://github.com/jofeu/libpu8

For Windows consoles, it replaces the streambufs of cin, cout and cerr so that they accept and produce UTF-8 at the front end and talk to the console in UTF-16. On non-Windows operating systems, or if cin, cout, cerr are attached to files/pipes rather than consoles, it does nothing. It also translates the arguments of the C++ main() function to UTF-8 on Windows.

Usage Example:

#include <libpu8.h>
#include <iostream>
#include <string>
#include <fstream>
#include <windows.h>

// argv are utf-8 strings when you use main_utf8 instead of main.
// main_utf8 is a macro. On Windows, it expands to a wmain that calls
// main_utf8 with converted strings.
int main_utf8(int argc, char** argv)
{
        // this will also work on a non-Windows OS that supports utf-8 natively
        std::ofstream f(u8widen(argv[1]));
        if (!f)
        {
                // On Windows, use the "W" functions of the windows-api together
                // with u8widen and u8narrow
                MessageBoxW(0,
                        u8widen(std::string("Failed to open file ") + argv[1]).c_str(), 0, 0);
                return 1;
        }
        std::string line;
        // line will be utf-8 encoded regardless of whether cin is attached to a
        // console, or a utf-8 file or pipe.
        std::getline(std::cin, line);
        // line will be displayed correctly on a console, and will be utf-8 if
        // cout is attached to a file or pipe.
        std::cout << "You said: " << line;
        return 0;
}
umbert
  • 51
  • 5
  • There's no need for this in recent Windows 10 versions that use the new Console infrastructure and Windows Terminal. In older versions what's really needed is to set the console to the UTF8 codepage *and* select a Unicode font. – Panagiotis Kanavos Nov 17 '20 at 08:32
  • Besides, C++ supports UTF16 strings already through `u16string` and `char16_t` and can write them to streams. If you wanted to write something that works with Unicode on any platform, no matter a user's unfortunate LC_ALL choices, the best option would be to convert the input to `u16string` and work with UTF16 throughout the application. That's how Windows and most programming languages (Java, Javascript, Python 3, Go, C#, F# etc) work - strings are Unicode (even though some use UTF8 instead of UTF16) – Panagiotis Kanavos Nov 17 '20 at 08:36
  • The question was about UTF-8. And in contrast to UTF-16, this enables you to write portable code, since UTF-16 is pretty uncommon on linux/unix. Indeed, as of Windows 10 version 1903, you can select UTF-8 as the ActiveCodePage (see https://learn.microsoft.com/de-de/windows/uwp/design/globalizing/use-utf8-code-page), but that does not work if you also target older versions. – umbert Nov 17 '20 at 09:49
  • It's the other way around. Windows has been used in multilingual environments for 20 years, but until recently Linux/Unix was pretty uncommon in applications and countries that need Unicode, leading to problems when more than one code page is used per machine or when an end user needs to handle files from a different locale. You'll find a *lot* of SO questions from data scientists using R who have trouble loading Chinese or Cyrillic data on their Latin1 machines – Panagiotis Kanavos Nov 17 '20 at 10:01
  • To put it another way, I'd seen more Unicode questions coming from Linux/MacOS users in the last few years than in the past 20. What made sense for a server (handle text using the single encoding in LC_ALL) doesn't work well when two or more encodings are involved – Panagiotis Kanavos Nov 17 '20 at 10:02
  • PS: the most popular platform right now is Android. And most applications there are written in Java, which uses UTF16 just like Windows. Windows comes next which is UTF16. On iOS, Swift 5 switched to UTF8 *only* unless you use opaque strings, so there are no ambiguity or assumptions there either – Panagiotis Kanavos Nov 17 '20 at 10:08