
I have Microsoft Visual Studio 2010 on Windows 7 64-bit. (In the project properties "Character Set" is set to "Not Set"; however, every setting leads to the same output.)

Source code:

  using namespace std;
  char const charTest[] = "árvíztűrő tükörfúrógép ÁRVÍZTŰRŐ TÜKÖRFÚRÓGÉP\n";
  cout << charTest;
  printf(charTest);
  if(set_codepage()) // SetConsoleOutputCP(CP_UTF8); // *1
    cerr << "DEBUG: set_codepage(): OK" << endl;
  else
    cerr << "DEBUG: set_codepage(): FAIL" << endl;
  cout << charTest;
  printf(charTest);

*1: Including windows.h messes up things, so I'm including it from a separate cpp.

The compiled binary contains the string as a correct UTF-8 byte sequence. If I set the console to UTF-8 with `chcp 65001` and issue `type main.cpp`, the string displays correctly.

Test (console set to use Lucida Console font):

D:\dev\user\geometry\Debug>chcp
Active code page: 852

D:\dev\user\geometry\Debug>listProcessing.exe
├írv├şzt┼▒r┼Ĺ t├╝k├Ârf├║r├│g├ęp ├üRV├ŹZT┼░R┼É T├ťK├ľRF├ÜR├ôG├ëP
├írv├şzt┼▒r┼Ĺ t├╝k├Ârf├║r├│g├ęp ├üRV├ŹZT┼░R┼É T├ťK├ľRF├ÜR├ôG├ëP
DEBUG: set_codepage(): OK
��rv��zt��r�� t��k��rf��r��g��p ��RV��ZT��R�� T��K��RF��R��G��P
árvíztűrő tükörfúrógép ÁRVÍZTŰRŐ TÜKÖRFÚRÓGÉP

What is the explanation behind that? Can I somehow ask `cout` to work as `printf` does?

ATTACHMENT

Many say that the Windows console does not support UTF-8 characters at all. I'm a Hungarian guy in Hungary; my Windows is set to English (except date formats, which are set to Hungarian) and Cyrillic letters are still displayed correctly alongside Hungarian letters:

(Screenshot: Hungarian and Cyrillic letters on console at the same time)

(My default console codepage is CP852)

Deanie
Notinlist
  • possible duplicate of [How do I write a std::codecvt facet?](http://stackoverflow.com/questions/2971386/how-do-i-write-a-stdcodecvt-facet) – Hans Passant Sep 22 '12 at 15:53
  • @HansPassant I don't believe it's the same. It seems related, but it does not explicitly explain the difference between `cout` and `printf`. And also, should I write a `codecvt` facet to tell `cout` not to convert anything? There should be an easier way, I hope... – Notinlist Sep 24 '12 at 08:37

4 Answers


The difference here is in how the C++ runtime and the C library handle the system locale.

To achieve the same result with std::cout you can try the std::ios::imbue method together with std::locale.

But the main issue with UTF-8 and C++ is described here:

C++03 offers two kinds of string literals. The first kind, contained within double quotes, produces a null-terminated array of type const char. The second kind, defined as L"", produces a null-terminated array of type const wchar_t, where wchar_t is a wide character. Neither literal type offers support for string literals with UTF-8, UTF-16, or any other kind of Unicode encodings.

So in any case it is all implementation specific and thus non-portable, because none of the standard C++ output streams understands UTF-8.

Sergei Nikulov
  • What encoding the streams support is implementation defined. On my Linux machine a default iostream does work fine with utf8. Maybe there is some setting or some API call he can use on windows to get the same results. – Sqeaky Sep 29 '12 at 23:29
  • I can't wait until many C++11 implementations get those proposed string literals like u8, U, and u. I work with an international product and it would make our lives so much easier. – stinky472 Sep 30 '12 at 22:22
  • You may be able to find a built in locale that handles UTF-8 as seen in the example at http://en.cppreference.com/w/cpp/locale/codecvt or perhaps you can find a way to use `codecvt_byname`: http://en.cppreference.com/w/cpp/locale/codecvt_byname – Mark Ransom Oct 01 '12 at 23:07

As far as I understand, the command line does kind of work with UTF-8, given:

  1. A font capable of displaying UTF-8 characters
  2. The correct code page set in the command line (`chcp 65001`). I'm not sure this code page supports the full UTF-8 range, but it seems to be the best available.

Check it out here and here.

[EDIT] 65001 actually is UTF-8; I checked in PowerShell:

PS C:\Users\forcewill> chcp 65001
Active code page: 65001
PS C:\Users\forcewill>  [Console]::OutputEncoding


BodyName          : utf-8
EncodingName      : Unicode (UTF-8)
HeaderName        : utf-8
WebName           : utf-8
WindowsCodePage   : 1200
IsBrowserDisplay  : True
IsBrowserSave     : True
IsMailNewsDisplay : True
IsMailNewsSave    : True
IsSingleByte      : False
EncoderFallback   : System.Text.EncoderReplacementFallback
DecoderFallback   : System.Text.DecoderReplacementFallback
IsReadOnly        : True
CodePage          : 65001

You can use PowerShell; it's much more powerful than the old cmd.exe.

Edit: About using cout, if we're talking about Visual Studio, the correct answer is here; a more thorough explanation of the best practices within Visual Studio can be found here.

forcewill
  • Thank you for supporting me in this subtopic, but the main question is about using `cout` for displaying UTF-8 sequences. – Notinlist Oct 02 '12 at 08:24
  • Actually the question is also related to Visual Studio, so I have updated my answer to include the topic: in Visual Studio you should include windows.h, define the preprocessor macro UNICODE and use the L prefix to declare static strings; it is explained in the last link I have now supplied in my answer. – forcewill Oct 02 '12 at 20:58
  • Something moves, but not smooth yet. I will resume to it tomorrow. – Notinlist Oct 02 '12 at 21:11

On Windows, single-byte strings are usually interpreted as ASCII or some 256-character codepage. That means you won't get real Unicode support.

The short answer is: use wide strings (e.g. L"árvíztűr..." - notice the L) and write to wcout instead of cout. Windows usually interprets wide strings (2 bytes per character on Windows) as UTF-16 (or at least a close variant), so it will work as intended. On Windows, always use wide strings to avoid encoding issues.

AshleysBrain
  • Isn't there a problem with wcout, which internally converts Unicode to CP_ACP, and then back to Unicode, so that wcout does not in fact support Unicode? – Dialecticus Sep 22 '12 at 16:35
  • 3
    It's the Windows console output that fails to work with UTF-8 (it's not a valid codepage for the console itself). The C++ layer on top of it is just failing to do the smart thing. – rubenvb Sep 24 '12 at 09:42
  • @rubenvb Wrong! As I said, `main.cpp` is an UTF-8 file and I can `type` it onto the screen correctly. The console is perfectly aware of UTF-8 and handling it correctly after issuing a `chcp 65001` command. I don't understand the upvotes on your comment. – Notinlist Sep 28 '12 at 11:59
  • @Notinlist The encoding of `main.cpp` has absolutely no influence on what the console shows. I distinctly remember `CP_UTF8` being usable only for `MultiByteToWideChar` and `WideCharToMultiByte`, but can't find references other than forum posts saying the same thing. I tried and indeed, changing the codepage works (once you set a proper font of course, which is [easy on Vista+, but needs undocumented functions on XP](http://social.msdn.microsoft.com/Forums/fi-FI/vclanguage/thread/2bffea84-e5a0-4fde-bd24-53cbcf1e3025). – rubenvb Sep 28 '12 at 12:42
  • The wide support that visual studio supports when compiling with the UNICODE option and when using the wide features of c++ are a 16 byte encoding called UCS2. I think it is binary compatible with UTF16 characters tat require only the first 15 bits. It is fixed length so it can't represent anything requiring those extra characters without some external information(system settings, locales or something). Also see: http://www.joelonsoftware.com/articles/Unicode.html – Sqeaky Sep 29 '12 at 23:31
  • @Sqeaky, Windows itself supports the full UTF-16 encoding, although support has changed over the years - see http://blogs.msdn.com/b/michkap/archive/2005/05/11/416552.aspx – Mark Ransom Oct 01 '12 at 22:45

First of all, the Windows console does not support UTF-8 (code page 65001; to test this, open in the console a UTF-8 encoded file saved with Notepad and you will see junk data). So in order to check your output you should redirect it to a file or something like that and inspect the result there (`myapp > test.txt`).

Second, in C/C++ a char[] is a sequence of bytes that the programmer can interpret any way they want, but UTF-8 is a specific encoding of the Unicode character set. So there is no way (before C++11) to write a sequence of characters and have them encoded as UTF-8: I can say char p[3] = "اب", but if the compiler wanted to encode this in UTF-8 it would need 5 bytes, not 3. So you should use something that understands UTF-8.

I suggest using boost::locale::conv::utf_to_utf with wide string constants, for example:

std::string sUTF8 = boost::locale::conv::utf_to_utf<char>(L"árvíztűrő tükörfúrógép ÁRVÍZTŰRŐ TÜKÖRFÚRÓGÉP\n");
std::cout << sUTF8; // or printf( "%s", sUTF8.c_str() );

This will ensure that you have a UTF-8 string, but again, don't check it in the console, since it doesn't understand UTF-8 at all!

BigBoss
  • You are wrong. Set font to `Lucida Console`, issue a `chcp 65001` and see that UTF-8 characters DO appear correctly (only the byte order mark appears as an empty rectangle). I tested it again at this moment. These steps are covered in the question. – Notinlist Oct 01 '12 at 12:58
  • I will try this boost function later at home. Thanks for that hint. – Notinlist Oct 01 '12 at 13:01
  • I have done what you say, set the font to `Lucida Console` and issued a `chcp 65001`, but it only shows rectangles. If you can see them, it is possibly because the characters you used in your Unicode file are all from CP_ACP (the default code page of the system, which can be changed through Control Panel). Use characters from other languages like Japanese and you will see that the console can't show them – BigBoss Oct 01 '12 at 13:24
  • You are not the first who states these things, so I dug up evidence against it and presented in the "ATTACHMENT" section of the question. Please observe and comment. – Notinlist Oct 01 '12 at 14:11
  • I already said it: your language is very close to Latin, but if you use a language like Chinese or Japanese, or a language from the Middle East like Arabic, you will see that the console can't show most Unicode characters, and this is why Microsoft released the ISE for PowerShell. In any case, if you can see what you print in the console, very good, check the result from the console; but if not, it is not always your fault on the programming side - output the result into a file and then check it from there. :-) – BigBoss Oct 01 '12 at 15:38
  • Hungarian and Cyrillic characters are not on the same Latin codepage. I placed them in ONE file, saved it as UTF-8, `type`d it onto the screen, and they appeared correctly. Lucida Console does not have far eastern characters (check with `charmap`). It would be surprising if the console could display Chinese characters correctly. Anyways, if it somehow occurs that I only can display Latin1..Latin15 characters with UTF-8 notation, I will still be happy. If I will know how to do it with `cout`. – Notinlist Oct 02 '12 at 08:22
  • If you can output your desired characters to the console and it can show them correctly, bingo! you are there. But the problem was (as I understand it) that a UTF-8 string can't be written as `char test[] = "some unicode string"` with the expectation that `test` contains UTF-8 data, and I suggested `boost::locale::conv::utf_to_utf` for that. I'm from the Middle East, and for many years the console has not supported my character set. So if you are happy with what the console supports, I am very happy for you. And +1 for Notepad++, it's my favorite – BigBoss Oct 02 '12 at 14:58
  • I have boost 1.47 and I don't see locale.hpp. Maybe it came into the game later. (Why don't they place 'since' and 'until' information into documentation?) So I cannot try your advice. +1 – Notinlist Oct 03 '12 at 07:57
  • `boost::locale` is a wonderful library, originally implemented for the `CPPCMS` web framework and added to boost in version `1.48.0`, but `utf_to_utf` has hardly any dependencies on other boost libraries, so you can extract its code from the online sources and use it even without boost::locale – BigBoss Oct 03 '12 at 09:01