C++ u8 literal and BOM (Byte Order Mark)

Question

I decided to write a simple example:

#include <iostream>

int main()
{
    std::cout << u8"это строка6" << std::endl;
    return 0;
}

I executed the following command in the console: chcp 65001

Program output:

��то строка6

Why is the first character not displayed correctly? I think that code page 65001 uses a BOM and reads the first symbol as the BOM. Is this true?

The title mentions a BOM, the question doesn't. Where is the hidden link? - IInspectable, Jul 07 '15 at 08:11
Yes, my console supports Unicode characters. Also, only the first symbol is displayed incorrectly. - andrei.aliashkevich, Jul 07 '15 at 08:20
I mean that each string that can be decoded with code page 65001 must contain a BOM as its first symbol. - andrei.aliashkevich, Jul 07 '15 at 09:03

score 2 · Answer 1 · edited May 23 '17 at 10:26

Well, the entire standard I/O library is dodgy with that code page. Here's another test program (\xe2\x86\x92 is the arrow → encoded in UTF-8):

#include <stdio.h>

int main(void)
{
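    /* "→ a → b\n" spelled out as UTF-8 bytes */
    char s[] = "\xe2\x86\x92 a \xe2\x86\x92 b\n";
    /* byte length without the terminating NUL */
    int l = (int) sizeof(s) - 1;
    /* element size 1, so the return value should be the byte count */
    int wr = (int) fwrite(s, 1, l, stdout);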
    printf("%d/%d written\n", wr, l);
    return 0;
}

And its output:

��� a → b
10/12 written

Note that the first character is again replaced by � characters (the arrow is 3 bytes in UTF-8), and the fwrite call returns the number of characters written to the console rather than the number of bytes. This is a violation of the C standard: fwrite should return the number of elements written, which is the number of bytes when the element size is 1. It will break every program that uses fwrite or related functions correctly (for instance, try to print "☺☺☺☺☺☺☺☺☺☺☺☺" with Python 3.4).
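To illustrate why the short count matters, here is a minimal sketch of the kind of check a correct program performs after fwrite. The helper name write_all is hypothetical and not part of any library. With the 10/12 result shown above, a check like this would report a failure.

```cpp
#include <cstddef>
#include <cstdio>

// Hypothetical helper: returns true only if fwrite reports that every
// byte was written. The element size is 1, so the return value is
// expected to equal the requested byte count.
bool write_all(std::FILE* stream, const char* data, std::size_t size)
{
    std::size_t written = std::fwrite(data, 1, size, stream);
    return written == size;
}
```

A caller would typically treat a false result as an I/O error, which is how a character count returned in place of a byte count turns into a spurious failure.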

So your only options to reliably output Unicode text are Windows-specific (unless these issues are fixed in the latest version of MSVC):

Use wide output functions, as described in "Output unicode strings in Windows console app".
Use WriteConsoleW (the wide version). Make sure you check whether the standard output or error handle is actually a console before calling it.
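A minimal sketch of the second option, under some assumptions: write_utf8 is a hypothetical helper name, the input is valid UTF-8, and the string is short enough to be written in a single call. It uses GetConsoleMode to check whether standard output is really a console, converts the text to UTF-16 with MultiByteToWideChar, and passes it to WriteConsoleW. When the output is redirected, or on a non-Windows build, it falls back to writing the raw bytes.

```cpp
#include <cstdio>
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: writes a UTF-8 string to standard output.
// Returns true if the whole string was written.
bool write_utf8(const std::string& utf8)
{
#ifdef _WIN32
    HANDLE out = GetStdHandle(STD_OUTPUT_HANDLE);
    DWORD mode = 0;
    // GetConsoleMode succeeds only for a real console handle,
    // not for a file or a pipe.
    if (GetConsoleMode(out, &mode)) {
        if (utf8.empty())
            return true;
        const int len = static_cast<int>(utf8.size());
        // First call: ask how many UTF-16 code units are needed.
        const int wlen = MultiByteToWideChar(CP_UTF8, 0, utf8.data(), len,
                                             nullptr, 0);
        if (wlen <= 0)
            return false;
        std::wstring wide(static_cast<std::size_t>(wlen), L'\0');
        // Second call: perform the conversion into the buffer.
        MultiByteToWideChar(CP_UTF8, 0, utf8.data(), len, &wide[0], wlen);
        DWORD written = 0;
        // Bypass the C runtime and hand UTF-16 text to the console.
        return WriteConsoleW(out, wide.data(), static_cast<DWORD>(wlen),
                             &written, nullptr)
               && written == static_cast<DWORD>(wlen);
    }
#endif
    // Redirected output or non-Windows platform: write the bytes unchanged.
    return std::fwrite(utf8.data(), 1, utf8.size(), stdout) == utf8.size();
}
```

A call such as write_utf8("\xe2\x86\x92 a \xe2\x86\x92 b\n") does not depend on the active code page when the output goes to a console, so chcp 65001 should not be needed for that path.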
