
I was experimenting with UTF-8 and Qt and ran into some odd behavior, so I investigated. I wrote a simple program that prints the bytes of const char[] literals:

#include <cstdio>

const char* koshka = "кошка";
const char* utf8_koshka = u8"кошка";

void printhex(const char* str)
{
    for (; *str; ++str)
    {
        printf("%02X ", *str & 0xFF); // mask off sign extension: char may be signed
    }
    puts("");
}

int main(int argc, char *argv[])
{
    printhex(koshka);
    printhex(utf8_koshka);

    return 0;
}

If we save the file as UTF-8 with a BOM and build and run it from Visual Studio 2015, this is printed:

3F 3F 3F 3F 3F
D0 BA D0 BE D1 88 D0 BA D0 B0

While I don't really understand where the first string came from, the second is exactly what it should be, according to this UTF-8 encoding table.

If the exact same code is saved as UTF-8 without BOM, this is the output:

D0 BA D0 BE D1 88 D0 BA D0 B0 
C3 90 C2 BA C3 90 C2 BE C3 91 CB 86 C3 90 C2 BA C3 90 C2 B0

So saving without a BOM causes the unprefixed const char[] literal to be stored in the binary as UTF-8, but for some reason it breaks the u8 prefix.

If, however, we force the execution charset using #pragma execution_character_set("utf-8"), both strings are printed as D0 BA D0 BE D1 88 D0 BA D0 B0 in both cases (UTF-8 with and without BOM).

I've used Notepad++ to convert between the encodings.

What is going on?


EDIT:

Alan's answer explains the cause of this behavior, but I'd like to add a word of warning. I ran into this issue while using Qt Creator to develop a Qt 5.5.1 application. In 5.5.1, the QString(const char*) constructor assumes the given string is encoded as UTF-8, and so ends up calling QString::fromUtf8 to construct the object. However, Qt Creator (by default) saves every file as UTF-8 without a BOM; this causes MSVC to misinterpret the source input as MBCS, exactly what happened in this case. So under the default settings, the following will work:

QMessageBox::information(0, "test", "кошка");

and this will fail (mojibake):

QMessageBox::information(0, "test", u8"кошка");

A solution is to enable the BOM in Tools -> Options -> Text Editor. Note that this only applies to MSVC 2015 (or rather 14.0); older versions have little or no C++11 support, and the u8 prefix simply doesn't exist there, so if you're working with Qt on an older version, your best bet is to rely on the compiler getting confused by the lack of a BOM.

user4520
  • 3F is '?'. Which makes some sense - if the execution set is not UTF-8 then the characters are probably not representable, and ? is a common fallback character. – Alan Stokes Nov 07 '15 at 14:15
  • Have you examined the encoding of the source file for UTF-8 with and without BOM, to verify that nothing unusual is happening there? – Alan Stokes Nov 07 '15 at 14:16
  • 1
    C3 90 is the UTF-8 encoding for U+D0, and C2 BA is the UTF-8 encoding for U+BA. So in the UTF-8 without BOM case it looks like the data has been UTF-8 encoded twice. Which is weird. – Alan Stokes Nov 07 '15 at 14:20
  • @AlanStokes That was my first thought, to be honest. But this definitely shouldn't happen. Unless MS assumed that `cl.exe` would never have to deal with a BOM-less UTF8 document (VS is very aggressive in enforcing this, every time you save a document it will convert it to BOM UTF8 if it's in any other format). – user4520 Nov 07 '15 at 14:27

1 Answer


The compiler doesn't know what the encoding of the file is. It attempts to guess by looking at a prefix of the input. If it sees a UTF-8 encoded BOM then it assumes it is dealing with UTF-8. In the absence of that, and of any obvious UTF-16 patterns, it falls back to the system's default ANSI code page (Windows-1252 on an English-locale system).

Without the BOM the compiler fails to determine your input is UTF-8 encoded and so assumes it isn't.

It then sees each byte of the UTF-8 encoding as a single character; for the simple literal it is copied across verbatim, and for the u8 string it is encoded as UTF-8, giving the double encoding you see.

The only solution seems to be to force the BOM; alternatively, use UTF-16, which is what the Windows platform really prefers.

See also Specification of source charset encoding in MSVC++, like gcc "-finput-charset=CharSet".

Alan Stokes
  • That explains it. By the way, any idea why `#pragma execution_character_set("utf-8")` fixes the problem in both cases? It tells the compiler what encoding should be used in the binary, it has nothing to do with input interpretation as far as I understand. – user4520 Nov 07 '15 at 15:01
  • No idea, to be honest. The execution character set is actually used in the compilation process (see phase 5 in http://en.cppreference.com/w/cpp/language/translation_phases). That pragma no longer seems to be supported and I'm not clear what its semantics were, or were supposed to be. – Alan Stokes Nov 07 '15 at 16:23
  • `The compiler doesn't know what the encoding of the file is.` Why pollute our files with invisible blobs? Why not offer a command line switch that forces the expected encoding across all sources? :( – rr- Dec 19 '15 at 14:55
  • @rr I agree that would be the obvious thing, and is what other compilers do. – Alan Stokes Dec 19 '15 at 14:58
  • 1
    Without BOM nor u8 prefix, [VC will assume UTF-8 if your system locale is English](https://raymai97.github.io/myblog/msvc/2017/05/04/msvc-support-utf-8-string-literal-since-vc6.html). If your system locale is something else like Japanese, then it will assume Shift-JIS. – raymai97 May 04 '17 at 17:16
  • @raymai97 Sounds like something changed (for the better). Your link 404s for me, sadly. – Alan Stokes May 06 '17 at 16:48
  • @AlanStokes Oops, I forgot to update the link. [Try this](https://raymai97.github.io/myblog/msvc-support-utf8-string-literal-since-vc6). – raymai97 May 07 '17 at 00:01