I was experimenting with UTF-8 and Qt and encountered a weird issue, so I investigated. I have created a simple program that prints the bytes in const char[] literals:
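#include <cstdio>

const char* koshka = "кошка";         // plain narrow literal
const char* utf8_koshka = u8"кошка";  // explicitly UTF-8 literal

// Print every byte of a NUL-terminated string as two hex digits.
void printhex(const char* str)
{
    for (; *str; ++str)
    {
        // Mask to one byte so negative chars are not sign-extended.
        printf("%02X ", *str & 0xFF);
    }
    puts("");
}

int main(int argc, char *argv[])
{
    printhex(koshka);
    printhex(utf8_koshka);
    return 0;
}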
If we save the file as UTF-8 with BOM, then run it from Visual Studio 2015, this will be printed:
3F 3F 3F 3F 3F
D0 BA D0 BE D1 88 D0 BA D0 B0
While I don't really understand where the first line came from (3F is the ASCII code for '?', so it is five question marks), the second is exactly what it should be, according to this UTF-8 encoding table.
If the exact same code is saved as UTF-8 without BOM, this is the output:
D0 BA D0 BE D1 88 D0 BA D0 B0
C3 90 C2 BA C3 90 C2 BE C3 91 CB 86 C3 90 C2 BA C3 90 C2 B0
So while omitting the BOM causes the unprefixed const char[] literal to be saved in the binary as UTF-8, it breaks the u8 prefix for some reason.
If, however, we force the execution charset using #pragma execution_character_set("utf-8"), both strings are printed as D0 BA D0 BE D1 88 D0 BA D0 B0 in both cases (UTF-8 with and without BOM).
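For reference, this is roughly where the pragma sits in the test program. It is only a sketch: the pragma is MSVC-specific, so here it is guarded with _MSC_VER (an addition for portability that the original program did not have), and hexBytes is a hypothetical variant of printhex that returns the dump as a string instead of printing it. Only the unprefixed literal is shown.

```cpp
#include <cstdio>
#include <string>

// MSVC-only pragma; the guard (an assumption of this sketch) lets other
// compilers skip it instead of warning about an unknown pragma.
#ifdef _MSC_VER
#pragma execution_character_set("utf-8")
#endif

// Must appear after the pragma so the pragma applies to it.
const char* koshka = "кошка";

// Hypothetical helper: same idea as printhex, but returns the hex dump.
std::string hexBytes(const char* str)
{
    std::string out;
    char buf[4];
    for (; *str; ++str)
    {
        // Convert via unsigned char so negative chars are not sign-extended.
        std::snprintf(buf, sizeof buf, "%02X",
                      static_cast<unsigned>(static_cast<unsigned char>(*str)));
        if (!out.empty())
            out += ' ';
        out += buf;
    }
    return out;
}
```

Calling hexBytes(koshka) in place of printhex(koshka) gives the same dump without the trailing space.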
I've used Notepad++ to convert between the encodings.
What is going on?
EDIT:
Alan's answer explains the cause of this behavior, but I'd like to add a word of warning. I ran into this issue while using Qt Creator to develop a Qt 5.5.1 application. In 5.5.1, the QString(const char*) constructor assumes the given string is encoded as UTF-8 and ends up calling QString::fromUtf8 to construct the object. However, Qt Creator (by default) saves every file as UTF-8 without BOM, which causes MSVC to misinterpret the source input as MBCS; that is exactly what happened in this case. So, under the default settings, the following will work:
QMessageBox::information(0, "test", "кошка");
and this will fail (mojibake):
QMessageBox::information(0, "test", u8"кошка");
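The mojibake appears to be the same double encoding seen in the second output line of the BOM-less build: each UTF-8 byte is taken as one character of the system code page and then encoded as UTF-8 again. Here is a minimal sketch that reproduces it by hand, assuming the code page is Windows-1252 (my assumption, chosen because it matches the bytes above). misreadAsCp1252 and appendUtf8 are hypothetical helper names, not part of any library.

```cpp
#include <cstdint>
#include <string>

// Windows-1252 code points for bytes 0x80..0x9F. Every other byte maps to
// the code point with the same value. Unassigned slots keep the byte value.
static const std::uint16_t kCp1252High[32] = {
    0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
    0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178
};

// Append a code point below U+10000 to out, encoded as UTF-8.
static void appendUtf8(std::string& out, std::uint32_t cp)
{
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: treat every input byte as one Windows-1252
// character, then encode the resulting characters as UTF-8.
std::string misreadAsCp1252(const std::string& bytes)
{
    std::string out;
    for (unsigned char b : bytes) {
        std::uint32_t cp = (b >= 0x80 && b <= 0x9F) ? kCp1252High[b - 0x80] : b;
        appendUtf8(out, cp);
    }
    return out;
}
```

Feeding it the ten bytes D0 BA D0 BE D1 88 D0 BA D0 B0 yields C3 90 C2 BA C3 90 C2 BE C3 91 CB 86 C3 90 C2 BA C3 90 C2 B0, the same sequence the u8 literal produced without the BOM.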
A solution would be to enable the BOM in Tools -> Options -> Text Editor. Note that this only applies to MSVC 2015 (or, more precisely, version 14.0); older versions have little or no C++11 support, and u8 simply doesn't exist there, so if you're working with Qt on an older version, your best bet is to rely on the compiler getting confused by the lack of the BOM.
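If relying on a confused compiler feels too fragile, one alternative that should not depend on the file encoding at all is to spell the UTF-8 bytes out with hex escapes, since an escape sets the byte value directly. This is only a sketch, and koshka_escaped is a hypothetical name:

```cpp
// "кошка" written as its UTF-8 bytes. Hex escapes give the byte values
// directly, so neither the source charset nor the BOM affects them.
const char koshka_escaped[] = "\xD0\xBA\xD0\xBE\xD1\x88\xD0\xBA\xD0\xB0";

// Ten bytes of text plus the terminating NUL.
static_assert(sizeof(koshka_escaped) == 11, "unexpected literal size");
```

The array can be passed to printhex or to QString::fromUtf8; the obvious cost is that the text is no longer readable in the source.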