Console can be set to display UTF-8 chars: @vladasimovic answers SetConsoleOutputCP(CP_UTF8)
can be used for that. Alternatively, you can prepare your console by DOS command chcp 65001
or by system call system("chcp 65001 > nul")
in the main program. Don't forget to save the source code in UTF-8 as well.
To check the UTF-8 support, run
#include <stdio.h>
#include <windows.h>
BOOL CALLBACK showCPs(LPTSTR cp) {
puts(cp);
return true;
}
int main() {
EnumSystemCodePages(showCPs,CP_SUPPORTED);
}
65001
should appear in the list.
Windows console uses OEM codepages by default and most default raster fonts support only national characters. Windows XP and newer also supports TrueType fonts, which should display missing chars (@Devenec suggests Lucida Console in his answer).
Why printf fails
As @bames53 points in his answer, Windows console is not a stream device, you need to write all bytes of multibyte character. Sometimes printf
messes the job, putting the bytes to output buffer one by one. Try use sprintf
and then puts
the result, or force to fflush only accumulated output buffer.
If everything fails
Note the UTF-8 format: one character is displayed as 1-5 bytes. Use this function to shift to next character in the string:
const char* ucshift(const char* str, int len=1) {
for(int i=0; i<len; ++i) {
if(*str==0) return str;
if(*str<0) {
unsigned char c = *str;
while((c<<=1)&128) ++str;
}
++str;
}
return str;
}
...and this function to transform the bytes into unicode number:
int ucchar(const char* str) {
if(!(*str&128)) return *str;
unsigned char c = *str, bytes = 0;
while((c<<=1)&128) ++bytes;
int result = 0;
for(int i=bytes; i>0; --i) result|= (*(str+i)&127)<<(6*(bytes-i));
int mask = 1;
for(int i=bytes; i<6; ++i) mask<<= 1, mask|= 1;
result|= (*str&mask)<<(6*bytes);
return result;
}
Then you can try to use some wild/ancient/non-standard winAPI function like MultiByteToWideChar (don't forget to call setlocale()
before!)
or you can use your own mapping from Unicode table to your active working codepage. Example:
int main() {
system("chcp 65001 > nul");
char str[] = "příšerně"; // file saved in UTF-8
for(const char* p=str; *p!=0; p=ucshift(p)) {
int c = ucchar(p);
if(c<128) printf("%c\n",c);
else printf("%d\n",c);
}
}
This should print
p
345
237
353
e
r
n
283
If your codepage doesn't support that Czech interpunction, you could map 345=>r, 237=>i, 353=>s, 283=>e. There are at least 5(!) different charsets just for Czech. To display readable characters on different Windows locale is a horror.