
I've been looking at the Unicode chart, and I know that the first 128 code points (0–127) are the same in almost every encoding scheme: ASCII (the original), UCS-2, ANSI, UTF-8, UTF-16, UTF-32, and anything else.

I wrote a loop to go through the characters starting from decimal 122, which is lowercase "z". After that there are a few more characters such as {, |, and }. Then it gets into no-man's land, which is roughly 30 "control characters", and then the printable characters begin again at 161 with an inverted exclamation mark, 162 with the cent sign (a "c" with a stroke through it), and so on.

The problem is, my results don't correspond to the Unicode chart, the UTF-8 chart, or the UCS-2 chart; the symbols seem random. By the way, the reason I made the character variable a four-byte int is that when I was using "char" (which is essentially a one-byte signed data type), it cycled back to -128 after 127, and I thought this might be messing things up.

I know I'm doing something wrong; can anyone figure out what's happening? This happens whether I set the character set to Unicode or Multi-byte in the project settings. Here is the code; you can run it.

#include <iostream>

using namespace std;

int main()
{
    unsigned int character = 122; // Starting at "z"
    for (int i = 0; i < 100; i++)
    {
        cout << (char)character << endl;
        cout << "decimal code point = " << (int)character << endl;
        cout << "size of character =  " <<  sizeof(character) << endl;
        character++;
        system("pause");
        cout << endl;
    }

    return 0;
}

By the way, here is the Unicode chart

http://unicode-table.com/en/#control-character

Zebrafish
  • What, precisely, do you consider to be wrong? – AndyG Jan 18 '16 at 20:32
  • The font used by `cout` might depend on the locale. Where I live we used to have åäö right after z in 7-bit "ASCII". – Bo Persson Jan 18 '16 at 20:34
  • @AndyG Well, for example, if you assign the value 163 (U+00A3) to a data type (let's forget the 'char' type, since char is signed and cycles around to -128 after 127; say an int, a short int, or an unsigned 8-bit type if one exists) and print it as a char, you should get the pound sign. No? But none of my numbers correspond to the Unicode chart; it's all just random after 127. – Zebrafish Jan 18 '16 at 20:44
  • The bytes you are printing do not get interpreted as Unicode. If I recall correctly, they get interpreted as Windows-1252. – user253751 Jan 18 '16 at 21:10
  • @immibis No, sorry, I just looked up the 1252 code page and it's nothing like what I'm getting from that program I wrote up there. Following the tilde (~), decimal code point 126 (0x007E), I get a question mark in a box, then a 'c' with a cedilla (tail), then a 'u' with an umlaut/diaeresis (two dots above it), then an 'e' with an acute accent, then an 'a' with a circumflex accent, then an 'a' with an umlaut, then an 'e' with a grave accent, then an 'i' with an umlaut, etc. This is so confusing. It's supposed to be Unicode!!! I don't get it. – Zebrafish Jan 19 '16 at 00:00
  • @TitoneMaurice It is ***not*** supposed to be Unicode. You're not using any API functions that support Unicode ("wide characters"), you're not using wide streams, and you're printing `char`s which can't hold values higher than 255 anyway. – user253751 Jan 19 '16 at 00:03
  • @immibis Sorry, I know signed one-byte chars can only store from -128 to 127, which is why I made it an int (4 bytes); it follows the Unicode chart up to 127 and then goes haywire. If I wanted to print the British pound sign, for example, U+00A3 (decimal 163), how would I do it? (One possible approach is sketched after this comment thread.) – Zebrafish Jan 19 '16 at 00:23
  • 1) Project settings affect which Win32 API functions get called: …A or …W. A is for ANSI, W is for UTF-16. – Tom Blodget Jan 19 '16 at 01:05
  • 2) ANSI is not one character set. It is the one (of many) that the thread is using for the Win32 …A functions. – Tom Blodget Jan 19 '16 at 01:06
  • 3) Unicode is a character set with many encodings. One, UTF-16, is often miscalled Unicode. – Tom Blodget Jan 19 '16 at 01:07
  • 1
    What you see in your `cout` statements all depends on the font used by the console (or other output) window you're using and not the settings in your program. – 1201ProgramAlarm Jan 19 '16 at 01:16
  • @Tom Blodget Yeah, I'm aware that Microsoft has created a misnomer with their use of "Unicode", which is actually the UTF-16 encoding. If you just run the program you'll see that the characters I get match no character encoding scheme I can find: none of the UTFs, ANSI, or Windows-1252. Also, as this is a console program I didn't think calling A or W functions mattered; I'm just calling cout, or I even tried printf(). Are there A and W versions of these? Again, how do I print the British pound sign, for example (U+00A3, decimal 163)? I'm starting to think this is an endianness problem? – Zebrafish Jan 19 '16 at 01:57
  • @1201ProgramAlarm Thanks for your help, I tried to change the fonts in the command prompt but nothing changes. – Zebrafish Jan 19 '16 at 01:59
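
For the pound-sign question raised in the comments above, here is one commonly suggested approach as a minimal sketch (MSVC/Windows-specific, and not necessarily what the asker ended up using): put stdout into UTF-16 mode and write through the wide stream, which bypasses the console code page entirely.

#include <fcntl.h>    // _O_U16TEXT
#include <io.h>       // _setmode
#include <cstdio>     // _fileno, stdout
#include <iostream>

int main()
{
    // Switch stdout to UTF-16 mode so wide-character output reaches the
    // console as Unicode instead of being translated through the OEM code page.
    _setmode(_fileno(stdout), _O_U16TEXT);

    std::wcout << L"\u00A3" << std::endl;  // U+00A3 POUND SIGN

    // Caveat: once the stream is in _O_U16TEXT mode, narrow output
    // (std::cout, printf) on the same stream is not allowed.
    return 0;
}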

1 Answer


Very likely the bytes you're printing are displayed using the console code page (sometimes referred to as the OEM code page), which may be different from the local single- or double-byte character set used by Windows applications (called ANSI).

For instance, on my English language Windows install ANSI means windows-1252, while a console by default uses code page 850.
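
If you want to check which code pages your own machine uses, a small sketch (Windows-only; GetACP and GetConsoleOutputCP are plain Win32 calls) that prints both:

#include <windows.h>
#include <iostream>

int main()
{
    // ANSI code page used by the ...A Win32 functions (e.g. 1252 on an English install)
    std::cout << "ANSI code page:    " << GetACP() << '\n';

    // Code page the console currently uses to interpret output (e.g. 850 or 437)
    std::cout << "Console code page: " << GetConsoleOutputCP() << '\n';

    return 0;
}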

There are a few ways to write arbitrary Unicode characters to the console; see How to Output Unicode Strings on the Windows Console.
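
For example, here is a sketch of just one of those ways: switch the console output code page to UTF-8 and write the UTF-8 byte sequence for the character you want.

#include <windows.h>
#include <iostream>

int main()
{
    // Ask the console to interpret the bytes we write as UTF-8 (code page 65001).
    SetConsoleOutputCP(CP_UTF8);

    // U+00A3 POUND SIGN encoded in UTF-8 is the two bytes 0xC2 0xA3.
    std::cout << "\xC2\xA3" << std::endl;

    return 0;
}

Whether the glyph actually renders also depends on the console font: the old raster fonts only cover the OEM code page, so a TrueType font such as Lucida Console or Consolas may be needed.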

roeland
  • You hit it on the head, thanks, bro. That's exactly the code page I get; I finally found it thanks to you. Now to go about fixing this. Thanks. – Zebrafish Jan 19 '16 at 05:18
  • @roeland From Wikipedia: "Systems largely replaced code page 850 with, firstly, Windows-1252 (often mislabeled as ISO-8859-1), and later with UCS-2, and finally with UTF-16." So why is it I'm not getting proper UTF-16? Is Microsoft or C++ to blame? I'm so frustrated. – Zebrafish Jan 19 '16 at 05:25
  • @TitoneMaurice The C++ standard doesn't specify how the bytes you write to the output stream are interpreted. And Windows has supported UTF-16 for a long time (the reason UCS-2 is mentioned is that Windows already supported Unicode when it was still a 16-bit character set). – roeland Jan 19 '16 at 20:33