
I've been looking at the Unicode chart, and I know that the first 128 code points (0–127) are the same in almost every encoding scheme: ASCII (the original), UCS-2, ANSI, UTF-8, UTF-16, UTF-32, and anything else.

I wrote a loop to go through the characters starting from decimal 122, which is lowercase "z". After that there are a few more characters such as {, |, and }. Then it gets into no-man's land, which is roughly 30 "control characters", and then the printable characters begin again at 161 with an inverted exclamation mark, 162 with the cent sign (a "c" with a stroke through it), and so on.

The problem is, my results don't correspond to the Unicode chart, the UTF-8 chart, or the UCS-2 chart; the symbols seem random. By the way, the reason I made the character variable a four-byte int is that when I was using "char" (which is essentially a one-byte signed data type), it cycled back to -128 after 127, and I thought this might be messing things up.

I know I'm doing something wrong; can anyone figure out what's happening? This happens whether I set the character set to Unicode or Multi-byte in the project settings. Here is the code; you can run it.

#include <iostream>

using namespace std;

int main()
{
    unsigned int character = 122; // Starting at "z"
    for (int i = 0; i < 100; i++)
    {
        cout << (char)character << endl;
        cout << "decimal code point = " << (int)character << endl;
        cout << "size of character =  " <<  sizeof(character) << endl;
        character++;
        system("pause");
        cout << endl;
    }

    return 0;
}

By the way, here is the Unicode chart

http://unicode-table.com/en/#control-character

Zebrafish
  • What, precisely, do you consider to be wrong? – AndyG Jan 18 '16 at 20:32
  • The font used by `cout` might depend on the locale. Where I live we used to have åäö right after z in 7-bit "ASCII". – Bo Persson Jan 18 '16 at 20:34
  • @AndyG Well, for example, if you assign the value 163 (U+00A3) to a data type (let's forget the 'char' type, since char is signed and cycles around to -128 after 127; say an int, a short int, or an unsigned 8-bit type if one exists) and print it as a char, you should get the pound sign. No? But none of my numbers correspond to the Unicode chart; it's all just random after 127. – Zebrafish Jan 18 '16 at 20:44
  • The bytes you are printing do not get interpreted as Unicode. If I recall correctly, they get interpreted as Windows-1252. – user253751 Jan 18 '16 at 21:10
  • @immibis No, sorry, I just looked up the 1252 code page and it's nothing like what I'm getting from that program I wrote up there. Following the tilde (~), decimal code point 126 (0x007E), I get a question mark in a box, then a 'c' with a cedilla (tail), then a 'u' with an umlaut/diaeresis (two dots above it), then an 'e' with an acute accent, then an 'a' with a circumflex accent, then an 'a' with an umlaut, then an 'e' with a grave accent, then an 'i' with an umlaut, etc. This is so confusing. It's supposed to be Unicode!!! I don't get it. – Zebrafish Jan 19 '16 at 00:00
  • @TitoneMaurice It is ***not*** supposed to be Unicode. You're not using any API functions that support Unicode ("wide characters"), you're not using wide streams, and you're printing `char`s which can't hold values higher than 255 anyway. – user253751 Jan 19 '16 at 00:03
  • @immibis Sorry, I know signed one-byte chars can only store from -128 to 127, which is why I made it an int (4 bytes); it follows the Unicode chart up to 127 and then goes haywire. If I wanted to print the British pound sign, for example, U+00A3 (decimal 163), how would I do it? (One possible approach is sketched after this comment thread.) – Zebrafish Jan 19 '16 at 00:23
  • 1) Project settings affect which Win32 API functions get called: …A or …W. A is for ANSI, W is for UTF-16. – Tom Blodget Jan 19 '16 at 01:05
  • 2) ANSI is not one character set. It is the one (of many) that the thread is using for the Win32 …A functions. – Tom Blodget Jan 19 '16 at 01:06
  • 3) Unicode is a character set with many encodings. One, UTF-16, is often miscalled Unicode. – Tom Blodget Jan 19 '16 at 01:07
  • 1
    What you see in your `cout` statements all depends on the font used by the console (or other output) window you're using and not the settings in your program. – 1201ProgramAlarm Jan 19 '16 at 01:16
  • @Tom Blodget Yeah, I'm aware that Microsoft has created a misnomer with their use of "Unicode", which is actually the UTF-16 encoding. If you just run the program you'll see that the characters I get match no character encoding scheme I can find: none of the UTFs, ANSI, or Windows-1252. Also, as this is a console program I didn't think calling A or W functions mattered; I'm just calling cout, or I even tried printf(). Are there A and W versions of these? Again, how do I print the British pound sign, for example (U+00A3, decimal 163)? I'm starting to think this is an endianness problem? – Zebrafish Jan 19 '16 at 01:57
  • @1201ProgramAlarm Thanks for your help, I tried to change the fonts in the command prompt but nothing changes. – Zebrafish Jan 19 '16 at 01:59
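
For the pound-sign question raised in the comments above, here is one commonly suggested approach as a minimal sketch (MSVC/Windows-specific, and not necessarily what the asker ended up using): put stdout into UTF-16 mode and write through the wide stream, which bypasses the console code page entirely.

#include <fcntl.h>    // _O_U16TEXT
#include <io.h>       // _setmode
#include <cstdio>     // _fileno, stdout
#include <iostream>

int main()
{
    // Switch stdout to UTF-16 mode so wide-character output reaches the
    // console as Unicode instead of being translated through the OEM code page.
    _setmode(_fileno(stdout), _O_U16TEXT);

    std::wcout << L"\u00A3" << std::endl;  // U+00A3 POUND SIGN

    // Caveat: once the stream is in _O_U16TEXT mode, narrow output
    // (std::cout, printf) on the same stream is not allowed.
    return 0;
}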

1 Answer


Very likely the bytes you're printing are displayed using the console code page (sometimes referred to as the OEM code page), which may be different from the local single- or double-byte character set used by Windows applications (called ANSI).

For instance, on my English language Windows install ANSI means windows-1252, while a console by default uses code page 850.
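
If you want to check which code pages your own machine uses, a small sketch (Windows-only; GetACP and GetConsoleOutputCP are plain Win32 calls) that prints both:

#include <windows.h>
#include <iostream>

int main()
{
    // ANSI code page used by the ...A Win32 functions (e.g. 1252 on an English install)
    std::cout << "ANSI code page:    " << GetACP() << '\n';

    // Code page the console currently uses to interpret output (e.g. 850 or 437)
    std::cout << "Console code page: " << GetConsoleOutputCP() << '\n';

    return 0;
}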

There are a few ways to write arbitrary Unicode characters to the console; see How to Output Unicode Strings on the Windows Console.
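
For example, here is a sketch of just one of those ways: switch the console output code page to UTF-8 and write the UTF-8 byte sequence for the character you want.

#include <windows.h>
#include <iostream>

int main()
{
    // Ask the console to interpret the bytes we write as UTF-8 (code page 65001).
    SetConsoleOutputCP(CP_UTF8);

    // U+00A3 POUND SIGN encoded in UTF-8 is the two bytes 0xC2 0xA3.
    std::cout << "\xC2\xA3" << std::endl;

    return 0;
}

Whether the glyph actually renders also depends on the console font: the old raster fonts only cover the OEM code page, so a TrueType font such as Lucida Console or Consolas may be needed.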

roeland
  • You hit it on the head, thanks, bro. That's exactly the code page I get; I finally found it thanks to you. Now to go about fixing this. Thanks. – Zebrafish Jan 19 '16 at 05:18
  • @roeland From Wikipedia: "Systems largely replaced code page 850 with, firstly, Windows-1252 (often mislabeled as ISO-8859-1), and later with UCS-2, and finally with UTF-16." So why is it I'm not getting proper UTF-16? Is Microsoft or C++ to blame? I'm so frustrated. – Zebrafish Jan 19 '16 at 05:25
  • @TitoneMaurice The C++ standard doesn't specify how the bytes you write to the output stream are interpreted. And Windows has supported UTF-16 for a long time (the reason UCS-2 is mentioned is that Windows already supported Unicode when it was still a 16-bit character set). – roeland Jan 19 '16 at 20:33