Why do character arrays accept non-ASCII characters in C++?

Question

I want to be able to use Chinese characters in my C++ program, so I need some type that can hold characters beyond the ASCII range.

However, I tried to run the following code, and it worked.

    #include <iostream>

    int main() {
      char snet[4];
      snet[0] = '你';
      snet[1] = '爱';
      snet[2] = '我';
      std::cout << snet << std::endl;
      int conv = static_cast<int>(snet[0]);
      std::cout << conv << std::endl; // -96
    }

This doesn't make sense to me: sizeof(char) evaluates to 1 with the g++ compiler, yet a Chinese character cannot be expressed in a single byte.

Why are the Chinese characters here being allowed to be housed in a char type?

What type should be used to house Chinese characters or non-ASCII characters in C++?

`sizeof(char)` *must* be 1, even if the size of a `char` is multiple bytes on the machine. — Justin, Jan 12 '18 at 07:07
But then what happens to those extra bytes? Is it overflowing? — Josh Weinstein, Jan 12 '18 at 07:09
No. It's just a quirk in C++. The standard says it must be 1, so `sizeof(char)` is 1. The unit wouldn't be bytes in the normal sense of the word then — Justin, Jan 12 '18 at 07:11
@Justin the unit of `sizeof` is a byte i.e. smallest addressable unit of memory. `char` is a single byte on all systems. — eerorika, Jan 12 '18 at 07:12
I will suggest you go through this link once http://www.cplusplus.com/forum/windows/11802/ — Jai Prak, Jan 12 '18 at 07:19
@JaiPrak Informative, but does not explain why non ASCII characters are allowed in the `char` type, or `char` arrays. — Josh Weinstein, Jan 12 '18 at 07:28
An ASCII character in 8-bit ASCII encoding is 8 bits (1 byte), though it can fit in 7 bits. A Unicode character in UTF-8 encoding is between 8 bits (1 byte) and 32 bits (4 bytes). Chinese characters are bigger in size than ASCII chars. — Jai Prak, Jan 12 '18 at 07:34
You may get [multicharacter literal warnings](https://stackoverflow.com/q/3960954/995714), and [`sizeof(char)` is always 1](https://stackoverflow.com/q/2215445/995714) [regardless of how many bits are there in a `char`](https://stackoverflow.com/q/2098149/995714) — phuclv, Jan 12 '18 at 08:46
@JaiPrak: 8 bits is an octet. In C++, a byte is not necessarily an octet. — MSalters, Jan 12 '18 at 09:27
By "it worked", do you mean the characters were printed, or simply that the code compiled? — chris, Jan 12 '18 at 18:08

score 4 · Accepted Answer · edited Jan 12 '18 at 09:05

When you compile the code with the -Wall flag, you will see warnings like:

warning: overflow in implicit constant conversion [-Woverflow] snet[2] = '我';

warning: multi-character character constant [-Wmultichar] snet[1] = '爱';

Visual C++ in Debug mode gives the following warning:

c:\users\you\temp.cpp(9): warning C4566: character represented by universal-character-name '\u4F60' cannot be represented in the current code page (1252)

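What is happening under the hood is that each multi-byte Chinese character (three bytes in a UTF-8 source file) is treated as a multi-character constant of type int and then implicitly converted to a char. The value does not fit, so with g++ only the low-order byte survives. That is why you see a negative value: the -96 in the question is 0xA0, the last byte of the UTF-8 encoding of 你 (E4 BD A0), read as a signed char. Printing such bytes to the console can likewise produce something odd.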
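
To make the truncation concrete, here is a minimal sketch, assuming a UTF-8 source file and g++'s rule of packing the bytes of a multi-character constant into an int, first byte highest. The helper name low_byte_as_signed_char is hypothetical, and the packed value 0xE4BDA0 is written out by hand from the UTF-8 bytes of 你 rather than taken from a literal, because the value of a multi-character constant is implementation-defined.

```cpp
// Hypothetical helper: keeps only the low-order 8 bits of `value` and
// interprets them as a two's-complement signed 8-bit number. This mirrors
// what g++ effectively does when an out-of-range int is stored in a
// signed char.
int low_byte_as_signed_char(int value) {
    int low = value & 0xFF;               // discard everything above 8 bits
    return low >= 128 ? low - 256 : low;  // reinterpret the byte as signed
}

// Assumption: '你' in a UTF-8 source file is packed by g++ as 0xE4BDA0
// (UTF-8 bytes E4 BD A0).
constexpr int packed_ni = 0xE4BDA0;
```

With these definitions, low_byte_as_signed_char(packed_ni) evaluates to -96, matching the output in the question. The two leading bytes, E4 and BD, are simply lost.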

Why are the Chinese characters here being allowed to be housed in a char type?

You can, but you shouldn't, in the same way that you can write char c = 1000000; even though the value does not fit.

What type should be used to house Chinese characters or non-ASCII characters in C++?

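If you want to store Chinese characters and you can use C++11, go for UTF-8 encoding with std::string:

    std::string msg = u8"你爱我";

Note that from C++20 on, u8 string literals have type const char8_t[], so this exact initialization no longer compiles there. Use std::u8string or an ordinary narrow literal in a UTF-8 source file instead.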
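
As a rough illustration of what storing UTF-8 in a std::string means, the sketch below spells the three characters from the question as explicit bytes, so the result does not depend on the source file's encoding or the language version. The function names sample_utf8 and count_code_points are hypothetical, and the counter assumes its input is valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// The text "你爱我" as raw UTF-8 bytes: E4 BD A0, E7 88 B1, E6 88 91.
std::string sample_utf8() {
    return "\xE4\xBD\xA0\xE7\x88\xB1\xE6\x88\x91";
}

// Hypothetical helper: counts code points in a valid UTF-8 string by
// counting every byte that is not a continuation byte (10xxxxxx).
std::size_t count_code_points(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

Here sample_utf8().size() is 9 while count_code_points(sample_utf8()) is 3. A std::string stores and measures bytes, not characters, so each Chinese character occupies three char elements rather than one.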

*you are seeing a negative value or something weird when you print it in the console* - According to the question, the program worked when ran. The negative value of a `char` just means that `char` is signed in the OP's system and the value wasn't in the positive signed char range. — chris, Jan 12 '18 at 08:23
