
I want to use Chinese characters in my C++ program, so I need a type that can hold characters beyond the ASCII range.

However, I tried running the following code, and it worked:

    #include <iostream>

    int main() {
      char snet[4];
      snet[0] = '你';
      snet[1] = '爱';
      snet[2] = '我';
      std::cout << snet << std::endl;
      int conv = static_cast<int>(snet[0]);
      std::cout << conv << std::endl; // -96
    }

This doesn't make sense: sizeof(char) evaluates to 1 under g++, yet Chinese characters cannot be represented in a single byte.

Why are the Chinese characters here being allowed to be housed in a char type?

What type should be used to house Chinese characters or non-ASCII characters in C++?

Josh Weinstein
  • `sizeof(char)` *must* be 1, even if the size of a `char` is multiple bytes on the machine. – Justin Jan 12 '18 at 07:07
  • But then what happens to those extra bytes? Is it overflowing? – Josh Weinstein Jan 12 '18 at 07:09
  • No. It's just a quirk in C++. The standard says it must be 1, so `sizeof(char)` is 1. The unit wouldn't be bytes in the normal sense of the word then – Justin Jan 12 '18 at 07:11
  • @Justin the unit of `sizeof` is a byte, i.e. the smallest addressable unit of memory. `char` is a single byte on all systems. – eerorika Jan 12 '18 at 07:12
  • I suggest you read through this link: http://www.cplusplus.com/forum/windows/11802/ – Jai Prak Jan 12 '18 at 07:19
  • @JaiPrak Informative, but does not explain why non ASCII characters are allowed in the `char` type, or `char` arrays. – Josh Weinstein Jan 12 '18 at 07:28
  • An ASCII character in 8-bit ASCII encoding is 8 bits (1 byte), though it can fit in 7 bits. A Unicode character in UTF-8 encoding is between 8 bits (1 byte) and 32 bits (4 bytes). Chinese characters are bigger in size than ASCII chars. – Jai Prak Jan 12 '18 at 07:34
  • You may get [multicharacter literal warnings](https://stackoverflow.com/q/3960954/995714), and [`sizeof(char)` is always 1](https://stackoverflow.com/q/2215445/995714) [regardless of how many bits there are in a `char`](https://stackoverflow.com/q/2098149/995714) – phuclv Jan 12 '18 at 08:46
  • @JaiPrak: 8 bits is an octet. In C++, a byte is not necessarily an octet. – MSalters Jan 12 '18 at 09:27
  • By "it worked", do you mean the characters were printed, or simply that the code compiled? – chris Jan 12 '18 at 18:08

1 Answer

If you compile the code with the -Wall flag, you will see warnings like:

warning: overflow in implicit constant conversion [-Woverflow] snet[2] = '我';

warning: multi-character character constant [-Wmultichar] snet[1] = '爱';

Visual C++ in Debug mode gives the following warning:

c:\users\you\temp.cpp(9): warning C4566: character represented by universal-character-name '\u4F60' cannot be represented in the current code page (1252)

What is happening behind the scenes is that each multi-byte Chinese character literal is implicitly converted to a single char. The value does not fit, so the conversion overflows, which is why you see a negative value or something odd when you print it to the console.

Why are the Chinese characters here being allowed to be housed in a char type?

You can, but you shouldn't, in the same way that you can write char c = 1000000; and get a value that does not fit.

What type should be used to house Chinese characters or non-ASCII characters in C++?

If you want to store Chinese characters and you can use C++11, use UTF-8 encoding with std::string. Note that since C++20 a u8 literal has type const char8_t[], so the initialization below compiles only up to C++17.

    std::string msg = u8"你爱我";
Jive Dadson
FrankS101
  • *you are seeing a negative value or something weird when you print it in the console* - According to the question, the program worked when run. The negative value of a `char` just means that `char` is signed on the OP's system and the value wasn't in the positive signed char range. – chris Jan 12 '18 at 08:23