
I am trying to figure out how an 8-bit char data type can contain all these weird characters, since they are not part of the first 256 entries of the character table.

#include <iostream>

int main()
{
    char chars[] = "😎 🥸 🤩 🥳 必   西 ♠ ♬   ♭   ♮   ♯";
    std::cout << "sizeof(char): " << sizeof(char) << " byte" << std::endl;
    std::cout << chars << std::endl;
    return 0;
}


Jimmy Loyola
  • Looks rather like UTF-8 – there, some characters are encoded with more than a single byte... – Aconcagua Apr 23 '22 at 00:10
  • Now try `sizeof(chars)`, and see if it matches your expectations. So, if `chars` had 3 letters of the alphabet, `sizeof()` would be 4 (extra `\0`). Now, see what `sizeof(chars)` shows you, then add and subtract one emoji at a time, see by how much it changes, and you can pretty much figure out the answer all by yourself. – Sam Varshavchik Apr 23 '22 at 00:10
  • @SamVarshavchik Yeah, but that's not my point. I mean, I wouldn't expect that characters like 必 and 西 could be stored in an 8-bit char data type. The fact that it's in an array doesn't matter. – Jimmy Loyola Apr 23 '22 at 00:19
  • Why would it not matter? As far as the computer is concerned, all data types are made up of single bytes; all the higher-level language types are just ways to address more than one memory location at a time and do something with the result. – Lev M. Apr 23 '22 at 00:20
  • Write that array to a file in binary mode and inspect it with a hex editor. Or look at the raw bytes in a debugger. – pm100 Apr 23 '22 at 00:24
  • Ok I've got it, it's clear now. Thank you. – Jimmy Loyola Apr 23 '22 at 00:29
  • Of course it matters, very much, that it's in an array. After all, you realize yourself that a single `char` won't be enough. So, you need more than one. – Sam Varshavchik Apr 23 '22 at 00:29
  • @SamVarshavchik I thought it was the equivalent of writing `char chars[1] = {''};` – Jimmy Loyola Apr 23 '22 at 00:30
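
Sam Varshavchik's sizeof experiment from the comments is easy to reproduce. A minimal sketch (assuming the source file is saved and compiled as UTF-8, which GCC and Clang do by default; MSVC may need /utf-8):

#include <iostream>

int main()
{
    char ascii[] = "abc";  // 3 one-byte letters + '\0'         -> sizeof == 4
    char spade[] = "♠";    // U+2660 is 3 bytes in UTF-8 + '\0'  -> sizeof == 4
    char kanji[] = "必西"; // two 3-byte characters + '\0'       -> sizeof == 7

    std::cout << sizeof(ascii) << std::endl; // 4
    std::cout << sizeof(spade) << std::endl; // 4
    std::cout << sizeof(kanji) << std::endl; // 7
    return 0;
}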

1 Answer


An 8-bit char can hold at most 256 distinct values. But Unicode has hundreds of thousands of characters. They obviously can't fit into a single char, so they have to be encoded in such a way that one character can span multiple chars.

Your editor/compiler is likely storing your example string in UTF-8 encoding. Non-ASCII characters in UTF-8 take up more than 1 char.
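
For instance, a single ♠ (U+2660) occupies three chars. A minimal sketch (assuming UTF-8 source encoding):

#include <iostream>
#include <string>

int main()
{
    std::string spade = "♠";                 // one glyph on screen...
    std::cout << spade.size() << std::endl;  // ...but 3 chars: 0xE2 0x99 0xA0
    std::cout << spade << std::endl;         // the terminal reassembles ♠ from those 3 bytes
    return 0;
}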

In your example, in UTF-8, sizeof(chars) would be 55+1=56 chars in size (+1 for the null terminator), even though you see only 29 "characters" (if you count the spaces), where:

(space) = 0x20 (18x)
😎 = 0xF0 0x9F 0x98 0x8E
🥸 = 0xF0 0x9F 0xA5 0xB8
🤩 = 0xF0 0x9F 0xA4 0xA9
🥳 = 0xF0 0x9F 0xA5 0xB3
必 = 0xE5 0xBF 0x85
西 = 0xE8 0xA5 0xBF
♠ = 0xE2 0x99 0xA0
♬ = 0xE2 0x99 0xAC
♭ = 0xE2 0x99 0xAD
♮ = 0xE2 0x99 0xAE
♯ = 0xE2 0x99 0xAF
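
You can verify these values yourself, much as pm100 suggested in the comments. A small sketch (the dump helper is just an illustration; it assumes the literals are stored as UTF-8):

#include <iomanip>
#include <iostream>

// Print each char of a string as a hex byte, the way a hex editor would show it.
void dump(const char* s)
{
    for (; *s != '\0'; ++s)
        std::cout << std::uppercase << std::hex << std::setw(2) << std::setfill('0')
                  << static_cast<int>(static_cast<unsigned char>(*s)) << ' ';
    std::cout << std::dec << std::endl;
}

int main()
{
    dump("必"); // E5 BF 85
    dump("西"); // E8 A5 BF
    dump("♠");  // E2 99 A0
    return 0;
}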

Remy Lebeau
  • Sorry, I've just edited my question to better clarify it. Thanks – Jimmy Loyola Apr 23 '22 at 04:38
  • You have completely changed the semantics of your question into something different, and basically invalidated everything that was said earlier about the original question. I suggest you revert your edit and post a separate question instead. But basically, the very first sentence in my answer addresses your new question about why you are getting the error. You can't represent those Unicode characters using a single `char`, but they do fit in a single `wchar_t`. – Remy Lebeau Apr 23 '22 at 04:45
  • I'll do that, this is what I actually meant. – Jimmy Loyola Apr 23 '22 at 04:49
  • Yes, I know, but why can I do this in a string literal of `char c[]`? – Jimmy Loyola Apr 23 '22 at 04:54
  • @JimmyLoyola because the standard allows it, do you really need an explanation why? – Remy Lebeau Apr 23 '22 at 05:04
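
To illustrate Remy Lebeau's point about `wchar_t`, a minimal sketch (assuming a platform where BMP characters such as 必 fit in a single `wchar_t`; `wchar_t` is 32 bits on Linux but only 16 bits on Windows, so characters outside the BMP, like the emoji, may still need two of them):

#include <iostream>

int main()
{
    wchar_t w = L'必';      // U+5FC5 fits in a single wchar_t
    wchar_t ws[] = L"必西"; // one wchar_t per character, plus L'\0'

    std::cout << static_cast<unsigned long>(w) << std::endl; // 24517 (0x5FC5)
    std::cout << sizeof(ws) / sizeof(wchar_t) << std::endl;  // 3 (2 + terminator)
    return 0;
}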