
I have a simple line of code:

```cpp
std::cout << std::hex << static_cast<int>('©');
```

This character is the Copyright Sign Emoji; its code is `a9`, but the app writes `c2a9`. The same happens with lots of Unicode characters. Another example: ™ (its code is `2122`) suddenly returns `e284a2`. Why does C++ return the wrong codes for some characters, and how can I fix this?

Note: I'm using Microsoft Visual Studio, and the file with my code is saved in UTF-8.

Irimitlad
  • `c2 a9` is the UTF-8 encoding of `©` – Miles Budnek Aug 03 '22 at 22:20
  • [What is UTF-8?](https://stackoverflow.com/q/2241348/4641116) – Eljay Aug 03 '22 at 22:47
    "my code is saved in UTF-8" -- well, there's your answer. – Sam Varshavchik Aug 03 '22 at 23:01
  • @SamVarshavchik The source encoding should not actually affect this. Execution character set encoding and source character set encoding are independent, and as long as the compiler is not misled about the intended encoding of the source (and does support it), it should always translate to the execution character set encoding. – user17732522 Aug 04 '22 at 02:43

1 Answer


An ordinary character literal (one without a prefix) usually has type char and can store only elements of the execution character set that are representable as a single byte.

If the character is not representable this way, the character literal is only conditionally-supported, with type int and an implementation-defined value. Compilers typically warn when this happens under the common generic warning flags, since it is a mistake most of the time; whether you see such a warning depends on exactly which flags you have enabled.

A byte is typically 8 bits, so it is impossible to store all of Unicode in a single one. I don't know what execution character set your implementation uses, but clearly neither © nor ™ is in it.
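
To put rough numbers on that mismatch, here is a minimal sketch (assuming a mainstream platform where `CHAR_BIT` is 8):

```cpp
#include <climits>
#include <iostream>

int main() {
    // A char spans CHAR_BIT bits (8 on mainstream platforms), so it has at
    // most 2^CHAR_BIT distinct values; Unicode defines 0x110000 (1,114,112)
    // code points, U+0000 through U+10FFFF.
    std::cout << "bits in a char:       " << CHAR_BIT << '\n'
              << "distinct char values: " << (1L << CHAR_BIT) << '\n'
              << "Unicode code points:  " << 0x110000 << '\n';
}
```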

It also seems that your implementation chose to support the non-representable characters by encoding them in UTF-8 and using that as the value of the literal. What you are seeing is the numeric value of the UTF-8 encoding of the two characters.
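
You can see where those bytes come from by inspecting the UTF-8 encoding directly. A minimal sketch, assuming a C++20 compiler for `char8_t` (`u8` string literals are guaranteed to be UTF-8-encoded):

```cpp
#include <cstddef>
#include <iostream>

int main() {
    // A u8 string literal is guaranteed to be UTF-8; its element type is
    // char8_t in C++20 (it was char in C++17). The compiler must know the
    // source file is UTF-8, e.g. via MSVC's /utf-8 switch.
    const char8_t utf8[] = u8"©";  // two code units plus the terminating 0
    std::cout << std::hex;
    for (std::size_t i = 0; i + 1 < sizeof utf8; ++i)
        std::cout << static_cast<unsigned>(utf8[i]) << ' ';
    std::cout << '\n';  // prints: c2 a9 -- the 0xc2a9 you observed
}
```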

If you want the numeric value of the Unicode code point for the character, you should use a character literal with the U prefix, which gives the character's value according to UTF-32 with type char32_t, a type large enough to hold all Unicode code points:

```cpp
std::cout << std::hex << static_cast<std::uint_least32_t>(U'©');
```
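
For completeness, a self-contained version of that line (`std::uint_least32_t` comes from `<cstdint>`); assuming the compiler knows the source is UTF-8 (with MSVC, the `/utf-8` switch), it should print `a9` and `2122`:

```cpp
#include <cstdint>
#include <iostream>

int main() {
    // U'...' literals have type char32_t and hold the UTF-32 value of the
    // character, which for a single code point is the code point itself.
    std::cout << std::hex
              << static_cast<std::uint_least32_t>(U'©') << '\n'   // a9
              << static_cast<std::uint_least32_t>(U'™') << '\n';  // 2122
}
```
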
user17732522
  • First, the Copyright Sign Emoji has code `a9`, so it can be stored in exactly one byte. That's why I used exactly this symbol as my first example. Second, why is there some non-representable character when it's just one `char`, not a string or something? It's really strange. – Irimitlad Aug 04 '22 at 02:27
  • Well, maybe I've understood it. It's really because of UTF-8. – Irimitlad Aug 04 '22 at 02:36
  • @Irimitlad What do you mean by "_its code_"? For example, in [ISO 8859-1](https://en.wikipedia.org/wiki/ISO_8859-1) encoding it does have value `0xA9`. But in e.g. [ISO 8859-2](https://en.wikipedia.org/wiki/ISO_8859-2) and [codepage 437](https://en.wikipedia.org/wiki/Codepage_437) it is not encoded. In [codepage 850](https://en.wikipedia.org/wiki/Code_page_850) it has value `0xB8`, and in [UTF-8](https://en.wikipedia.org/wiki/UTF-8#Encoding) it encodes to the two bytes `0xC2` and `0xA9`. A `char` has only 8 bits, but you need many more to have enough space to represent all Unicode code points. – user17732522 Aug 04 '22 at 02:46