Character Encoding in C++ internally?

Question

If I create a string literal with the u8 prefix, does the machine code know, and say, that the corresponding value of that variable should be encoded in UTF-8?

So that no matter where I run the program, the computer knows how to encode it every time? Or does the machine code not say "encode it like this"?

Because if I encode something as a normal char string and something else in UTF-8 (e.g. with u8), what is the difference, and how does the computer know the encoding if the machine code doesn't say anything about it?

No, encoding is not specified unless you add options to the string. Often you'll get whatever encoding your source code editor is using. — Mark Ransom, Oct 09 '21 at 17:18
This might be helpful: https://stackoverflow.com/a/67819605/1387438 — Marek R, Oct 09 '21 at 17:31

score 6 · Answer 1 · answered Oct 09 '21 at 17:37

u8"..." strings are always encoded in UTF-8, as specified in [lex.string]/1.

The encoding of "..." strings depends on the compiler (and on the source file encoding), but it shouldn't be hard to configure your IDE to save files in UTF-8, and your compiler to not touch UTF-8 in plain string literals.

In any case, the encoding is handled entirely at compile-time. In the compiled code the strings are just sequences of bytes; there is no conversion between encodings at runtime, unless you explicitly call some function that does that.
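A minimal sketch of that point, assuming a C++17 or C++20 compiler (the helper names `literal_bytes` and `utf8_umlaut_bytes` are made up for illustration). The literal is written with a `\u` escape so the source file encoding does not matter, and the compiler stores the two UTF-8 bytes of Ü in the program:

```cpp
#include <cstddef>
#include <vector>

// Copy the code units of a string literal (without the terminating null)
// into a vector of bytes. The template accepts both const char[N]
// (u8 literals in C++17) and const char8_t[N] (u8 literals in C++20).
template <typename CharT, std::size_t N>
std::vector<unsigned char> literal_bytes(const CharT (&lit)[N]) {
    std::vector<unsigned char> out;
    for (std::size_t i = 0; i + 1 < N; ++i)
        out.push_back(static_cast<unsigned char>(lit[i]));
    return out;
}

// U+00DC is the letter Ü. Its UTF-8 encoding is the two bytes 0xC3 0x9C.
inline std::vector<unsigned char> utf8_umlaut_bytes() {
    return literal_bytes(u8"\u00DC");
}

// The size is already fixed at compile time: two code units plus the null.
static_assert(sizeof(u8"\u00DC") == 3, "two UTF-8 code units plus terminator");
```

Nothing in the resulting bytes says "this is UTF-8"; the program just contains 0xC3 0x9C.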

HolyBlackCat

Thanks! So there is no reference to the encoding in the machine code, but the variable has only been assigned one value, right? When you run this machine code, how does the computer know that the value "30" should be encoded as UTF-8 as an example? I hope you understand what I mean, I'm a little confused. – Avva Oct 09 '21 at 17:46
@Avva I don't really understand, what do you mean by "should be encoded"? As in, how to interpret the string when printing it? – HolyBlackCat Oct 09 '21 at 17:59
Yes, sorry. That's what I meant. – Avva Oct 09 '21 at 18:19
@Avva It seems `cout` can't print `u8` strings directly (in C++20 they are `char8_t` strings). But in general, you just use a suitable function to print them. Such a function could, for example, be overloaded for `char8_t` and for other character types, and the right overload would be selected by overload resolution at compile time. – HolyBlackCat Oct 09 '21 at 18:44
Thanks!! Still one question: for example, the German umlaut Ü is encoded in the ISO-8859-1 character set with the decimal value 220. In the EBCDIC character set, the same value 220 encodes the curly bracket }. How does the program then know how to represent the right character? @HolyBlackCat – Avva Oct 09 '21 at 20:25
@Avva Again, depends on what you're doing with the characters. If you're printing them to the terminal, then they're most probably sent as plain bytes (without changing the encoding, at least by default). And it's the job of the terminal to decide how to interpret them (which encoding to use). – HolyBlackCat Oct 09 '21 at 20:50
thank you very much!! This was exactly what I needed. Helped me a lot, thanks!! – Avva Oct 09 '21 at 21:11
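
To illustrate the overloading idea from the comments above, here is a rough sketch. `print_text` and `printed_bytes` are hypothetical names, not standard functions. The `char8_t` overload only exists when the compiler supports `char8_t` (C++20), and it forwards the bytes unchanged, leaving their interpretation to whatever reads the stream:

```cpp
#include <ostream>
#include <sstream>
#include <string>

// Overload for ordinary narrow strings: the bytes are written as they are.
inline void print_text(std::ostream& os, const char* s) {
    os << s;
}

#if defined(__cpp_char8_t)
// Overload for C++20 u8 strings. No conversion happens here either:
// the UTF-8 code units are simply viewed as plain bytes and written out.
inline void print_text(std::ostream& os, const char8_t* s) {
    os << reinterpret_cast<const char*>(s);
}
#endif

// Writes the UTF-8 encoding of U+00DC (Ü) into a string stream.
// In C++20 the char8_t overload is chosen at compile time;
// in C++17 the literal is a char string and the first overload is used.
inline std::string printed_bytes() {
    std::ostringstream os;
    print_text(os, u8"\u00DC");
    return os.str();
}
```

Either way the stream receives the same two bytes; whether they show up as Ü depends on the encoding the terminal is set to.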

score 2 · Answer 2 · answered Oct 09 '21 at 18:09

If I create a string literal with the u8 prefix, does the machine code know, and say, that the corresponding value of that variable should be encoded in UTF-8?

The machine code knows nothing. The compiler encodes the literal as UTF-8 and generates the correct sequence of bytes.

So that no matter where I run the program, the computer knows how to encode it every time? Or does the machine code not say "encode it like this"?

The sequence of bytes is then emitted at runtime, and the output device that receives it will render it correctly if it knows how to. For example, a console that accepts UTF-8 will show the correct characters; otherwise garbage is shown.

Jean-Baptiste Yunès

Thanks!! That's what I needed, I was really confused. But one more thing: what if something has a value of, e.g., 220? In some other encoding it is a different character, right? So some encodings use the same value for different characters. How does it know the right one, the one the user wanted? – Avva Oct 09 '21 at 18:19
I mean, for example, the German umlaut Ü is encoded in the ISO-8859-1 character set with the decimal value 220. In the EBCDIC character set, the same value 220 encodes the curly bracket }. – Avva Oct 09 '21 at 18:40
The device that outputs the characters must be configured with an appropriate decoding algorithm. For example, in Unix-like environments you can use the environment variable LANG to set the right character encoding for any console/terminal. – Jean-Baptiste Yunès Oct 11 '21 at 07:32
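
A small sketch of why the receiving side has to pick the decoding, assuming two hypothetical helpers, `decode_latin1` and `decode_utf8_short`. The UTF-8 decoder here is deliberately incomplete (1- and 2-byte sequences only) and is not a replacement for a real one. The same bytes give different characters depending on which decoder is applied:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// ISO-8859-1: each byte maps to the Unicode code point with the same value.
inline std::u32string decode_latin1(const std::vector<unsigned char>& bytes) {
    std::u32string out;
    for (unsigned char b : bytes)
        out.push_back(static_cast<char32_t>(b));
    return out;
}

// Simplified UTF-8 decoder: handles only 1- and 2-byte sequences
// (U+0000..U+07FF). Everything else becomes U+FFFD. A real decoder must
// also handle 3- and 4-byte sequences and reject overlong encodings.
inline std::u32string decode_utf8_short(const std::vector<unsigned char>& bytes) {
    std::u32string out;
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        unsigned char b = bytes[i];
        if (b < 0x80) {
            out.push_back(b);  // ASCII: one byte, one character
        } else if ((b & 0xE0) == 0xC0 && i + 1 < bytes.size() &&
                   (bytes[i + 1] & 0xC0) == 0x80) {
            // Lead byte 110xxxxx followed by continuation byte 10xxxxxx.
            out.push_back(static_cast<char32_t>(((b & 0x1F) << 6) |
                                                (bytes[i + 1] & 0x3F)));
            ++i;
        } else {
            out.push_back(U'\uFFFD');  // not decodable by this sketch
        }
    }
    return out;
}
```

With these, the byte 220 (0xDC) decodes to Ü as ISO-8859-1 but is a lone lead byte in UTF-8, while the bytes 0xC3 0x9C decode to Ü as UTF-8 but to two unrelated characters as ISO-8859-1. The bytes themselves never say which reading is intended.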

score 0 · Answer 3 · answered Oct 09 '21 at 17:22

Yes, the string will be encoded in UTF-8. Note, though, that the standard doesn't require char8_t to be exactly 8 bits wide, only that it is capable of storing UTF-8 code units. Some unusual C++ implementation could use 16-bit characters with only 8 bits of each element used.

Also note that a single char8_t can hold only one code unit, which is enough only for ASCII characters. All other characters require multiple code units, so they need to be stored in a char8_t string/array even if they are only a single character.
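
As a quick compile-time illustration of the code unit counts (a sketch; `code_units` is a hypothetical helper that works for both the C++17 and C++20 types of u8 literals):

```cpp
#include <climits>
#include <cstddef>

// Number of code units in a string literal, not counting the terminating null.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) {
    return N - 1;
}

// A byte has at least 8 bits, but the standard allows more.
static_assert(CHAR_BIT >= 8, "guaranteed by the standard");

static_assert(code_units(u8"A") == 1, "ASCII fits in one code unit");
static_assert(code_units(u8"\u00DC") == 2, "U+00DC needs two code units");
static_assert(code_units(u8"\u20AC") == 3, "U+20AC needs three code units");
static_assert(code_units(u8"\U0001F600") == 4, "U+1F600 needs four code units");
```

So a single element of a u8 string holds only one code unit, and only for ASCII is that a whole character.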

Alan Birtles

u8 string literals are always UTF-8 encoded. Aside from that char8_t is mostly about intent. Like `ptrdiff_t` etc. – Aykhan Hagverdili Oct 09 '21 at 17:36
@AyxanHaqverdili they might store a UTF-8 encoded string but they don't have to be 8-bit – Alan Birtles Oct 09 '21 at 20:30
