
I am learning C++ using the books listed here. In particular, I learnt that

The signedness of char depends on the compiler and the target platform

This means that on one implementation/platform char might be signed and on another it might be unsigned. In other words, we cannot portably write char ch = 228; because on a system where char is signed, 228 is out of range. For example, in this demo you can see that we get a warning in Clang.
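To make this concrete, here is a minimal program (a sketch; the exact diagnostic wording is compiler-specific):

```cpp
int main() {
    // On a platform where char is signed and 8 bits wide, its range is
    // -128..127, so 228 cannot be represented. Clang warns along the
    // lines of: "implicit conversion from 'int' to 'char' changes value
    // from 228 to -28" (228 wraps around to 228 - 256 == -28).
    char ch = 228;
    (void)ch;  // suppress the unused-variable warning
    return 0;
}
```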


Then I was surprised to learn that the type of '\xe4' is char and not unsigned char. I was surprised because \xe4 corresponds to 228, which is out of range on a system where char is signed. So I expected the type of '\xe4' to be unsigned char.

Thus, my question is why the standard chose to define the type of '\xe4' as char instead of unsigned char. I mean, \xe4 is in range for unsigned char but out of range for char (on a system where char is signed). So it seems natural/intuitive to me that unsigned char should have been used as the type of '\xe4', so that it wouldn't have platform/implementation dependence.
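For concreteness, the type the standard assigns to the literal can be checked directly (a minimal C++17 program):

```cpp
#include <type_traits>

// The literal has type char on every implementation, regardless of
// whether char happens to be signed or unsigned there.
static_assert(std::is_same_v<decltype('\xe4'), char>);

int main() { return 0; }
```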


Note

Note that I am trying to make sense of what is happening here, and my current understanding might be wrong. I was curious about this, so I've asked this question to improve my understanding, as I've just started learning C++.

Note also that my question is not about whether we can portably write char ch = 228; but rather why the type of '\xe4' was chosen to be char instead of unsigned char.


Summary

Why is the type of a character literal char, even when the value of the literal falls outside the range of char? Wouldn't it make more sense to allow the type to be unsigned char where the value fits that range?

Alex
  • __Suggestion:__ You might want to start your question with the question. I suggest the following summary of your query as I understand it: *"Why is the type of a character literal `char`, even when the value of the literal falls outside the range of `char`? Wouldn't it make more sense to allow the type to be `unsigned char` when the value fits that range?"* Then you could proceed with your examples using `228` and `\xe4`. – JaMiT Sep 17 '22 at 10:55
  • @JaMiT I've added it as summary. – Alex Sep 17 '22 at 11:03
  • A summary at the end helps people verify that they understood what they spent the time to read. A summary at the beginning helps people decide if they want to spend the time reading your question. I'd suggest starting with the summary in the hope of getting more people to read the question. – JaMiT Sep 17 '22 at 11:14
  • Speculatively (although consult **D&E** to see if it is mentioned): between 1979 and 1985, when Bjarne made the decision, *character literals* were expected to hold ASCII characters, which makes values outside the 7-bit range unusual. The decision was also coupled with making character literals **char** rather than C's **int**. And **char** has platform-determined *signedness*. – Eljay Sep 17 '22 at 11:32
  • I checked **D&E**, and it discusses the decision to make *character literals* be **char** rather than **int** (as they are in C). But there is no mention of `'\xE4'` "out of range" concerns, just a more general point that there were no compatibility problems with the C and C++ difference. The reason to make *character literals* be **char** was to support overloading; also, `unsigned` had just been added to C, so it was still novel at the time. – Eljay Sep 17 '22 at 12:27
  • Ah, someone found a list of duplicates. Not sure if any really address this question, though, about why there is no promotion of character literals from `char` to `unsigned char`. It is probably the result of language evolution. There is also the possibility that it is desirable for the literal `'\xe4'` to have the same type on all systems, regardless of whether `char` is signed on those systems. Having different types strikes me as a worse incompatibility than possibly having different values. – JaMiT Sep 17 '22 at 12:28

1 Answer


By language (C++) definition: a basic character literal has type char. Link to cppreference. It can be found in the standard under [lex.ccon].

But for char, [basic.fundamental]/7 states:

Type "char" is a distinct type that has an implementation-defined choice of “signed char” or “unsigned char” as its underlying type.

Which again says that it depends on the implementation. This is also stated in other answers, e.g. Why is 'char' signed by default in C++? (it isn't).
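To make both points observable, here is a small C++17 check (a sketch; which value it prints is up to the implementation):

```cpp
#include <iostream>
#include <type_traits>

int main() {
    // Implementation-defined: prints 1 where char's underlying type is
    // signed char (e.g. x86 Linux, Windows) and 0 where it is
    // unsigned char (e.g. ARM Linux).
    std::cout << std::is_signed_v<char> << '\n';

    // Either way, char remains a distinct type from both of them.
    static_assert(!std::is_same_v<char, signed char>);
    static_assert(!std::is_same_v<char, unsigned char>);
    return 0;
}
```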

And think about what the impact would be if a character literal in the extended ASCII range (plain ASCII is 7-bit) were automatically promoted to unsigned char... that would cause all sorts of difficult issues, e.g. what should happen when you append an unsigned char to a string of plain char?
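As a sketch of that friction (the overloads `f` below are hypothetical, purely to show the mechanism): if the literal's type changed, overload resolution would silently change with it:

```cpp
#include <iostream>
#include <string>

void f(char)          { std::cout << "f(char)\n"; }
void f(unsigned char) { std::cout << "f(unsigned char)\n"; }

int main() {
    // Today '\xe4' has type char, so this is an exact match for f(char).
    // If the literal were unsigned char instead, the other overload
    // would be selected without any change to the source code.
    f('\xe4');

    // Appending to std::string (a string of char) works directly today;
    // an unsigned char literal would first have to be converted to char.
    std::string s;
    s += '\xe4';
    std::cout << s.size() << '\n';  // 1
    return 0;
}
```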

But in the end, how important is it what the underlying type is? It's not that char c = '\xe4'; is UB: it's perfectly well-defined behavior, as the character literal gets converted to that same char type, negative or not. In operations on strings it doesn't matter that much that chars can be negative, as stated in this answer. However, when sorting strings it will matter.
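A small sketch of where the signedness does show up (assuming a platform where char is signed and 8 bits wide):

```cpp
#include <cstring>
#include <iostream>

int main() {
    // Compared as plain char on a signed-char platform, '\xe4' is -28,
    // which sorts before 'a' (97)...
    char a = '\xe4';
    char b = 'a';
    std::cout << (a < b) << '\n';  // 1 on signed-char platforms

    // ...but std::strcmp is required to compare as unsigned char,
    // where '\xe4' is 228, so "\xe4" sorts after "a".
    std::cout << (std::strcmp("\xe4", "a") > 0) << '\n';  // 1
    return 0;
}
```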

JHBonarius
  • I already know that it is `char` by language definition. My question is why that is so, when it doesn't seem to make sense. I am also not asking how to make it `unsigned char`; rather, I am questioning the standard's choice itself. – Alex Sep 17 '22 at 11:17
  • @Ronald _"why is it so when it doesn't make sense"_ which renders your question primarily opinion-based. There are often many illogical things in programming language designs. – πάντα ῥεῖ Sep 17 '22 at 11:23
  • @πάνταῥεῖ No, this is not opinion-based, because if `'\xe4'` is char then `char ch = '\xe4';` would be unspecified behavior on systems where char is signed. – Alex Sep 17 '22 at 11:31
  • It's not "unspecified"; it's implementation-defined. – Cody Gray - on strike Sep 17 '22 at 11:40
  • @JHBonarius You might want to look at [Why shouldn't I assume I know who downvoted my post?](https://meta.stackoverflow.com/questions/388686/why-shouldnt-i-assume-i-know-who-downvoted-my-post) – JaMiT Sep 17 '22 at 12:33