49
#include <stdbool.h>   // needed for bool in C; in C++ bool is built in

int main()
{
    char c = 0xff;
    bool b = 0xff == c;
    // Under most C/C++ compilers' default options, b is FALSE!!!
}

Neither the C standard nor the C++ standard specifies whether char is signed or unsigned; it is implementation-defined.

Why do the C and C++ standards not explicitly define char as signed or unsigned, to avoid dangerous misuses like the code above?

Ed S.
xmllmx
  • 10
    There is no “C/C++” standard. But the question stands for both standards. – Konrad Rudolph Mar 20 '13 at 19:39
  • 1
    Generally standards leave things explicitly undefined to be flexible for implementations to do whatever they think is appropriate (or fast) for their platform. – jamesdlin Mar 20 '13 at 19:41
  • 8
    @teppic: Incorrect. `int` is always equivalent to `signed int`; `unsigned int` is a distinct type. – Keith Thompson Mar 20 '13 at 19:42
  • The reasoning is precisely in the statement in your question: "...it is implementation-defined". Specifically in **C++11 § 3.9.1p1**, "In any particular implementation, a plain char object can take on either the same values as a signed char or an unsigned char; which one is implementation-defined." In other words, they are very specific about it being implementation-dependent whether it is signed or unsigned, but it must be consistent on that implementation. It is specified this way to the advantage, and at the behest, of the implementors. – WhozCraig Mar 20 '13 at 19:45
  • @KeithThompson - I'm sure I read otherwise recently -- I'd always thought the same myself. – teppic Mar 20 '13 at 19:48
  • 3
    `// b is always FALSE!!!` <- No, it will be true on implementations where `char` is unsigned. – Daniel Fischer Mar 20 '13 at 19:49
  • @teppic: I guarantee that `int` and `signed int` are names for the same type in both C and C++. If you read otherwise, you read something that was incorrect. – Keith Thompson Mar 20 '13 at 19:50
  • 1
    @WhozCraig: Yes, but that doesn't explain *why* it's implementation-defined. – Keith Thompson Mar 20 '13 at 19:51
  • @KeithThompson - I wasn't contradicting you. I found the reference - I was skim reading this: *it is implementation-defined whether the specifier int designates the same type as signed int or the same type as unsigned int.* -- but missed the bit before that says this is for bitfields. – teppic Mar 20 '13 at 19:53
  • I don't think we can know why the standards committee decided a certain thing. Any answer is just guessing, or giving our own reasons why we would have decided that way. –  Mar 20 '13 at 19:55
  • @teppic: Yes, that's right, it's for bit fields only, I forgot about that. Which means it rarely makes sense to define a bit field as `int`. For that matter, defining a bit field as `signed int` is explicit, but rarely useful; most bit fields are unsigned. – Keith Thompson Mar 20 '13 at 19:55
  • @Sancho: Not at all. Many of the committee's decisions are well documented. – Keith Thompson Mar 20 '13 at 19:56
  • @KeithThompson Ah, I didn't know that. Thank you. Are these searchable? Where at? –  Mar 20 '13 at 19:57
  • @KeithThompson Why it *would be* implementation-dependent as opposed to standard-mandated as one or the other, is what I think you're saying. Good question. And regarding `signed int` vs `int`, the former isn't even specified as one of the five mandated integer types in C++11. Indeed only "signed char" is specifically called out. The others, “short int”, “int”, “long int”, and “long long int” are "signed" by definition as being the "signed integer types". (3.9.1p2). Now I need to scour the standard for "signed int" yuck. =P – WhozCraig Mar 20 '13 at 20:02
  • The C Programming Language says about `char` that it must be large enough to store any member of the execution character set as an integer, while also being non-negative. – teppic Mar 20 '13 at 20:14
  • @WhozCraig: `signed int` is simply another name for `int`. Look up "type specifiers" in the C++ standard. On the other hand, `signed char`, `unsigned char`, and `char` are three distinct types; `char` has the same representation as one of the other two. – Keith Thompson Mar 20 '13 at 20:27
  • @KeithThompson Found it, 7.1.6.2. Thanks for the pointer. Much appreciated. – WhozCraig Mar 20 '13 at 21:15

2 Answers

54

Historical reasons, mostly.

Expressions of type char are promoted to int in most contexts (because a lot of CPUs don't have 8-bit arithmetic operations). On some systems, sign extension is the most efficient way to do this, which argues for making plain char signed.
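
For instance, a small C program makes the difference visible (assuming two's complement, so that a signed char holding -1 has the bit pattern 0xff):

#include <stdio.h>

int main(void)
{
    signed char   sc = -1;   /* bit pattern 0xff on a two's-complement machine */
    unsigned char uc = 0xff; /* value 255 */

    /* The default argument promotions convert both to int when they are
       passed to printf: sc is sign-extended to -1, uc is zero-extended to 255. */
    printf("signed char   -1   promotes to %d\n", sc);
    printf("unsigned char 0xff promotes to %d\n", uc);
    return 0;
}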

On the other hand, the EBCDIC character set has basic characters with the high-order bit set (i.e., characters with values of 128 or greater); on EBCDIC platforms, char pretty much has to be unsigned.

The ANSI C Rationale (for the 1989 standard) doesn't have a lot to say on the subject; section 3.1.2.5 says:

Three types of char are specified: signed, plain, and unsigned. A plain char may be represented as either signed or unsigned, depending upon the implementation, as in prior practice. The type signed char was introduced to make available a one-byte signed integer type on those systems which implement plain char as unsigned. For reasons of symmetry, the keyword signed is allowed as part of the type name of other integral types.

Going back even further, an early version of the C Reference Manual from 1975 says:

A char object may be used anywhere an int may be. In all cases the char is converted to an int by propagating its sign through the upper 8 bits of the resultant integer. This is consistent with the two’s complement representation used for both characters and integers. (However, the sign-propagation feature disappears in other implementations.)

This description is more implementation-specific than what we see in later documents, but it does acknowledge that char may be either signed or unsigned. On the "other implementations" on which "the sign-propagation disappears", the promotion of a char object to int would have zero-extended the 8-bit representation, essentially treating it as an 8-bit unsigned quantity. (The language didn't yet have the signed or unsigned keyword.)

C's immediate predecessor was a language called B. B was a typeless language, so the question of char being signed or unsigned did not apply. For more information about the early history of C, see the late Dennis Ritchie's home page, now moved here.

As for what's happening in your code (applying modern C rules):

char c = 0xff;
bool b = 0xff == c;

If plain char is unsigned, then the initialization of c sets it to (char)0xff, which compares equal to 0xff in the second line. But if plain char is signed, then 0xff (an expression of type int) is converted to char -- but since 0xff exceeds CHAR_MAX (assuming CHAR_BIT==8), the result is implementation-defined. In most implementations, the result is -1. In the comparison 0xff == c, both operands are converted to int, making it equivalent to 0xff == -1, or 255 == -1, which is of course false.
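
A small test program (using only standard headers) shows which case a given implementation falls into; the exact value stored in c when plain char is signed is itself implementation-defined, as noted above:

#include <limits.h>
#include <stdbool.h>
#include <stdio.h>

int main(void)
{
    char c = 0xff;        /* implementation-defined result if plain char is signed */
    bool b = 0xff == c;   /* both operands are converted to int before comparing */

    printf("CHAR_MIN = %d, CHAR_MAX = %d\n", CHAR_MIN, CHAR_MAX);
    printf("c converts to int as %d\n", c);
    printf("0xff == c is %s\n", b ? "true" : "false");
    return 0;
}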

Another important thing to note is that unsigned char, signed char, and (plain) char are three distinct types. char has the same representation as either unsigned char or signed char; it's implementation-defined which one it is. (On the other hand, signed int and int are two names for the same type; unsigned int is a distinct type. (Except that, just to add to the frivolity, it's implementation-defined whether a bit field declared as plain int is signed or unsigned.))
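
One way to observe that these really are three distinct types is a generic selection, which may list all three as separate associations (this sketch assumes a C11 compiler with _Generic support):

#include <stdio.h>

/* _Generic can distinguish char, signed char, and unsigned char precisely
   because they are distinct types, even though plain char shares a
   representation with one of the other two. */
#define TYPE_NAME(x) _Generic((x),           \
    char:          "char",                   \
    signed char:   "signed char",            \
    unsigned char: "unsigned char",          \
    default:       "something else")

int main(void)
{
    char c = 0;
    signed char sc = 0;
    unsigned char uc = 0;

    printf("%s\n", TYPE_NAME(c));   /* prints "char" */
    printf("%s\n", TYPE_NAME(sc));  /* prints "signed char" */
    printf("%s\n", TYPE_NAME(uc));  /* prints "unsigned char" */
    return 0;
}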

Yes, it's all a bit of a mess, and I'm sure it would have been defined differently if C were being designed from scratch today. But each revision of the C language has had to avoid breaking (too much) existing code and, to a lesser extent, existing implementations.

Keith Thompson
  • 2
    Side Bar: AS/400 and OS/390 both take extreme advantage of the bit layout of their respective EBCDIC character sets for radix trees implemented in the underlying hardware. Hard to get much more implementation-defined than those platforms. – WhozCraig Mar 20 '13 at 19:49
  • 1
    How does this apply to `wchar_t`? – ipc Mar 20 '13 at 19:50
  • @ipc: It doesn't apply; `wchar_t` is a distinct type. In C, `wchar_t` is a typedef defined in `<stddef.h>`. It's an integer type, but the standard doesn't specify its signedness. In C++, it's a distinct predefined integral type with the same characteristics as one of the other integral types. – Keith Thompson Mar 20 '13 at 20:00
  • @KeithThompson: Why? `char[32|64]_t` are unsigned, having `wchar_t` implementation-defined signed does not make sense to me. – ipc Mar 20 '13 at 20:03
  • @ipc: You mean `char[16|32]_t`. Both are recent additions to both C (as typedefs in `<uchar.h>`) and C++ (as fundamental types). I agree it makes sense for character types in general to be unsigned, but when `wchar_t` was added to the language that probably wasn't as clear as it is now. `char` has implementation-defined signedness for the historical reasons I've tried to explain in my answer. When `wchar_t` was defined, the same reasons probably applied. (I don't think it was even clear that `wchar_t` would necessarily be Unicode.) – Keith Thompson Mar 21 '13 at 16:51
  • Unicode was only a work in progress in 1989 & 1990. http://www.unicode.org/history/versionone.html There were (and are) other multibyte (as well as variable byte length) encodings in the wild at the time. Given that time, it would not have made sense to say wchar_t was meant to be Unicode ahead of knowing whether the standardization effort would work. – Shannon Severance Mar 27 '13 at 17:05
  • ... And not providing a type to handle the existing implementations of multi-byte character sets. – Shannon Severance Mar 27 '13 at 17:05
0

char was originally meant to store characters, so whether it's signed or unsigned is not important for that purpose. What really matters is how to perform math on char efficiently. So, depending on the system, the compiler will choose whatever is most appropriate.

Prior to ARMv4, ARM had no native support for loading halfwords and signed bytes. To load a signed byte you had to LDRB it and then sign-extend the value (LSL it up, then ASR it back down). This is painful, so char is unsigned by default.

why unsigned types are more efficient in arm cpu?

In fact, a lot of ARM compilers still use unsigned char by default, because even though you can load a byte with sign extension on modern ARM ISAs, that instruction is still less flexible than the zero-extension version.

And most modern compilers also let you change char's signedness via a compiler option instead of relying on the default setting.
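
For example, GCC and Clang accept -fsigned-char and -funsigned-char to override the platform default, and a quick look at CHAR_MIN tells you which choice is currently in effect:

#include <limits.h>
#include <stdio.h>

int main(void)
{
    /* CHAR_MIN is 0 when plain char is unsigned and negative when it is
       signed, so it reflects the compiler's (or the flag's) choice. */
    if (CHAR_MIN < 0)
        printf("plain char is signed here\n");
    else
        printf("plain char is unsigned here\n");
    return 0;
}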

phuclv
  • So-called character types have two uses in C: to store characters, or to access raw storage. The Standard actually focuses more on the second usage, since the Standard requires implementations to honor special guarantees about character types which are often essential when accessing raw storage, but uselessly impede optimization when working with data representing actual characters. – supercat Oct 01 '18 at 16:55