
Why does C allow accessing an object using any "character type":

6.5 Expressions (C)

An object shall have its stored value accessed only by an lvalue expression that has one of the following types:

  • a character type.

but C++ only allows char and unsigned char?

3.10 Lvalues and rvalues (C++)

If a program attempts to access the stored value of an object through a glvalue of other than one of the following types the behavior is undefined:

  • a char or unsigned char type.

Another instance of signed char hatred (quoting the C++ standard):

3.9 Types (C++)

For any object (other than a base-class subobject) of trivially copyable type T, whether or not the object holds a valid value of type T, the underlying bytes making up the object can be copied into an array of char or unsigned char. If the content of the array of char or unsigned char is copied back into the object, the object shall subsequently hold its original value.

And from the C standard:

6.2.6 Representations of types (C)

Values stored in non-bit-field objects of any other object type consist of n × CHAR_BIT bits, where n is the size of an object of that type, in bytes. The value may be copied into an object of type unsigned char [n] (e.g., by memcpy); the resulting set of bytes is called the object representation of the value.
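As a minimal, hedged illustration of that guarantee (the variable names are my own; nothing below comes from either standard), copying an object's bytes into an `unsigned char` array and back reproduces the original value:

```cpp
#include <cassert>
#include <cstring>

int main() {
    double original = 3.14;                // any trivially copyable type will do
    unsigned char bytes[sizeof original];  // receives the object representation

    std::memcpy(bytes, &original, sizeof original);  // copy the bytes out
    double restored;
    std::memcpy(&restored, bytes, sizeof restored);  // copy the bytes back

    // The round trip is required to preserve the value exactly.
    assert(std::memcmp(&original, &restored, sizeof original) == 0);
}
```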

I can see many people on Stack Overflow saying that this is because unsigned char is the only character type guaranteed not to have padding bits, but C99 Section 6.2.6.2 (Integer types) says

signed char shall not have any padding bits

So what is the real reason behind this?

  • Could you provide the _location_ that those quotes come from? – Lightness Races in Orbit Jan 17 '14 at 01:22
  • Where does the "hate" come from? I don't see a question here. – Greg Hewgill Jan 17 '14 at 01:23
  • *signed char shall not have any padding bits* - You can't quote the C standard to cover C++ behaviour. – chris Jan 17 '14 at 01:24
  • Well, signed char can't do even half of what unsigned char can do. – Jan 17 '14 at 01:25
  • I think if the pointless C comparison were removed from the question it might be easier to read. – Lightness Races in Orbit Jan 17 '14 at 01:28
  • Chris, I know, but still: why does C++ forbid using signed char? And why do both C and C++ mention only unsigned char, in 6.2.6 (for C) and 3.9 (for C++)? – Jan 17 '14 at 01:33
  • @chris: the C++ standard covers padding (or lack of padding) in character types in 3.9.1: "For character types, all bits of the object representation participate in the value representation" – Michael Burr Jan 17 '14 at 01:38
  • @MichaelBurr, That's good to know, but in that case, quote C++. I can't determine that it applies to C++ without looking up the C++ equivalent. – chris Jan 17 '14 at 01:42
  • I don't think this has anything to do with the standard's language, but `signed char` is a pretty rare sight as opposed to a `char` type that happens to be signed, which is quite common. This discussion reminds me of a job interview I had a number of years ago at a large tech company where I got into a small argument with one of the interviewers about whether or not `signed char` was supported by C. He insisted it wasn't. I claim that my insistence otherwise - not the terrible performance I gave with a whiteboard coding of an XML parser - was the reason I didn't get the job. – Michael Burr Jan 17 '14 at 01:47
  • @MichaelBurr, Was the XML thing after that? Should've used regex. – chris Jan 17 '14 at 01:52
  • God, job interviews. I would totally melt if asked to write an XML parser on a whiteboard. Not least because I hate XML enough that I've barely seen a document in ten years, let alone ever bothered trying to parse it, let alone written a parser for it. Jeez! – Lightness Races in Orbit Jan 17 '14 at 01:54
  • @chris: regex?!? I can barely keep C and half of C++ in my head. – Michael Burr Jan 17 '14 at 01:54
  • Consider rephrasing the question, avoiding the inflammatory word "hate". – Keith Thompson Jan 17 '14 at 02:01
  • Nah, that is why I put it there – Jan 17 '14 at 02:02
  • @Ivan: Nah, we don't care. Remove it. – Lightness Races in Orbit Jan 17 '14 at 02:05
  • @Close vote fetish club: how is this "unclear"?! – Lightness Races in Orbit Jan 17 '14 at 02:18

3 Answers


Here's my take on the motivation:

On a non-twos-complement system, signed char will not be suitable for accessing the representation of an object. This is because either there are two possible signed char representations which have the same value (+0 and -0), or one representation that has no value (a trap representation). In either case, this prevents you from doing most meaningful things you might do with the representation of an object. For example, if you have a 16-bit unsigned integer 0x80ff, one or the other byte, as a signed char, is going to either trap or compare equal to 0.

Note that on such an implementation (non-twos-complement), plain char needs to be defined as an unsigned type for accessing the representations of objects via char to work correctly. While there's no explicit requirement, I see this as a requirement derived from other requirements in the standard.
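To make that concrete, here is a sketch of my own (the function name `bytes_equal` is not from the answer) of a byte-wise comparison over object representations; the trailing comment notes where a `signed char` version would misbehave on the machines described above:

```cpp
#include <cstddef>

// Byte-wise equality over two object representations. With unsigned
// char, every distinct bit pattern is a distinct value, so the loop is
// well behaved on any implementation.
bool bytes_equal(const void* a, const void* b, std::size_t n) {
    const unsigned char* pa = static_cast<const unsigned char*>(a);
    const unsigned char* pb = static_cast<const unsigned char*>(b);
    for (std::size_t i = 0; i < n; ++i)
        if (pa[i] != pb[i])
            return false;
    return true;
}

// Had the loop read through signed char instead: on a sign-magnitude
// machine the bytes 0x00 (+0) and 0x80 (-0) would compare equal even
// though the representations differ; on a ones'-complement machine the
// same goes for 0x00 and 0xFF; and where the extra pattern is a trap
// representation, the read itself would be undefined.
```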

R.. GitHub STOP HELPING ICE
  • The philosophy of the authors of the Standard seems to be that there's no reason to mandate that *all* implementations do something if there may be some implementations where it would impose costs while offering zero benefit. If a feature or guarantee would have huge benefits on some platforms but not on others, the lack of a mandate should not prevent implementations from supporting it on platforms where it makes sense. The idea that implementations should only support useful features and guarantees when they are mandated by the Standard seems to be largely a 21st-century invention. – supercat Aug 13 '16 at 19:35
  • @supercat have to blame all those people writing code for multiple platforms, who dare to desire meanings be consistent. – Caleth Jul 12 '18 at 09:02
  • @Caleth: If the Standard wants to help people who desire consistent meanings, it should offer some types that work that way. For example, I'd like to see "uwrapN_t" types which, on platforms that support them, must behave as unsigned types that *do not promote*. When fed to operators, they should yield their own type if possible, and force a compilation error if not. Given `uwrap16_t a=0xC001;`, the value `a*a` would be `(uwrap16_t)0x8001` whether `int` is 16 bits, 32 bits, or something else. Unfortunately, C has no such types. – supercat Jul 12 '18 at 20:13

I think what you're really asking is why signed char is disqualified from all the rules allowing type-punning to char* as a special case. To be honest, I don't know, especially since — as far as I can tell — signed char cannot have padding either:

[C++11: 3.9.1/1]: [..] A char, a signed char, and an unsigned char occupy the same amount of storage and have the same alignment requirements (3.11); that is, they have the same object representation. For character types, all bits of the object representation participate in the value representation. [..]
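A quick compile-time check of that wording (my own sketch, not part of the original answer):

```cpp
#include <type_traits>

// Same size and alignment for all three character types...
static_assert(sizeof(char) == 1 && sizeof(signed char) == 1 &&
              sizeof(unsigned char) == 1, "character types occupy one byte");
static_assert(alignof(char) == alignof(signed char) &&
              alignof(char) == alignof(unsigned char), "same alignment");

// ...yet plain char is a distinct type from the other two, regardless
// of whether it is signed or unsigned on a given platform.
static_assert(!std::is_same<char, signed char>::value, "distinct types");
static_assert(!std::is_same<char, unsigned char>::value, "distinct types");
```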

Empirical evidence suggests that it's not much more than convention:

  • char is seen as a byte of ASCII;
  • unsigned char is seen as a byte with arbitrary "binary" content; and
  • signed char is left flapping in the wind.

To me, it doesn't seem like enough of a reason to exclude it from these standard rules, but I honestly can't find any evidence to the contrary. I'm going to put it down to a mildly inexplicable oddity in the standard wording.

(It may be that we have to ask the std-discussion list about this.)

Lightness Races in Orbit
  • `char` is a byte of whatever character set is in use; the standard does not, as far as I know, express a preference for ASCII over EBCDIC. (BTW, on EBCDIC-based systems, `char` must be unsigned (assuming it's 8 bits).) As for `signed char`, it's simply a very small signed integer type, guaranteed to be able to hold values from -127 to +127. – Keith Thompson Jan 17 '14 at 01:42
  • @Keith: Well, I suppose that's enough since unsigned<->signed conversion is technically implementation-defined or whatever in one direction (I forget which) and we conventionally interpret these "binary" bytes as unsigned values. Right? – Lightness Races in Orbit Jan 17 '14 at 01:49
  • @PeterGibson: Standards do not guarantee two's complement. – Siyuan Ren Jan 17 '14 at 01:52
  • @LightnessRacesinOrbit: Conversion in either direction is well defined if the result can be represented in both types. Conversion to unsigned is well defined with wraparound (modulo TYPE_MAX + 1) semantics. Conversion to signed, when the value is out of range, yields an implementation-defined result. – Keith Thompson Jan 17 '14 at 02:00
  • @Keith: That's the rule I meant, yes. – Lightness Races in Orbit Jan 17 '14 at 02:05
  • On a non-twos-complement system, `signed char` will not be suitable for accessing the representation of an object. Such an implementation would certainly define plain `char` as an unsigned type. This is probably the motivation. – R.. GitHub STOP HELPING ICE Jan 17 '14 at 03:45
  • @R..: is that because a zero that's read through a `signed char` might not be the same as a zero that's written (i.e., a silent change from a negative zero to non-negative zero)? – Michael Burr Jan 17 '14 at 06:18
  • @MichaelBurr: It's because either there are two possible `signed char` representations which have the same value, or one representation that has no value (is a trap representation). In either case, this prevents you from doing most meaningful things you might do with the representation of an object. For example, if you have a 16-bit unsigned integer 0x80ff, one or the other byte, as a `signed char`, is going to either trap or compare equal to 0. – R.. GitHub STOP HELPING ICE Jan 17 '14 at 07:28
  • @R..: that should be posted as an answer. – Michael Burr Jan 17 '14 at 08:05

The use of a character type to inspect the representations of objects is a hack. However, it is historical, and some accommodation must be made to allow it.

Mostly, in programming languages, we want strong typing. Something that is a float should be accessed as a float and not as an int. This has a number of benefits, including reducing human errors and enabling various optimizations.

However, there are times when it is necessary to access or modify the bytes of an object. In C, this was done through character types. C++ continues that tradition, but it improves the situation slightly by eliminating the use of signed char for these purposes.
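As an illustration of that byte-level access (a sketch under my own naming, not code from the answer), aliasing an object through `unsigned char` is the route both standards permit:

```cpp
#include <cstddef>
#include <cstdio>

// Print the object representation of a float by aliasing it through
// unsigned char, the access route both standards permit.
void dump_bytes(const float& f) {
    const unsigned char* p = reinterpret_cast<const unsigned char*>(&f);
    for (std::size_t i = 0; i < sizeof f; ++i)
        std::printf("%02x ", static_cast<unsigned>(p[i]));
    std::printf("\n");
}
```

Calling `dump_bytes(1.0f)` prints the object's bytes in whatever order the platform stores them.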

Ideally, it might have been better to create a new type, say byte, and to allow byte access to object representations only through that type, leaving the regular character types for use solely as normal integers/characters. Perhaps it was thought there was too much existing code using char and unsigned char to support such a change. However, I have never seen signed char used to access the representation of an object, so excluding it was safe.

Eric Postpischil
  • This is really promising, but can you explain _why_ `signed char` is different? – Lightness Races in Orbit Jan 17 '14 at 01:50
  • Please check section **3.9 Types** for C++ and **6.2.6 Representations of types** for C. Both standards mention only unsigned char here, so C excludes signed char too. And for some reason they both exclude plain char – Jan 17 '14 at 01:51
  • @Ivan: C is not pure in its exclusion of `signed char`. 6.5 7 allows accessing the representation of an object with “a character type”. 6.5 6 allows copying an object as “an array of character type”. – Eric Postpischil Jan 17 '14 at 01:58
  • @LightnessRacesinOrbit: Not with enough certainty and references to put it into the answer at this time, but `char` was the default and so was widely used, and `unsigned char` was used because `char` was pesky in implementations where it is signed and therefore interferes with shifts and such, but there is little reason to choose to use `signed char` to fiddle with the bytes of an object representation, so not much code does that. – Eric Postpischil Jan 17 '14 at 02:31