4

First let me start off by saying that I know char, signed char and unsigned char are different types in C++. From a quick reading of the standard, it also appears that whether char is signed is implementation-defined. And to make things just a little more fun, it appears g++ decides whether a char is signed on a per-platform basis!

So anyway with that background, let me introduce a bug I've run into using this toy program:

#include <stdio.h>

int main(int argc, char* argv[])
{
    char array[512];
    int i;
    char* aptr = array + 256;

    for(i=0; i != 512; i++) {
        array[i] = 0;
    }

    aptr[0] = 0xFF;    /* -1 if plain char is signed, 255 if unsigned */
    aptr[-1] = -1;
    aptr[0xFF] = 1;
    printf("%d\n", aptr[aptr[0]]);                 /* index -1 or 255? */
    printf("%d\n", aptr[(unsigned char)aptr[0]]);  /* always index 255 */

    return 0;
}

The intended behavior is that both calls to printf should output 1. Of course, what actually happens with gcc and g++ 4.6.3 on linux/x86_64 is that the first printf outputs -1 while the second outputs 1. This is consistent with chars being signed and with g++ interpreting the negative array index of -1 (which is technically undefined behavior) sensibly.

The bug seems easy enough to fix; I just need to cast the char to unsigned as shown above. What I want to know is whether this code was ever expected to work correctly on x86 or x86_64 machines using gcc/g++. It appears it may work as intended on ARM platforms, where chars are apparently unsigned, but I would like to know whether this code has always been buggy on x86 machines using g++.

keyser
Pramod
  • When you say that `-1` output is buggy, do you mean "my code is buggy" or "compiler is buggy"? Also, does the compiler output a warning on assignment of `0xFF`? – anatolyg Jan 06 '15 at 17:25
  • FWIW, GCC provides compiler options to force `char` to have whatever signedness you like. It exists precisely to work around non-portable buggy code like this. :) – Lightness Races in Orbit Jan 06 '15 at 17:29
  • Using a negative array index is perfectly fine as long as the pointer operand is to an interior element: http://stackoverflow.com/questions/3473675/negative-array-indexes-in-c – ecatmur Jan 06 '15 at 17:30
  • @Pramod Why did you decide that "both calls to printf should output 1"? – Vlad from Moscow Jan 06 '15 at 17:36
  • @JoachimPileborg using a negative number in array indexing is allowed; both C (6.5.6p8) and C++ (5.7p5) use the language "an element offset from the original element", which is intended to permit negative offsets. See also the linked question and answers. – ecatmur Jan 06 '15 at 17:48
  • @Joachim Pileborg Please do not say something foolish. The integer literal -1 is not the same as the integer literal 0xffffffff, though they can have the same internal representation. – Vlad from Moscow Jan 06 '15 at 17:48
  • @JoachimPileborg: Why is the value 0xFFFFFFF added to the index, rather than subtracting 1 from the index or the pointer? – Thomas Matthews Jan 06 '15 at 17:49
  • Hmm... I must be getting tired or something; didn't think that through. – Some programmer dude Jan 06 '15 at 17:52
  • Presumably a better fix would be to define an array of `unsigned char`. – Keith Thompson Jan 06 '15 at 19:42

4 Answers

4

I see no undefined behavior in your program. Negative array indices are not necessarily invalid, as long as the result of adding the index to the prefix refers to a valid memory location. (A negative array index is invalid (i.e., has undefined behavior) if the prefix is the name of an array object or a pointer to the 0th element of an array object, but that's not the case here.)

In this case, aptr points to element 256 of a 512-element array, so the valid indices go from -256 to +255 (+256 yields a valid address just past the end of the array, but it can't be dereferenced). Assuming CHAR_BIT==8, any of signed char, unsigned char, or plain char has a range that's a subset of the array's valid index range.

If plain char is signed, then this:

aptr[0] = 0xFF;

will implicitly convert the int value 0xFF (255) to char, and the result of that conversion is implementation-defined -- but it will be within the range of plain char, and it will almost certainly be -1. If plain char is unsigned, then it will assign the value 255 to aptr[0]. So the behavior of the code depends on the signedness of plain char (and possibly on the implementation-defined result of a conversion of an out-of-range value to a signed type), but there is no undefined behavior.

(Converting an out-of-range value to a signed type may also, starting with C99, raise an implementation-defined signal, but I know of no implementation that actually does that. Raising a signal on a conversion of 0xFF to char would probably break existing code, so compiler writers are highly motivated not to do that.)

Keith Thompson
  • A negative array index is perfectly valid. Where did you get the notion that they are invalid? – RcnRcf Jan 06 '15 at 18:13
  • @RD445: Given `int arr[10];`, the expression `arr[-1]` has undefined behavior. But given `int *p = &arr[1];`, the expression `p[-1]` is valid; it refers to `arr[0]`. If you think I'm mistaken, can you clarify? – Keith Thompson Jan 06 '15 at 18:24
  • C has allowed out-of-bounds array access since its inception, and so does C++. – RcnRcf Jan 06 '15 at 18:28
  • @RD445: Depends on what you mean by "allows". An out of bounds array access has undefined behavior. (I used the word "invalid" as a shorthand for that.) – Keith Thompson Jan 06 '15 at 18:28
  • @RD445 an out of bounds index is illegal and causes undefined behavior. -1 is a legal index if and only if it's 'in bounds', as Keith Thompson explains. – bames53 Jan 06 '15 at 18:36
  • I have been working on embedded systems for many years, on ARM and PPC as well as x86. Many buffer/register accesses are done using out-of-bounds array access, and I have never seen it fail, nor did I expect it to, because I know how C implements array accesses. @ThomasMatthews has demonstrated the same in his answer. Based on that, negative indices are perfectly valid. – RcnRcf Jan 06 '15 at 18:37
  • @RD445 negative indexes can be valid. They are valid if and only if they are 'in bounds'. Otherwise they produce undefined behavior. No amount of experience erases the paragraph from the language specifications which defines that to be the case (e.g. see the C++ spec, clause 5.7 [expr.add] paragraph 5). If it's worked for you in the past you've simply gotten (un)lucky. – bames53 Jan 06 '15 at 18:46
  • @bames53 No, out-of-bounds indices do not produce any "undefined" behavior. You never seem to have worked on something like that, and thus you have no idea what you are talking about. And is it possible that I got lucky all the time, and all the hundreds of colleagues I have worked with also got lucky all the time? I will have a look at the standard and get back to you on that. – RcnRcf Jan 06 '15 at 18:53
  • @bames53 I read the paragraph, and I believe you are referring to its last statement: "If both the pointer operand and the result point to elements of the same array object, or one past the last element of the array object, the evaluation shall not produce an overflow; otherwise, the behavior is undefined." In that case, can you tell me what the "result" and the "evaluation" are referring to, and can you relate it to the rest of the paragraph? I mean, are you sure you understand the copied statement in its context? If you do, I would like to see it explained clearly. – RcnRcf Jan 06 '15 at 19:16
  • @RD445: One of the infinitely many results of undefined behavior is code that appears to "work". Given `int arr[10];`, evaluating `arr[-1]` has undefined behavior. Knowing how a particular C compiler implements array indexing does not change that. (If you think it has defined behavior, you should be able to explain exactly how the behavior is defined *by the C standard*.) – Keith Thompson Jan 06 '15 at 19:22
  • @bames53: Be careful with the word "illegal". The C standard doesn't use that term. Some constructs are syntax errors or constraint violations, requiring a compile-time diagnostic. Others have undefined behavior, which doesn't require a diagnostic. (A `#error` directive that survives conditional compilation requires the translation unit to be rejected; that's pretty much the only case.) – Keith Thompson Jan 06 '15 at 19:23
  • @RD445 In that paragraph, "result" refers to the pointer value produced by adding the pointer operand and the offset operand together. "evaluation" refers to the computation of that result. It might be unclear why I referred you to this paragraph on pointer addition, but it's because the behavior of array indexing is elsewhere (clause 5.2.1) defined in terms of pointer addition. – bames53 Jan 06 '15 at 19:26
  • @RD445: In my previous comment, replace "C standard" by "C++ standard", since the question is tagged C++ (though the code is also valid C). I believe the rules in this area are essentially identical in C and C++. – Keith Thompson Jan 06 '15 at 19:28
  • @KeithThompson Sure. By 'illegal' I just mean 'cannot produce a well defined execution under the spec.' I don't think it's a counter-intuitive use. – bames53 Jan 06 '15 at 19:32
1

The type of an array's elements has nothing to do with the indexes themselves (except that it determines the element size used in the underlying memory access).

For example:

signed int a[25];
unsigned int b[25];

int value = a[-1];
unsigned int u_value = b[-5];

The indexing formula for both cases is:

memory_address = starting_address_of_array
               + index * sizeof(array_type);

As far as char goes, its size is 1 regardless (by definition in the language specifications).

The usage of char in arithmetic expressions may depend on whether it is signed or unsigned.

Thomas Matthews
0

The intended-behavior is that both calls to printf should output 1

Are you sure?

The rvalue of aptr[0] is a signed char with the value -1, which is then used to index into aptr[], and thus what you get from the first printf() is -1.

The same goes for the second printf, but there the type cast ensures the value is interpreted as an unsigned char, so you end up with 255, and using that to index into aptr[] you get 1 from the second printf().

I believe your assumption about the expected behavior is incorrect.

Edit 1:

It appears this may work as intended on ARM platform where apparently chars are unsigned, but I would like know whether this code has always been buggy on x86 machines using g++?

Based on this statement it seems that you know that char on x86 is signed (contrary to what some people assumed you assumed). As such, the explanation I provided should hold, i.e., treating char as signed char on x86.

Edit 2:

Using a negative array index is perfectly fine as long as the pointer operand is to an interior element: stackoverflow.com/questions/3473675/negative-array-indexes-in-c – ecatmur

This is one of the comments on the question, by @ecatmur. It clarifies that a negative index is fine, contrary to what some people think.

RcnRcf
  • Downvoters, do me a favor: give a reason for the downvote in the comments so that others can understand your reasoning. – RcnRcf Jan 06 '15 at 18:01
  • According to the text of the question the intent is for `aptr[0]` to be a `char` with the value 255, not -1. So, yes, he's sure that the intended output is "1". – bames53 Jan 06 '15 at 18:33
  • You are assuming that `char` is signed (it is not necessarily so, and the OP certainly didn't assume that) and further that the conversion of 255 (0xFF) to (signed) `char` yields -1 (which, again, is not necessarily true). – T.C. Jan 06 '15 at 18:35
  • @bames53 you seriously need to practice :) writing some test code will also help. The bit pattern for 255 and -1 is the same. Even the Windows calculator's programmer mode can do the trick. Try switching between Hex and Dec and Word and Byte modes and see for yourself. – RcnRcf Jan 06 '15 at 18:45
  • @T.C. can you demonstrate what you have mentioned? – RcnRcf Jan 06 '15 at 18:47
  • @RD445 You're misunderstanding my point. The code expects that `char` is not signed. Yes, the OP is sure that the code intends for `char` to be unsigned, so yes he's sure that the code's intent is for `aptr[(char)255]` to behave the same as `aptr[255]` and not `aptr[-1]`. – bames53 Jan 06 '15 at 19:06
  • @RD445 as for demonstrating that a conversion of 255 to a signed `char` does not necessarily result in the value -1: The C++ spec, clause 4.7 [conv.int] paragraph 3 reads "If the destination type is signed, the value is unchanged if it can be represented in the destination type (and bit-field width); **otherwise, the value is implementation-defined**." (emp. added) – bames53 Jan 06 '15 at 19:09
  • @bames53 by demonstration I meant demonstration using actual code. – RcnRcf Jan 06 '15 at 19:20
  • @bames53 And as for the text from the standard that you copied: the text mentions type "and" width. In the case of 255, the width is fine but the type is not, and thus it is converted. So there you go: you get -1 reading/writing aptr[0]. – RcnRcf Jan 06 '15 at 19:23
  • @RD445 the width mentioned is for bit-fields, which doesn't apply here. So the text states that if 255 can be represented by a `char`, then the result of the conversion is a `char` with the value 255. In our case a `char` cannot represent the value 255 so the value which results from the type conversion is _implementation-defined_. Thus under the standard it is legal for an implementation to produce, say, the value 6 when converting the value 255 to a `char` value. This is what T.C. meant when he said the result of the conversion is not necessarily -1. – bames53 Jan 06 '15 at 19:43
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/68296/discussion-between-rd445-and-bames53). – RcnRcf Jan 06 '15 at 19:54
0

Your printf statements are the same as:

printf("%d\n", aptr[(char)255]);
printf("%d\n", aptr[(unsigned char)(char)255]);

And thus the output obviously depends on the platform's behavior for these conversions.

What I want to know is whether this code was ever expected to work correctly on an x86 or x86_64 machines using gcc/g++?

Taking 'correctly' to mean the behavior you describe, no, this should never have been expected to behave that way on a platform where char is signed.

When char is signed (and cannot represent 255) you get a value that is implementation-defined and within the representable range. For an 8-bit, two's-complement representation that means you get some value in the range [-128, 127]. That means that the only possible outputs for:

printf("%d\n", aptr[(char)255]);

are "0" and "-1" (ignoring cases where printf fails). The common implementation-defined conversion results in printing "-1".


The code is well defined but not portable between implementations that define different char signedness. Writing portable code includes not depending on char being signed or unsigned, which in turn means you should only use char values as array indices if the indices are limited to the range [0, 127].

bames53