35

The following program

#include <inttypes.h> /* PRIX64, for printf("%" PRIX64 "\n", u64) */
#include <stdio.h>    /* printf() */

int main(int argc, char *argv[])
{
  uint64_t u64 = ((unsigned char)0x80) << 24;
  printf("%" PRIX64 "\n", u64);

  /* uint64_t */ u64 = ((unsigned int)0x80) << 24;
  printf("%016" PRIX64 "\n", u64);
}

produces

FFFFFFFF80000000
0000000080000000

What is the difference between ((unsigned char)0x80) and ((unsigned int)0x80) in this context?

I guess that (unsigned char)0x80 gets promoted to (unsigned char)0xFFFFFFFFFFFFFF80 and then is bit shifted, but why does this conversion think that unsigned char is signed?

It's also interesting to note that 0x80 << 16 produces the expected result, 0x0000000000800000.

Peter Mortensen
RubenLaguna
  • Because shift promotes types to integer: http://stackoverflow.com/a/22734721/2709018 – myaut Apr 09 '15 at 12:51
  • I have come across this behaviour when compiling with all possible warnings & errors turned on: bit shifting produces signed values as result and it's a bitch to then get the signed/unsigned comparisons/assignments working. – emvee Apr 09 '15 at 12:57
  • Note: Should code run on a system with 16-bit `int/unsigned`, a shift of 24 is undefined behavior. (16-bit common in embedded systems) Better to use `u64 = ((uint32_t)0x80) << 24` or `u64 = ((uint64_t)0x80) << 24` – chux - Reinstate Monica Apr 09 '15 at 14:42
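A minimal sketch of the fix chux suggests, assuming a hosted compiler with <inttypes.h> (the program itself is illustrative, not from the original post): widening to the destination type before shifting keeps the computation out of int entirely.

 #include <inttypes.h>
 #include <stdio.h>

 int main(void)
 {
   /* Widen to the destination type *before* shifting: the shift is
      then performed in uint64_t and cannot overflow int anywhere. */
   uint64_t u64 = (uint64_t)0x80 << 24;
   printf("%016" PRIX64 "\n", u64); /* prints 0000000080000000 */
   return 0;
 }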

5 Answers

30

The C compiler performs the integer promotions before evaluating the shift.

Section 6.3.1.1 of the standard says:

If an int can represent all values of the original type, the value is converted to an int; otherwise, it is converted to an unsigned int. These are called the integer promotions.

Since all values of unsigned char can be represented by int, (unsigned char)0x80 gets promoted to a signed int. The same is not true of unsigned int: some of its values cannot be represented as an int, so it remains unsigned int after the integer promotions are applied.
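One way to see the promotion directly is with sizeof, since sizeof reports the type of its (unevaluated) operand; a small sketch, assuming a typical platform where int is 4 bytes:

 #include <stdio.h>

 int main(void)
 {
   unsigned char c = 0x80;
   /* c itself is 1 byte, but in any arithmetic expression it is
      promoted to int, because int can represent every unsigned
      char value. An unsigned int literal, by contrast, stays
      unsigned int. */
   printf("%zu\n", sizeof c);        /* 1 */
   printf("%zu\n", sizeof (c << 0)); /* sizeof(int), typically 4 */
   printf("%zu\n", sizeof 0x80u);    /* sizeof(unsigned int) */
   return 0;
 }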

Sergey Kalinichenko
  • @SergeBallesta It's clear from the title that OP knows about sign extension, so I decided not to re-explain it to him. – Sergey Kalinichenko Apr 09 '15 at 13:03
  • @SergeBallesta `(unsigned char) 0x80` is not converted to `0xFFFFFF80`. It is `((unsigned char)0x80) << 24` that yields an `int` with value `(int) 0x80000000` and then when converted to `uint64_t` the sign extension occurs. – ouah Apr 09 '15 at 13:12
  • @ouah Thanks for that explanation, I was confused by this as well. – mshildt Apr 09 '15 at 13:14
  • Note, the `(unsigned char)0x80` can be a bit of a red herring, because it also happens with `((unsigned char)0x40) << 25`. – mwfearnley Apr 09 '15 at 13:57
23

The left operand of the << operator undergoes integer promotion.

(C99, 6.5.7p3) "The integer promotions are performed on each of the operands."

It means this expression:

 ((unsigned char)0x80) << 24

is equivalent to:

 ((int) (unsigned char)0x80) << 24

equivalent to:

  0x80 << 24

which sets the sign bit of an int on a system with 32-bit int. Then, when 0x80 << 24 is converted to uint64_t in the declaration of u64, sign extension occurs, yielding the value 0xFFFFFFFF80000000.

EDIT:

Note that, as Matt McNabb correctly added in the comments, 0x80 << 24 technically invokes undefined behavior in C, because the result is not representable in the type of the << left operand (int). If you are using gcc, current versions of the compiler document that they do not treat this particular operation as undefined.
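If a C11 compiler is available, _Generic makes the promotion in the chain above visible without evaluating anything (the controlling expression of _Generic is unevaluated, so the undefined shift never actually runs); this is a sketch, not part of the original answer:

 #include <stdio.h>

 #define TYPENAME(x) _Generic((x),        \
     unsigned char: "unsigned char",      \
     int:           "int",                \
     unsigned int:  "unsigned int",       \
     default:       "other")

 int main(void)
 {
   unsigned char c = 0x80;
   printf("%s\n", TYPENAME(c));       /* unsigned char */
   printf("%s\n", TYPENAME(+c));      /* int: unary + applies the promotions */
   printf("%s\n", TYPENAME(c << 24)); /* int: the left operand was promoted */
   return 0;
 }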

ouah
6

The strange part of the conversion happens when the result of << is converted from a 32-bit int to uint64_t. You are working on a system where int is 32 bits wide. The following code:

 u64 = ((int) 0x80) << 24;
 printf("%" PRIX64 "\n", u64);

prints:

 FFFFFFFF80000000

because (0x80 << 24) gives 0x80000000, which is the 32-bit representation of -2147483648. This number is converted to 64 bits by replicating the sign bit (sign extension), which gives 0xFFFFFFFF80000000.
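The conversion step can be isolated without going through the problematic shift by starting from the negative 32-bit value directly; a small sketch (the use of INT32_MIN here is illustrative):

 #include <inttypes.h>
 #include <stdint.h>
 #include <stdio.h>

 int main(void)
 {
   int32_t i = INT32_MIN;      /* bit pattern 0x80000000, value -2147483648 */
   uint64_t u = (uint64_t)i;   /* conversion adds 2^64: 0xFFFFFFFF80000000 */
   printf("%" PRIX64 "\n", u); /* FFFFFFFF80000000 */
   return 0;
 }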

Marian
6

What you're witnessing is undefined behavior. C99 §6.5.7/4 describes shifting left like this:

The result of E1 << E2 is E1 left-shifted E2 bit positions; vacated bits are filled with zeros. If E1 has an unsigned type, the value of the result is E1 × 2^E2, reduced modulo one more than the maximum value representable in the result type. If E1 has a signed type and nonnegative value, and E1 × 2^E2 is representable in the result type, then that is the resulting value; otherwise, the behavior is undefined.

In your case, E1 has the value 128, and its type is int, not unsigned char. As other answers have mentioned, the value gets promoted to int prior to evaluation. The operands involved are signed int, and the value of 128 shifted left 24 places is 2147483648, which is one more than the maximum value representable by int on your system. Therefore, the behavior of your program is undefined.

To avoid this, you could make sure the type of E1 is unsigned int by type-casting to that instead of to unsigned char.
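A minimal sketch of that cast, assuming 32-bit int/unsigned int as on the question's system (see chux's caveat below about 16-bit platforms):

 #include <inttypes.h>
 #include <stdio.h>

 int main(void)
 {
   /* The left operand is now unsigned int, so the shift is well
      defined; 0x80000000u is nonnegative, and converting it to
      uint64_t zero-extends rather than sign-extends. */
   uint64_t u64 = (unsigned int)0x80 << 24;
   printf("%016" PRIX64 "\n", u64); /* prints 0000000080000000 */
   return 0;
 }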

Rob Kennedy
  • Note: Casting `0x80` to `unsigned` helps but only because `unsigned` is 32-bit on OP's system and the shift of the 8-bit value is 24. To avoid other UB, should the shift become 25 or constant become 0x100, or run on a 16-bit system, might as well cast to the target type of `uint64_t`, then shift. Using `unsigned` like `0x80u` rather than `signed` constants helps too. – chux - Reinstate Monica Apr 09 '15 at 21:58
4

One major difficulty with the evolution of the C standard is that by the time efforts were made to standardize the language, there were not only implementations that did certain things differently from each other, but also a significant body of code written for those implementations which relied upon those behavioral differences. Because the creators of the C standard wanted to avoid forbidding implementations from behaving in ways their users might rely upon, certain parts of the C standard are a real mess. Some of the worst parts concern integer promotion, such as the behavior you've observed.

Conceptually, it would seem to make more sense for unsigned char to promote to unsigned int than to signed int, at least when used as anything other than the right-hand operand of the - operator. Combinations of other operators may yield large results, but there's no way any operator other than - could yield a negative result. To see why signed int was chosen even though the result can't be negative, consider the following:

int i1;
unsigned char b1, b2;
unsigned int u1;
long l1, l2, l3;

l1 = i1 + u1;        /* i1 is converted to unsigned int; the addition is unsigned */
l2 = i1 + b1;        /* b1 promotes to int; the addition is signed */
l3 = i1 + (b1 + b2); /* b1 + b2 promotes to int, so the whole sum is signed */

There's no mechanism in C by which an operation between two different types can yield a type which isn't one of the originals, so the first statement must perform the addition as either signed or unsigned; unsigned generally yields slightly less surprising results, especially given that integer literals are signed by default (it would be very weird if adding 1, rather than 1u, to an unsigned value could make it negative). It would be surprising, however, if the third statement could turn a negative value of i1 into a large unsigned number. Having the first statement yield an unsigned result but the third statement yield a signed result implies that (b1+b2) must be signed.
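A small sketch of the consequence (illustrative, not from the original answer): because both operands promote to int, the sum of two unsigned chars does not wrap at 256.

 #include <stdio.h>

 int main(void)
 {
   unsigned char b1 = 0x80, b2 = 0x80;
   /* b1 and b2 each promote to int before the addition, so the
      result is 256, not (0x80 + 0x80) % 256 == 0. */
   printf("%d\n", b1 + b2); /* 256 */
   return 0;
 }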

IMHO, the "right" way to resolve signedness-related issues would be to define separate numeric types with documented "wrapping" behavior (like present unsigned types), versus types that should behave as whole numbers, and have the two kinds of types exhibit different promotion rules. Implementations would have to keep supporting existing behavior for code using existing types, but new types could implement rules designed to favor usability over compatibility.

supercat
  • Older versions of the C language didn't actually require binary machines. There was some strange verbiage in the areas about overflow that only made sense if you remembered there were decimal machines in the early days. With C99 that verbiage is gone and unsigned wraps. – Joshua Apr 09 '15 at 19:50
  • @Joshua: In the early days of C, the behavior of unsigned values was driven largely by "have the compiler do whatever's cheapest"; in the early days, that behavior conveniently happened to be useful. Nowadays, wrapping behavior is often *not* the cheapest. On many ARM platforms, for example, given `uint16_t x;`, the statement `x++;` may require three instructions if `x` is stored in a register, but if wrapping behavior were not required it would only take one. – supercat Apr 09 '15 at 20:15
  • @Joshua: BTW, I know base-10 machines existed, but when was C ever run on them? What did bitwise operators do? I can't think of any way they could behave on a decimal system which would maintain the expected relationship among `&`, `|`, and `^` and also guarantee that for any representable values `x` and `y`, `x ^ y` would also be representable. – supercat Apr 09 '15 at 20:18
  • I don't know if C actually ran on them. If it did, bitwise operations were expensive. – Joshua Apr 09 '15 at 20:21
  • @Joshua: The only way I can see a standards-compliant C implementation running on a decimal machine would be if storage locations were aggregated into groups that were restricted to holding powers of two (e.g. one could define each group of ten decimal digits as an `int` that could hold values from 0 to 8589934591). Unfortunately, the relations between powers of two and powers of ten work out somewhat awkwardly; one could say that each such group could be interpreted as three `char` values 0 to 2047, but one couldn't interpret the digits in groups of five... – supercat Apr 09 '15 at 21:07
  • ...unless the maximum value that could be represented by ten decimal digits' worth of storage was reduced from 8589934591 to 4294967295 (in which case the groups of five would hold values 0 to 65535). – supercat Apr 09 '15 at 21:08