How can I read a signed integer from a buffer of uint8_t without invoking un- or implementation-defined behaviour?

Question

Here's a simple function that tries to do read a generic twos-complement integer from a big-endian buffer, where we'll assume std::is_signed_v<INT_T>:

template<typename INT_T>
INT_T read_big_endian(uint8_t const *data) {
    INT_T result = 0;
    for (size_t i = 0; i < sizeof(INT_T); i++) {
        result <<= 8;
        result |= *data;
        data++;
    }
    return result;
}

Unfortunately, this is undefined behaviour, as the last <<= shifts into the sign bit.

So now we try the following:

template<typename INT_T>
INT_T read_big_endian(uint8_t const *data) {
    std::make_unsigned_t<INT_T> result = 0;
    for (size_t i = 0; i < sizeof(INT_T); i++) {
        result <<= 8;
        result |= *data;
        data++;
    }
    return static_cast<INT_T>(result);
}

But we're now invoking implementation-defined behaviour in the static_cast, converting from unsigned to signed.

How can I do this while staying in the "well-defined" realm?

Take the first function, strip and save the sign bit when `i==0` and apply it later. — Mark Ransom, Oct 12 '17 at 02:37
You must wait to C++20 http://en.cppreference.com/w/cpp/types/endian — Mateusz Drost, Oct 12 '17 at 02:46
@MateuszDrost: My question isn't about endianness, it's about signed number representation — Eric, Oct 12 '17 at 05:29
@MateuszDrost Even if it were, there are workarounds that are well-defined (if you at least assume the implementation is either big or little endian, which that proposal does). — Daniel H, Oct 12 '17 at 15:37
[this question](https://stackoverflow.com/q/13150449/102441) also seems relevant — Eric, Oct 12 '17 at 23:51

score 3 · Answer 1 · answered Oct 12 '17 at 15:37

3

Start by assembling bytes into an unsigned value. Unless you need to assemble groups of 9 or more octets, a conforming C99 implementation is guaranteed to have such a type that is large enough to hold them all (a C89 implementation would be guaranteed to have an unsigned type large enough to hold at least four).

In most cases, where you want to convert a sequence of octets to a number, you'll know how many octets you're expecting. If data is encoded as 4 bytes, you should use four bytes regardless of the sizes of int and long (a portable function should return type long).

unsigned long octets_to_unsigned32_little_endian(unsigned char *p)
{
  return p[0] | 
    ((unsigned)p[1]<<8) |
    ((unsigned long)p[2]<<16) |
    ((unsigned long)p[3]<<24);
}
long octets_to_signed32_little_endian(unsigned char *p)
{
  unsigned long as_unsigned = octets_to_unsigned32_little_endian(p);
  if (as_unsigned < 0x80000000)
    return as_unsigned;
  else
    return (long)(as_unsigned^0x80000000UL)-0x40000000L-0x40000000L;
}

Note that the subtraction is done as two parts, each within the range of a signed long, to allow for the possibility of systems where LNG_MIN is -2147483647. Attempting to convert byte sequence {0,0,0,0x80} on such a system may yield Undefined Behavior [since it would compute the value -2147483648] but the code should process in fully portable fashion all values which would be within the range of "long".

answered Oct 12 '17 at 15:37

supercat

77,689
9
166
211

Any chance you could template this in terms of the integer size, for completeness? – Eric Oct 13 '17 at 02:34
@MateuszDrost: If the data is stored as octets, the necessary constant values will be independent of a system's particular integer sizes; the Standard guarantees that `unsigned long` will be large enough, and would guarantee the same for `long` except in the aforementioned scenario were `LNG_MIN` is -2147483647. It's probably more practical to use `long` unconditionally than to use `long long` unconditionally (which would almost always be wasteful) or make the choice of type conditional upon `LNG_MIN`. – supercat Oct 13 '17 at 17:05
@supercat: you assumed that sign is stored in one bit (and not for example entire one byte) so you hardcoded first unsigned integer unrepresentable in signed integer as `0x80000000` but you should use `static_cast(std::numeric_limits::max())+1`. You entered to implementation-defined world :P – Mateusz Drost Oct 13 '17 at 17:14
1

@MateuszDrost: The goal is to convert a sequence of octets which are in a known format *that has nothing to do with the C implementation* into a number. Even if one is using a 36-bit ones'-complement machine, the octet sequence [0,0,0,0x80] should be decoded as -2147483648. The machine might store such a value internally as [0x1DF,0x1FF,0x1FF,0x1FF] but code wouldn't care about that. What matters is how the thing that generated the sequence of octets expected them to be interpreted. – supercat Oct 13 '17 at 17:27
Thanks for the precise summary in that comment @supercat. Any reason for a choice of `long` over `int32_t`? – Eric Oct 13 '17 at 19:57
@Eric: Type `int32_t` doesn't exist on all implementations. Type `int_least32_t` is guaranteed to exist in C99, but `long` is guaranteed to be at least 32 bits, and is unlikely to be more expensive to process than the shortest type that's at least 32 bits, I think it's better to avoid using more different names for what may as well be the same type. – supercat Oct 13 '17 at 20:43

maxschlepzig · Answer 2 · 2020-03-16T20:22:38.557

Unfortunately, this is undefined behaviour, as the last <<= shifts into the sign bit.

Actually, in C++17, left-shifting a signed integer that has a negative value is undefined behavior. Left-shifting a signed integer that has a positive value into the sign bit is implementation defined behavior. See also:

² The value of E1 << E2 is E1 left-shifted E2 bit positions; vacated bits are zero-filled. If E1 has an unsigned type, the value of the result is E1 × 2**E2, reduced modulo one more than the maximum value representable in the result type. Otherwise, if E1 has a signed type and non-negative value, and E1 × 2**E2 is representable in the corresponding unsigned type of the result type, then that value, converted to the result type, is the resulting value; otherwise, the behavior is undefined.

_{(C++17 final working draft, Section 8.8 Shift operators [expr.shift], Paragraph 2, page 132 - emphasis mine)}

With C++20, shifting into the sign bit changed from implementation defined to defined behavior:

² The value of E1 << E2 is the unique value congruent to E1 × 2**E2 modulo 2**N, where N is the width of the type of the result. [Note: E1 is left-shifted E2 bit positions; vacated bits are zero-filled. — end note]

_{(C++20 latest working draft, Section 7.6.7 Shift operators [expr.shift], Paragraph 2, page 129)}

Example:

int i = 2147483647;  // here: 2**31-1 == INT_MAX, sizeof(int) = 32
int j = i << 1;      // i.e. -2

Assertion: -2 is the unique value that is congruent to 2147483647 * 2 % 2**32

Check:

        a ≡ b (mod n)      | i.e. there exists an integer k:
<=> a - b = k * n
 => -2 - 2147483647 * 2 = k * 2**32
<=> -4294967296 = k * 2**32
<=> k = -1                 | i.e. there is an integer!

The value -2 is unique because there is no other value in the domain [INT_MIN .. INT_MAX] that satisfies this congruence relation.

This is a consequence of C++20 mandating two's complement representation of signed integer types:

³ [..] For each value x of a signed integer type, the value of the corresponding unsigned integer type congruent to x modulo 2 N has the same value of corresponding bits in its value representation. ^{41) This is also known as two’s complement representation.} [..]

_{(C++20 latest working draft, Section 6.8.1 Fundamental types [basic.fundamental], Paragraph 3, page 66)}

This means that with C++20, your original example invokes defined behavior, as-is.

Additional note: not that this proves anything, but the GCC/Clang undefined behavior sanitizer (invoked with -fsanitize=undefined) only triggers when compiling this example for std <= C++17 and then only complains about the shifting of the negative value (both as expected):

#include <stdio.h>
#include <limits.h>

int main(int argc, char **argv)
{
    int i = INT_MAX - 1 + argc;
    int j = i << 1;
    int k = j << 1;

    printf("%d %d %d\n", i, j, k);

    return 0;
}

Example session (on Fedora 31):

$ g++ -std=c++17 -Wall -Og sign.cc -o sign -fsanitize=undefined
$ ./sign                                                       
sign.cc:8:15: runtime error: left shift of negative value -2
2147483647 -2 -4
$ g++ -std=c++2a -Wall -Og sign.cc -o sign -fsanitize=undefined 
$ ./sign
2147483647 -2 -4

BrockLee · Answer 3 · 2020-04-29T17:15:27.567

To propose an alternative solution, the best way to copy bits and avoid UB is through memcpy:

template<typename INT_T>
INT_T read_big_endian(uint8_t const *data) {
    std::make_unsigned_t<INT_T> tmp = 0;
    for (size_t i = 0; i < sizeof(INT_T); i++) {
        tmp <<= 8;
        tmp |= *data;
        data++;
    }
    INT_T result;
    memcpy(&result, &tmp, sizeof(tmp));
    return result;
}

With this you won't get UB from casting an unsigned to signed type, and with optomizations, this compiles to the exact same assembly as your examples.

#include <cstdint>
#include <cstring>
#include <type_traits>

template<typename INT_T>
INT_T read_big_endian(uint8_t const *data) {
    std::make_unsigned_t<INT_T> tmp = 0;
    for (std::size_t i = 0; i < sizeof(INT_T); i++) {
        tmp <<= 8;
        tmp |= *data;
        data++;
    }   
    return static_cast<INT_T>(tmp);
}

template<typename INT_T>
INT_T read_big_endian2(uint8_t const *data) {
    std::make_unsigned_t<INT_T> tmp = 0;
    for (std::size_t i = 0; i < sizeof(INT_T); i++) {
        tmp <<= 8;
        tmp |= *data;
        data++;
    }   
    INT_T res;
    memcpy(&res, &tmp, sizeof(res));
    return res;
}

// Just to manifest the template expansions.
auto read32_1(uint8_t const *data) {
  return read_big_endian<int32_t>(data);
}
auto read32_2(uint8_t const *data) {
  return read_big_endian2<int32_t>(data);
}
auto read64_1(uint8_t const *data) {
  return read_big_endian<int64_t>(data);
}
auto read64_2(uint8_t const *data) {
  return read_big_endian2<int64_t>(data);
}

Compiles with clang++ /tmp/test.cpp -std=c++17 -c -O3 to:

_Z8read32_1PKh:  # read32_1
        movl    (%rdi), %eax
        bswapl  %eax
        retq

_Z8read32_2PKh:  # read32_2
        movl    (%rdi), %eax
        bswapl  %eax
        retq

_Z8read64_1PKh:  # read64_1
        movzbl  (%rdi), %eax
        shlq    $8, %rax
        movzbl  1(%rdi), %ecx
        orq     %rax, %rcx
        shlq    $8, %rcx
        movzbl  2(%rdi), %eax
        orq     %rcx, %rax
        shlq    $8, %rax
        movzbl  3(%rdi), %ecx
        orq     %rax, %rcx
        shlq    $8, %rcx
        movzbl  4(%rdi), %eax
        orq     %rcx, %rax
        shlq    $8, %rax
        movzbl  5(%rdi), %ecx
        orq     %rax, %rcx
        shlq    $8, %rcx
        movzbl  6(%rdi), %edx
        orq     %rcx, %rdx
        shlq    $8, %rdx
        movzbl  7(%rdi), %eax
        orq     %rdx, %rax
        retq

_Z8read64_2PKh:  # read64_2
        movzbl  (%rdi), %eax
        shlq    $8, %rax
        movzbl  1(%rdi), %ecx
        orq     %rax, %rcx
        shlq    $8, %rcx
        movzbl  2(%rdi), %eax
        orq     %rcx, %rax
        shlq    $8, %rax
        movzbl  3(%rdi), %ecx
        orq     %rax, %rcx
        shlq    $8, %rcx
        movzbl  4(%rdi), %eax
        orq     %rcx, %rax
        shlq    $8, %rax
        movzbl  5(%rdi), %ecx
        orq     %rax, %rcx
        shlq    $8, %rcx
        movzbl  6(%rdi), %edx
        orq     %rcx, %rdx
        shlq    $8, %rdx
        movzbl  7(%rdi), %eax
        orq     %rdx, %rax
        retq

on x86_64-linux-gnu with clang++ v8.

Most of the time, memcpy with optimizations will compile to the exact same assembly as what you intend, but with the added benefit of no UB.

Updating for corectness: The OP correctly notes that this would still be invalid since signed int representations do not need to be two's complement (at least until C++20) and this would be implementation-defined behavior.

AFAICT, up until C++20, there doesn't actually seem to be a neat C++ way of performing bit-level operations on ints without actually knowing the bit representation of a signed int, which is implementation-defined. That being said, as long as you know your compiler will represent a C++ integral type as two's complement, then both using memcpy or the static_cast in the OP's second example should work.

Part of the major reason C++20 is exclusively representing signed ints as two's complement is because most existing compilers already represent them as two's complement. Both GCC and LLVM (and thus Clang) already internally use two's complement.

This doesn't seem entirely portable (and it's understandable if this isn't the best answer), but I would imagine that you know what compiler you'll be building your code with, so you can technically wrap this or your second example with checks to see you're using an appropriate compiler.

This invokes implementation-defined behavior, because the in-memory representation of signed integers is not required to be twos-complement (until C++17). — Eric, Apr 29 '20 at 10:29
@Eric You're right. I updated my answer to at least be not wrong. — BrockLee, Apr 29 '20 at 17:16

How can I read a signed integer from a buffer of uint8_t without invoking un- or implementation-defined behaviour?

3 Answers3

Linked