What is the recommended way to convert eight bytes of a uint8_t array, starting at offset i, into a uint64_t, and why?

uint8_t * bytes = ...
uint64_t const v = ((uint64_t *)(bytes + i))[0];

or

uint64_t const v = ((uint64_t)(bytes[i+7]) << 56)
                 | ((uint64_t)(bytes[i+6]) << 48)
                 | ((uint64_t)(bytes[i+5]) << 40)
                 | ((uint64_t)(bytes[i+4]) << 32)
                 | ((uint64_t)(bytes[i+3]) << 24)
                 | ((uint64_t)(bytes[i+2]) << 16)
                 | ((uint64_t)(bytes[i+1]) << 8)
                 | ((uint64_t)(bytes[i]));
Lundin
err69
  • With `((uint64_t*)(bytes + i))[0]` the byte-order depends on the endianness of the underlying system. With the other way you can explicitly select the wanted byte-order. – Some programmer dude Sep 23 '22 at 10:23
  • @Someprogrammerdude I forgot to mention that both `bytes` and the system are little-endian. Besides that, are there any other differences? – err69 Sep 23 '22 at 10:35
  • In addition, some architectures may have strict requirements about the alignment of `uint64_t` pointers, so reinterpreting a `uint8_t` pointer may raise an exception (some ARM architectures, for example). – Jack Sep 23 '22 at 10:56
  • Convert your `whatever *` to `intptr_t` (optional type since C99) _first_ and then to `uint64_t`. – Neil Sep 23 '22 at 16:22

3 Answers

There are two primary differences.

One, the behavior of ((uint64_t *)(bytes + i))[0] is not defined by the C standard (unless certain prerequisites about what bytes points to are met). Generally, an array of bytes should not be accessed using a uint64_t type.

When memory defined as one type is accessed with another type, it is called aliasing, and the C standard only defines certain combinations of aliasing. Some compilers may support some aliasing beyond what the standard requires, but using it is not portable. Additionally, if bytes + i is not suitably aligned for a uint64_t, the access may cause an exception or otherwise malfunction.

Two, loading the bytes through aliasing, if it is defined (by the standard or by compiler extension), interprets the bytes using the byte order (endianness) of the C implementation. Some C implementations store the bytes of an integer with the least significant byte at the lowest address (little-endian), and some store the most significant byte at the lowest address (big-endian). (Mixed, non-consecutive orders exist too, although they are rare.) So loading the bytes this way will produce different values from the same bytes in memory, depending on which order the C implementation uses.

But loading the bytes and using shifts to combine them will always produce the same value from the same bytes in memory regardless of what order the C implementation uses.

The first method should be avoided, because there is no need for it. If one desires to interpret the bytes using the C implementation’s ordering, this can be done with:

uint64_t t;
memcpy(&t, bytes+i, sizeof t);
const uint64_t v = t;

Using memcpy provides a portable way of aliasing the uint64_t to store bytes into it. Good compilers recognize this idiom and will optimize the memcpy to a load from memory, if suitable for the target architecture (and if optimization is enabled).

If one desires to interpret the bytes using little-endian ordering, as shown in the code in the question, then the second method may be used. (Sometimes platforms will have routines that may provide more efficient code for this.)
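As a minimal sketch of the two approaches recommended above, the helpers below wrap the memcpy load and the shift-based little-endian load in functions. The names load_u64_native and load_u64_le are made up for illustration and do not come from any library.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read 8 bytes in the host's own byte order.
   memcpy has no alignment or aliasing requirements on its arguments. */
static uint64_t load_u64_native(const uint8_t *p)
{
    uint64_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Hypothetical helper: read 8 bytes as little-endian on any host.
   p[7] is consumed first and ends up in the most significant byte. */
static uint64_t load_u64_le(const uint8_t *p)
{
    uint64_t v = 0;
    for (int k = 7; k >= 0; k--)
        v = (v << 8) | p[k];   /* the shift is done in uint64_t */
    return v;
}
```

On a little-endian host both functions return the same value for the same bytes. On a big-endian host, load_u64_native returns the bytes in the opposite order of significance, while load_u64_le is unchanged.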

Eric Postpischil

You can also use memcpy

uint64_t n;
memcpy(&n, bytes + i, sizeof(uint64_t));
const uint64_t v = n;
Martin Perry
  • The difference between memcpy and shifts is that with memcpy, the byte order of the byte array must correspond directly to the CPU endianness. With shifts, you only need to care about what order the byte array should correspond to. So if the byte array is, for example, an input buffer from some serial bus, you only need to care about the network endianness when writing the shifts. With memcpy you need to know both the CPU and network endianness, and they must also match. – Lundin Sep 23 '22 at 11:41

The first option has two big problems that qualify as undefined behavior (anything can happen):

  • A uint8_t* or array of uint8_t is not necessarily aligned the same way as required by a larger type like uint64_t. Simply casting to uint64_t* leads to misaligned access. This can cause hardware exceptions, program crashes, slower code etc, all depending on the alignment requirements of the specific target.

  • It violates the internal type system of C, where each object in memory known by the compiler has an "effective type" that the compiler keeps track of. Based on this, the compiler is allowed to make certain assumptions during optimization about whether a certain memory region has been accessed or not. If your code violates these type rules, as it would in this case, wrong machine code could get generated.

    This is most commonly referred to as the strict aliasing rule and your cast followed by dereferencing would be a so-called "strict aliasing violation".
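To make the alignment point concrete, here is a small sketch that tests whether a pointer is suitably aligned for a uint64_t. The helper name is_aligned_u64 is hypothetical, and the check assumes that converting a pointer to uintptr_t yields its numeric address, which is implementation-defined although it holds on mainstream platforms. Passing this check does not make the cast in the first option valid, since the strict aliasing problem remains.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true if p is suitably aligned for a uint64_t.
   Assumption: the uintptr_t value of a pointer reflects its address. */
static bool is_aligned_u64(const void *p)
{
    return (uintptr_t)p % _Alignof(uint64_t) == 0;   /* _Alignof is C11 */
}
```

For an arbitrary offset i into a uint8_t array, bytes + i will usually fail this check, which is why a plain cast to uint64_t* can fault on strict-alignment targets.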

The second option is sound code, because:

  • When doing shifts or other forms of bitwise arithmetic, a large integer type should be used. That is, unsigned int or larger - depending on system. Using signed types or small integer types can lead to undefined behavior or unexpected results. See Implicit type promotion rules regarding problems with small integer types implicitly changing signedness in some expressions.

    Without the cast to uint64_t, the left operand of a shift such as bytes[i+7] << 56 would be implicitly promoted from uint8_t to int, which would be a bug: if the most significant bit (MSB) of the byte is set and we shift into or beyond the sign bit, we invoke undefined behavior. Again, anything can happen.

    And naturally we need to use a 64 bit type in this specific case or otherwise we wouldn't be able to shift as far as 56 bits. Shifting beyond the range of the type of the left operand is also undefined behavior.
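The promotion described above can be observed with C11 _Generic. The sketch below is only an illustration: the macro IS_INT and the helper byte_to_top32 are hypothetical names, and the comments assume a platform where int is 32 bits wide or narrower.

```c
#include <stdint.h>

/* Evaluates to 1 if the expression has type int, 0 otherwise (C11). */
#define IS_INT(expr) _Generic((expr), int: 1, default: 0)

/* Hypothetical helper: move a byte into bits 24..31.
   The cast makes the shift happen in uint32_t, so a set top bit
   never lands in the sign bit of a signed int. */
static uint32_t byte_to_top32(uint8_t b)
{
    return (uint32_t)b << 24;
}
```

With uint8_t b, IS_INT(b << 1) evaluates to 1 because b is promoted to int before the shift, whereas IS_INT((uint32_t)b << 1) evaluates to 0 under the stated assumption about the width of int.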

Note that whether to use bytes[i+7] << 56 or the alternative bytes[i+0] << 56 depends on the byte order of the data in the array, not on the endianness of the CPU. Bit shifts are nice because the shift itself works the same whether the destination type is stored in big or little endian. But you must know in advance which byte in the source array should become the most significant. The code you have here will work if the array was built using little endian formatting, since the last of the eight bytes is shifted into the most significant position.
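For data written with the most significant byte first (network byte order, for example), the shift amounts are simply mirrored. This is a sketch of that alternative, with load_u64_be as a hypothetical name:

```c
#include <stdint.h>

/* Hypothetical helper: the first byte in memory is the most significant. */
static uint64_t load_u64_be(const uint8_t *p)
{
    return ((uint64_t)p[0] << 56) | ((uint64_t)p[1] << 48)
         | ((uint64_t)p[2] << 40) | ((uint64_t)p[3] << 32)
         | ((uint64_t)p[4] << 24) | ((uint64_t)p[5] << 16)
         | ((uint64_t)p[6] <<  8) |  (uint64_t)p[7];
}
```

For the bytes 01 02 03 04 05 06 07 08 this yields 0x0102030405060708, whereas the code in the question yields 0x0807060504030201 for the same bytes. Both results are independent of the CPU's endianness.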

As for the uint64_t const v = , the const qualifier is a bit strange to have at local scope like that. It's harmless but confusing and doesn't really add anything of value inside a local scope. I would just drop it.

Lundin
  • Re “doesn't really add anything”: Declaring an object `const` prevents errors in which it is accidentally modified because of a typo. Further, it expresses intent to the reader and has zero cost in execution time. It could even enable optimizations where the compiler can rely on it not being modified by called routines. – Eric Postpischil Sep 23 '22 at 12:21
  • @EricPostpischil Not in this case, at local scope. And the argument "write more text to prevent typos" is contradicting itself. No strange tricks in the world can save someone who don't know what they are doing from themselves. It's the same muddy logic as "yoda conditions" `if(1 == a)` and other such nonsense. "I use a clever trick to prevent myself from writing a bug". Well, if you remember to always write a clever trick in every `==` statement, you might as well remember to double check that you didn't write `=` by accident. – Lundin Sep 23 '22 at 12:44
  • [Yes at local scope.](https://godbolt.org/z/b5rT918ze) When an object is defined `const`, the compiler may presume it is not changed, so the compiler eliminates the unnecessary load in the second routine. Re “someone who don't know what they are doing”: Typos occur to any human, even knowledgeable ones. Using `const` is not a “clever trick”; it is the intended purpose. – Eric Postpischil Sep 23 '22 at 22:30