Optimization of copying with an explicit byte order in arm-none-eabi-gcc

Question

In writing part of a deserializer for a data structure in C, I needed a way to read 16-bit and 32-bit integers. Given that there is a possibility that this code may be compiled for and used on an architecture that may not be little-endian, I decided to write helper functions that explicitly decode from little-endian byte order:

#include <stdint.h>

void read_16(uint8_t *data, uint16_t *value) {
    *value = data[0] | (data[1] << 8);
}

void read_32(uint8_t *data, uint32_t *value) {
    *value = data[0] | (data[1] << 8) | (data[2] << 16) | (data[3] << 24);
}

I was curious how this might be compiled on an architecture that is natively little-endian. arm-none-eabi-gcc with -mcpu=cortex-a9 and -Os gives the following output:

00000000 <read_16>:
   0:   e5d02001    ldrb    r2, [r0, #1]
   4:   e5d03000    ldrb    r3, [r0]
   8:   e1833402    orr r3, r3, r2, lsl #8
   c:   e1c130b0    strh    r3, [r1]
  10:   e12fff1e    bx  lr

00000014 <read_32>:
  14:   e5903000    ldr r3, [r0]
  18:   e5813000    str r3, [r1]
  1c:   e12fff1e    bx  lr

Question: Is there a reason why the optimizer would simplify to a load-then-store for 32-bit, but not for 16-bit, given that such an operation is valid, would be shorter and faster, and optimizations for size are enabled?

Specifically, I would expect the following assembly for read_16:

ldrh    r3, [r0]
strh    r3, [r1]
bx      lr

Actually, I am not quite sure why the `read_32` was optimized this way. What if `data` is not aligned to 4-byte ? — Eugene Sh., Jan 21 '20 at 16:58
Why do you use `uint8_t *data` for 1st parameter and not the same type as output? There are aliasing issues to handle with for the assembler. — Tom Kuschel, Jan 21 '20 at 17:05
@EugeneSh. I don't think that will ever happen in a sane implementation. — Marco Bonelli, Jan 21 '20 at 17:10
This code is not correct for portability. In `data[3] << 24`, the `uint8_t` `data[3]` will be converted to `int`. If the high bit of `data[3]` is set and `int` is 32-bit, the shift will overflow, resulting in behavior not defined by the C standard. All of the shifted operands should be converted to `uint32_t` (in `read_32`) or `uint16_t` (in `read_16`) before being shifted. — Eric Postpischil, Jan 21 '20 at 17:11
@MarcoBonelli: How could the compiler know that? From the source code shown, it sees a `uint8_t *` for `data` and as no information about its alignment. — Eric Postpischil, Jan 21 '20 at 17:11
@EricPostpischil Wow, did not think of that, thanks. The compiler did though apparently - adding those casts makes it optimize to the `ldrh`/`strh` version. If you write an answer based on that I'll mark it as accepted. — Nick Mertin, Jan 21 '20 at 17:18
@NickMertin I am curious, what will happen if you add `-mno-unaligned-access` to the compiler options? https://stackoverflow.com/questions/34951458/why-does-gcc-produce-illegal-unalign-accesses-for-arm-cortex-a9 — Eugene Sh., Jan 21 '20 at 17:21
@NickMertin: That was intended to be just an incidental observation, not an answer. Interestingly, using `(uint16_t)` in `read_16` [does not result in the `ldrh` improvement in GCC 8.2](https://godbolt.org), but `(uint32_t)` does. I do not see the reason for that. It might be some compiler deficiency. — Eric Postpischil, Jan 21 '20 at 17:24
@EricPostpischil interesting. I'm running on 9.2.0 from Arch repos. — Nick Mertin, Jan 21 '20 at 17:25
@EugeneSh. Adding `-mno-unaligned-access` results in neither of them being optimized; both are assembled from `ldrb` and `orr` instructions. — Nick Mertin, Jan 21 '20 at 17:27
@NickMertin Yeah, this is what I would expect... I am not sure if Arm-9 is OK with unaligned accesses. You probably could experiment with it if you have an actual hardware. — Eugene Sh., Jan 21 '20 at 17:31
the compiler should have either done optimizations on both or neither, given alignment issues you dont want code with a uint8 input to become anything but a series of ldrb's. to avoid other language issues as discussed dont overflow a variable type on shift, force the larger type then do the shifts and orrs. And as a side effect that gives the compiler more meat to chew on when optimizing as you have seen. — old_timer, Jan 21 '20 at 18:28
@EricPostpischil question about this part: "if the high bit of `data[3]` is set and `int` is 32-but, the shift will overflow"... From my understanding, this means that some of the high 24 bits of the 32-bit value are set, witch would require `data[3]` to be sign-extended to 32-bits rather than zero-extended. I'd expect it to be zero-extended, as `uint8_t` is an unsigned value (otherwise 255 would become -1, rather than 255 in 32-bits, for example). Where am I getting it wrong here? — Nick Mertin, Jan 21 '20 at 19:55
@NickMertin: C automatically promotes narrow integer types to `int`, when they are used in expressions. Any type whose values can be represented in an `int` is converted to an `int`, even if the type is unsigned. This is specified in C 2018 6.3.1.1 2. This is not a problem for the actual value being promoted, because only values that can be represented in `int` are promoted, so the value stays the same. But then, when you shift it, you are shifting in an `int`. If you shift 24 bits, you shift a set bit 7 of the original `uint8_t` into bit 31 of an `int`, which overflows a 32-bit signed `int`. — Eric Postpischil, Jan 21 '20 at 20:46
@EricPostpischil ah, so is the cause of OB that shifting into the sign bit of a signed integer type? That would make sense (I'd assume two's complement format isn't guaranteed by the C standard). — Nick Mertin, Jan 22 '20 at 21:32
@EugeneSh. slight correction: [Cortex-A9](https://en.wikipedia.org/wiki/ARM_Cortex-A) is not the same as [ARM9](https://en.wikipedia.org/wiki/ARM9). The Cortex-A9 used by OP implements the Armv7-A architecture profile, meaning that it supports unaligned access (unlike older ARM9 processors). — vgru, Feb 02 '20 at 00:06
@NickMertin: No, Eric was referring to the `read_32` function where you shift the byte 24 times. But gcc would optimize this version [in any case](https://godbolt.org/z/ezLBfz). For the 16-bit version, is apparently gets confused. — vgru, Feb 02 '20 at 00:34

score 0 · Answer 1 · answered Feb 02 '20 at 00:27

It seems to be a missed optimization opportunity - if you introduce a temporary uint32_t variable to the function:

void read_16(uint8_t *data, uint16_t *value) {
    // to the future maintainers: 'tmp' makes this function faster
    uint32_t tmp = data[0] | (data[1] << 8);
    *value = (uint16_t)tmp;
}

You ironically get the expected version at -O2 and -O3:

ldrh    r3, [r0]        @ unaligned
strh    r3, [r1]        @ movhi
bx      lr

Optimization of copying with an explicit byte order in arm-none-eabi-gcc

1 Answers1