Given a binary file with 32-bit little-endian fields that I need to parse, I want to write parsing code that is correct independent of the endianness of the machine that executes it. Currently I use:
uint32_t fromLittleEndian(const char* data){
    return uint32_t(data[3]) << (CHAR_BIT*3) |
           uint32_t(data[2]) << (CHAR_BIT*2) |
           uint32_t(data[1]) << CHAR_BIT |
           data[0];
}
This, however, generates suboptimal assembly. On my machine, g++ -O3 -S produces:
_Z16fromLittleEndianPKc:
.LFB4:
.cfi_startproc
movsbl 3(%rdi), %eax
sall $24, %eax
movl %eax, %edx
movsbl 2(%rdi), %eax
sall $16, %eax
orl %edx, %eax
movsbl (%rdi), %edx
orl %edx, %eax
movsbl 1(%rdi), %edx
sall $8, %edx
orl %edx, %eax
ret
.cfi_endproc
Why is this happening? How can I convince the compiler to produce optimal code when compiling on little-endian machines, namely:
_Z17fromLittleEndian2PKc:
.LFB5:
.cfi_startproc
movl (%rdi), %eax
ret
.cfi_endproc
which I have gotten by compiling:
uint32_t fromLittleEndian2(const char* data){
    return *reinterpret_cast<const uint32_t*>(data);
}
Since I know my machine is little-endian, I know that the above assembly is optimal, but it will fail if compiled on a big-endian machine. It also violates strict-aliasing rules, so it is undefined behavior and may break once inlined, even on little-endian machines. Is there valid code that will be compiled to this optimal assembly where possible?
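For reference, a memcpy-based variant avoids the aliasing problem, though it still yields the value in host byte order, so by itself it only helps on little-endian targets (sketch only; fromLittleEndian3 is a placeholder name, and I have not checked that every compiler folds it into a single load, although GCC and Clang usually do at -O2 and above):

#include <cstdint>
#include <cstring>

// Sketch: copying through memcpy sidesteps the strict-aliasing violation
// and has no alignment requirement. The result is still in host byte
// order, so this alone is only correct on little-endian machines.
uint32_t fromLittleEndian3(const char* data){
    uint32_t value;
    std::memcpy(&value, data, sizeof value);
    return value;
}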
Since I expect this function to be inlined a lot, any kind of runtime endianness detection is out of the question. The only alternative to writing optimal C/C++ code is to use compile-time endianness detection, with templates or #defines to fall back to the inefficient code when the target is not little-endian (see the sketch below). That, however, seems quite difficult to do portably.
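To illustrate what that compile-time fallback might look like (only a sketch; fromLittleEndianDetect is a placeholder name, and the __BYTE_ORDER__ / __ORDER_LITTLE_ENDIAN__ macros are GCC/Clang predefines rather than standard C++, which is exactly the portability problem):

#include <cstdint>
#include <cstring>
#include <climits>

// Sketch of compile-time endianness selection. __BYTE_ORDER__ and
// __ORDER_LITTLE_ENDIAN__ are predefined by GCC and Clang but are not
// part of the standard, so this does not cover every compiler.
inline uint32_t fromLittleEndianDetect(const char* data){
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
    // Little-endian host: a plain copy already has the right byte order.
    uint32_t value;
    std::memcpy(&value, data, sizeof value);
    return value;
#else
    // Unknown or big-endian host: fall back to explicit byte shuffling.
    // Reading through unsigned char avoids sign extension of the bytes.
    const unsigned char* p = reinterpret_cast<const unsigned char*>(data);
    return uint32_t(p[3]) << (CHAR_BIT*3) |
           uint32_t(p[2]) << (CHAR_BIT*2) |
           uint32_t(p[1]) << CHAR_BIT |
           uint32_t(p[0]);
#endif
}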