8

Given a binary file with 32-bit little-endian fields that I need to parse, I want to write parsing code that compiles correctly independent of endianness of machine that executes that code. Currently I use

uint32_t fromLittleEndian(const char* data){
  return uint32_t(data[3]) << (CHAR_BIT*3) |
         uint32_t(data[2]) << (CHAR_BIT*2) |
         uint32_t(data[1]) << CHAR_BIT |
         data[0]; 
}

this, however generate inoptimal assembly. On my machine g++ -O3 -S produces:

_Z16fromLittleEndianPKc:
.LFB4:
    .cfi_startproc
    movsbl  3(%rdi), %eax
    sall    $24, %eax
    movl    %eax, %edx
    movsbl  2(%rdi), %eax
    sall    $16, %eax
    orl %edx, %eax
    movsbl  (%rdi), %edx
    orl %edx, %eax
    movsbl  1(%rdi), %edx
    sall    $8, %edx
    orl %edx, %eax
    ret
    .cfi_endproc

why is this happening? How could I convince it to produce optimal code when compiled on little endian machines:

_Z17fromLittleEndian2PKc:
.LFB5:
    .cfi_startproc
    movl    (%rdi), %eax
    ret
    .cfi_endproc

which I have gotten by compiling:

uint32_t fromLittleEndian2(const char* data){
    return *reinterpret_cast<const uint32_t*>(data);
}

Since I know my machine is little-endian, I know that above assembly is optimal, but it will fail if compiled on big-endian machine. It also violates strict-aliasing rules, so if inlined it might produce UB even on little endian machines. Is there a valid code that will be compiled to optimal assembly if possible?

Since I expect my function to be inlined a lot, any kind of runtime endian detection is out of the question. The only alternative to writing optimal C/C++ code is to use compile time endian detection, and use templates or #defines to fall back to the inefficient code if target endian is not little-endian. This however seems to be quite difficult to be done portably.

j_kubik
  • 6,062
  • 1
  • 23
  • 42
  • 1
    You can't match `reinterpret_cast`. It isn't doing any byte reordering. If you have to dance the endian byte shuffle, you have to pay the band. – user4581301 Apr 15 '16 at 22:39
  • The thing is that if my compile-target platform is little endian then I don't need byte shuffle - compiler should also know that, but it produces byte shuffled code anyway. – j_kubik Apr 15 '16 at 22:42
  • Thing is the compiler doesn't know you're flipping endian. It just sees a bunch of shifts and ors. Would be a nice trick to have, though. Could play down at the makefile level and compile and link in the correct function, but that'll kill any inlining. – user4581301 Apr 15 '16 at 22:50
  • 2
    Given that you are parsing files won't calling something like `htonl()` be insignificant compared to the actual time you spend reading data from the `HDD`? – Galik Apr 15 '16 at 22:52
  • @Galik I could but I don't want to link my code to inet just for this little thing. My `fromLittleEndian` works well, and probably quicker than anything involving calling `hton();` and freinds. And hdd throughput is likely to be much slower, I realise that. It's just that it's bugging me that I cannot get optimal assembly - This feels like something that should have been solved ages ago ;) – j_kubik Apr 15 '16 at 22:58
  • @user4581301 The compiler doesn't need to know I am flipping endian. Given that it knows target-machne-endian, it should be able to tell that above code is equivalent to `reinterpret_cast`. Or do I expect too much from optimizer? – j_kubik Apr 15 '16 at 23:00
  • 1
    AFAICT you cannot know via templates - the only way to find out endianness is essentially by reinterpreting the data through a pointer of different type, and that's not allowed in templates. Personally, I'd just use some `#define` provided by the compilers you are willing to support (and maybe some compiler intrinsic to swap the bytes); gcc provides `__BYTE_ORDER__` and `__bswap_32`, the other compilers will have something similar. Even better, you can just use boost.Endian and delegate the problem of dealing with the various compilers to them. – Matteo Italia Apr 15 '16 at 23:01
  • 1
    I agree. A smart enough compiler should be able to generate a lovely tree, shuffle it down to it's minimum, pretty much what you have there, and then do a walk through the logic to see that nothing whatsoever happened and essentially throw the whole thing out in favor of a copy. But it doesn't look like we are there yet. – user4581301 Apr 15 '16 at 23:07
  • 3
    By the way, about the "probably quicker than `hton`": at least on gcc on Linux, [I wouldn't bet on it](https://godbolt.org/g/8ZY63O); the code generated for `htonl` is probably optimal, the one with the naive shifts - I wouldn't say so. – Matteo Italia Apr 15 '16 at 23:11
  • @j_kubik : gcc is smart enough to turn `htonl` into a single `bswap` instruction on x86. There's no beating that, no matter how much templates you try to throw at the problem. If you really want to, you can wrap `hton(x)` functions in your own ones and provide inline optimized implementations for other not-so-smart compilers. – Daniel Kamil Kozar Apr 15 '16 at 23:19
  • @MatteoItalia Nice to know - didn't expect that. The only problem is that network order is big endian. – j_kubik Apr 15 '16 at 23:26
  • @j_kubik : Then that's what `` is for (man 3 endian), though it's nonstandard. – Daniel Kamil Kozar Apr 15 '16 at 23:27
  • @DanielKamilKozar I found a nice discussion of `` on gcc mailing list: https://gcc.gnu.org/ml/gcc-help/2007-07/msg00342.html - it is a bit dated, but I don't think much has changed, and they don't even go as far as other compilers there. Since I already use autotools, I should probably just defer the problem to it. Still, it feels like going around the problem; compiler alone shold be able to deliver this info. – j_kubik Apr 15 '16 at 23:36
  • A [very similar question](http://stackoverflow.com/a/36584577/224132) was asked recently about writing endian-agnostic code that compiled non-horribly. It's fiddly to accomplish for code that gcc and clang can both optimize to `bswap` or `movbe`. The most reliable way for compilers that support GNU C seems to be to use the `htobe64` or `htole32` functions that are wrappers around `__builtin_bswap64` and similar. Portable fallbacks are possible for other compilers. – Peter Cordes Apr 17 '16 at 01:44

2 Answers2

2

short answer - use htonl - its gonna be optimized up the wazzoo

pm100
  • 48,078
  • 23
  • 82
  • 145
  • 3
    The only problem is that network order is big endian. – j_kubik Apr 16 '16 at 00:06
  • yup and htonl will know that and convert or not depending on the machine its running on – pm100 Apr 18 '16 at 16:02
  • 1
    I know that, but `htonl` and friends are always converting from/to machine endian to/from big endian (network endian). My file is by definition little-endian, and I need a fuction set converting from/to machine endian to/from little endian. There is no way I see I could use `htonl` or `ntohl` to solve my problem, except maybe to convert always to big endian and then always do some byte shuffling anyway. This is not likely to be close to optimal. – j_kubik Apr 19 '16 at 02:21
1

Various platform libraries that I know of do this by #defining macros for the endian-swapping routines, based on the value of #define BIG_ENDIAN. In the cases where the source endianness matches your target endianness, you can just:

#ifdef LITTLE_ENDIAN
    #define fromLittleEndian(x) (x)
#else
    #define fromLittleEndian(x) _actuallySwapLittle((x))
#endif

For example:

http://man7.org/linux/man-pages/man3/endian.3.html

http://fxr.watson.org/fxr/source/sys/endian.h

Mark Bessey
  • 19,598
  • 4
  • 47
  • 69
  • `` doesn't seem to be portable. See https://gcc.gnu.org/ml/gcc-help/2007-07/msg00342.html – j_kubik Apr 15 '16 at 23:55
  • You have to choose here - optimal or portable. @j_kubik has a portable version that's non-optimal. Various other answers will suggest other techniques that are more-or-less portable or optimal, but the only way to ensure that you get output that does nothing in the do-nothing case is to use the preprocessor. There are no guarantees that any given C++ compiler will recognize the do-nothing case otherwise. – Mark Bessey Apr 16 '16 at 00:24
  • I guess an answer is to attempt to detect target platform, and if unsuccessful, then use suboptimal code as fall-back. – j_kubik Apr 16 '16 at 11:33
  • 1
    The most recent versions of BOOST has an endian library, which I believe is highly portable. – Zyx 2000 Apr 16 '16 at 12:17