
In the Plan 9 source code I often find code like this to read serialised data from a buffer with a well-defined endianness:

    #include <stdint.h>

    uint32_t le32read(uint8_t buf[static 4]) {
        return (buf[0] | buf[1] << 8 | buf[2] << 16 | buf[3] << 24);
    }

I expected both gcc and clang to compile this code into something as simple as this assembly on amd64:

    .global le32read
    .type le32read,@function
le32read:
    mov (%rdi),%eax
    ret
    .size le32read,.-le32read

But contrary to my expectations, neither gcc nor clang recognize this pattern and produce complex assembly with multiple shifts instead.

Is there an idiom for this kind of operation that is both portable to all C99 implementations and produces good code (i.e. like the assembly presented above) across implementations?

fuz
  • You didn't specify at what optimization level you tried compiling. Did you try -O2? – Jonathon Reinhart Aug 09 '14 at 14:37
  • @JonathonReinhart I compiled with -O3 and in case of clang I also inspected the llvm intermediate code which shows that the initialisation was not being done. – fuz Aug 09 '14 at 14:40
  • I am going to assume you meant `uint8_t buf[4]` (without the `static` which doesn’t make any sense). As a parameter an array is interpreted as a pointer, so that is exactly `uint8_t *buf`. So the compiler has to access memory via a pointer, rather than having a variable. When it comes to accessing the memory the things are trickier when it comes to the assumptions that the compiler can make, because memory can be accessed in different ways (pointer arithmetic, ways external to the compiler). Even if this particular case might seem trivial, as a general case this is impossible to do. – bolov Aug 09 '14 at 15:12
  • @bolov The `uint8_t buf[static 4]` is C99 syntax that asserts to the compiler that accessing up to four elements of `buf` is well-defined. I was specifically talking about what the assembly on x86 looks like because x86 allows all sorts of unaligned memory access which makes compiler optimizations much easier—the compiler does not have to make any assumptions about memory alignment in most cases. I fail to see how this optimization would be undefined behavior (on amd64). – fuz Aug 09 '14 at 15:22
  • I don’t think this has anything to do with UB. I suspect it is not worth implementing an optimization algorithm that detects that N consecutive contiguous accesses to memory that are shifted a specific way are equivalent to accessing an N-width datum at that memory. After all, there will be at most one mem access (the rest will be cached) and bit operations are some of (if not the) fastest instructions on modern architectures. – bolov Aug 09 '14 at 15:28
  • @bolov The code I linked above compiles to 20 instructions with gcc. Twenty instructions to read a single integer from memory in a well-defined endianness, something for which there are specialized instructions (i.e. `bswap`) so that it can be done in one or two instructions. Most of the time, the data I'm operating on is already in the L3 cache because it was fetched from somewhere else (i.e. network, file system, decompressor) just a few cycles ago. Even if memory dominated the runtime of these operations, they appear often enough to mandate that the compiler optimizes them correctly. – fuz Aug 09 '14 at 15:31
  • My point is that there is always a trade-off between the complexity of an optimization algorithm (implementation and cost of running during the compilation) and the benefits that it would bring. In this case I don’t think this is a good ratio, but again I am just speculating. – bolov Aug 09 '14 at 15:34
  • @bolov I understand your point, yet "you don't need this optimization" is not a good answer to this question. – fuz Aug 09 '14 at 15:40
  • Think about it, the algorithm would have to match a pattern of memory access, shift operations and bit-boolean operations to assert that what you are doing is access a data byte-by-byte. And I don’t think this is so often encountered in code for the compiler writers to be interested to design and implement such an algorithm. And I am curious: benchmark this with a straight int32_t access and see what the speed up is. I suspect it will be far from impressive. – bolov Aug 09 '14 at 15:40
  • That’s why it is not an answer, but a comment. I was just trying to find explanations as to why the code is not optimized. As for a solution to your problem, just cast the pointer to `uint32_t *`, if the endianness matches of course. – bolov Aug 09 '14 at 15:43
  • Doesn’t this work: `return *(uint32_t *)buf`? – bolov Aug 09 '14 at 15:44
  • if the endianness doesn’t match, I think there is no way around moving bytes around in multiple instructions. – bolov Aug 09 '14 at 15:46
  • @bolov Casting to `uint32_t *` is not portable, not only because of the endianness issue, but also because the `uint8_t` pointer has different alignment restrictions. Many architectures provide instructions (like `bswap` on x86) to convert between different endiannesses efficiently, so why shouldn't the compiler know how to use them? – fuz Aug 09 '14 at 15:48
  • @bolov: No, that's UB. The reason you'd expect a compiler to get this right is that it's done all over the place. Also, did you read the question? He's asking about the case where the endian *does* match. – tmyklebu Aug 09 '14 at 18:39
  • @JonathonReinhart: I cannot find an optimisation level at which gcc generates decent code for this pattern. – tmyklebu Aug 09 '14 at 18:53
  • @tmyklebu While the code I provided applies to the case where endianness matches, the compiler could also easily generate decent code for the case where endianness doesn't match, as the `bswap` instruction can be used. – fuz Aug 09 '14 at 19:23
  • @FUZxxl: Yes, that's the point of the code. I'm simply berating the other commenters for wasting your time. (I don't know a way to get gcc to generate decent code for this myself, but I'm much more willing to do illegal casts in practice.) – tmyklebu Aug 09 '14 at 19:34
  • @tmyklebu I filed a [bug](http://llvm.org/bugs/show_bug.cgi?id=20605) about this optimization opportunity for clang. – fuz Aug 09 '14 at 20:25
  • @FUZxxl: You might do the same for gcc. – tmyklebu Aug 09 '14 at 20:28
  • @doniyor Could you please not alter the formatting of the assembly? It is indented correctly. Specifically, a directive is not a label and therefore does not belong in the first column. – fuz Aug 09 '14 at 20:35
  • @FUZxxl oh I am sorry. – doniyor Aug 09 '14 at 20:36
  • Post your last two paragraphs as an answer and it wins an upvote from me. – tmyklebu Aug 09 '14 at 21:03
  • @tmyklebu Ask and you shall receive. – fuz Aug 09 '14 at 21:06

3 Answers


After some research, I found (with the help of the terrific people in ##c on Freenode) that gcc 5.0 will implement optimizations for the kind of pattern described above. In fact, it compiles the C source listed in my question to the exact assembly I listed there.

I haven't found similar information about clang, so I filed a bug report. As of Clang 9.0, clang recognises both the read as well as the write idiom and turns it into fast code.
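The write idiom mentioned here is the byte-by-byte store counterpart of `le32read`. A minimal sketch of what that typically looks like (the name `le32write` is mine, not from the thread); recent gcc and clang fuse the four stores into a single 32-bit store, plus a byte swap on big-endian targets:

```c
#include <stdint.h>

/* Store x into buf in little-endian byte order, one byte at a time.
   Modern gcc/clang recognize this pattern and emit a single store. */
void le32write(uint8_t buf[static 4], uint32_t x) {
    buf[0] = x & 0xff;
    buf[1] = (x >> 8) & 0xff;
    buf[2] = (x >> 16) & 0xff;
    buf[3] = (x >> 24) & 0xff;
}
```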

fuz

If you want to guarantee a conversion between the native platform order and a defined order (the order on a network, for example), you can let the system libraries do the work and simply use the functions from <netinet/in.h>: htons, htonl and ntohs, ntohl.

But I must admit that the include file is not guaranteed: under Windows I think it is winsock.h.
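A sketch of how these functions are typically used together with `memcpy` to read a network-order (big-endian) 32-bit value from a byte buffer; note this is the opposite byte order from the question's `le32read`, and the name `be32read` is my own:

```c
#include <arpa/inet.h> /* ntohl; on Windows this lives in winsock2.h */
#include <stdint.h>
#include <string.h>

/* Read a big-endian (network-order) 32-bit value from a byte buffer.
   memcpy sidesteps the alignment and aliasing pitfalls of a pointer
   cast; compilers typically turn it into a plain load. */
uint32_t be32read(const uint8_t buf[4]) {
    uint32_t x;
    memcpy(&x, buf, sizeof x); /* bytes now in native order */
    return ntohl(x);           /* network (big-endian) to host order */
}
```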

Serge Ballesta
  • The problem with these functions is that what they do is underspecified because they use platform-specific types like `short` and `int`. I'm not even sure if they are useful for the purpose I have. – fuz Mar 11 '15 at 14:00
  • In all implementations I've found, `htons` and `ntohs` operate on `int16_t` (or `uint16_t`) and `htonl` and `ntohl` operate on `int32_t` (or `uint32_t`). After all they are used for protocols that do know when they want 16 or 32 bit values ... – Serge Ballesta Mar 11 '15 at 14:16
  • Ah, my knowledge is outdated there. Still, the standard does not prescribe how a `uint32_t` is laid out in memory; overlaying structures over buffers you get from somewhere is a questionable endeavour. – fuz Mar 11 '15 at 14:27
  • @FUZxxl: There would be no semantic difficulty with defining a standardized set of portable macros which would store a number of one type into an array of a smaller type, with the endian-ness and number of bits per destination element being defined by the macro rather than the machine's variable types [so `unsigned short dest[2]; __pack_le32us(0x12345678, dest);` would set `dest[0]` to 0x5678 and `dest[1]` to 0x1234, even if `short` was a 24-bit type]. If there were a standard, compilers which made the macros identify compiler intrinsics could then optimize them easily. – supercat Jul 21 '15 at 00:01

You could determine endianness like in this answer. Then use the O32_HOST_ORDER macro to decide whether to cast the byte array to a uint32_t directly or to use your bit-shifting expression.

    #include <stdint.h>

    uint32_t le32read(uint8_t buf[static 4]) {
        if (O32_HOST_ORDER == O32_LITTLE_ENDIAN) {
            return *(uint32_t *)&buf[0];
        }
        return (buf[0] | buf[1] << 8 | buf[2] << 16 | buf[3] << 24);
    }
Tobias Brandt
  • This is not portable as I cannot guarantee that the alignment of buf is equal to the alignment the particular platform requires for a `uint32_t`. Some platforms (like MIPS and older ARMs) do not allow you to make unaligned memory accesses. Thus, it's the compiler's job to optimize reads like the one I want to perform—something only the compiler can do as only it knows about the platform's alignment restrictions. – fuz Aug 09 '14 at 20:41
  • "not portable" is a very forgiving word for it. This is actually **undefined behavior** because it violates the strict aliasing rule. – The Paramagnetic Croissant Aug 09 '14 at 20:44
  • Yeah, it's sort of unfortunate that you need to specify `-fno-strict-aliasing` if you're writing any sort of nontrivial code. – tmyklebu Aug 10 '14 at 17:23
  • @tmyklebu You can do all these things without `-fno-strict-aliasing`. That flag is a cheat for people who don't know how to program. – fuz Aug 10 '14 at 19:33
  • @FUZxxl: Shrug. I don't know how to do the "point a struct-pointer at the char buffer you give to `read()`" network packet parsing trick without `-fno-strict-aliasing` or incurring an unnecessary copy or separate `read()` calls for the header and the contents. – tmyklebu Aug 10 '14 at 20:19
  • @tmyklebu You seem to not understand aliasing rules. As a special rule, it is assumed that contents of `char*` may alias with anything else so doing exactly this is fine. – fuz Aug 10 '14 at 20:58
  • @FUZxxl: No, I understand them fine. You can access an object whose effective type is `struct foo` through a `char *`, but you can't go the other way. The trick I describe aliases your `char recvbuf[123456]` (or, more accurately, `recvbuf + off` for some offset `off`), via a `struct foo *`; this requires either `-fno-strict-aliasing` or a direct hit to performance. – tmyklebu Aug 10 '14 at 22:08
  • @tmyklebu In this case, you can still use a union type. – fuz Aug 10 '14 at 22:19
  • @FUZxxl: You can, but you need a union type with one member for each possible offset. (Or some other way to make sure there's a properly-offset union member for each admissible offset.) This is quite unwieldy, and, furthermore, I don't believe the C standard guarantees that implementations can handle unions with several hundred thousand members. *Also*, even if you can compile it, this technique means your networking code needs to have definitions of every possible structure you might overlay the receive-buffer with, which limits its generality. – tmyklebu Aug 10 '14 at 22:52