Recently I had to write code for critical real-time functionality, and I used a few __builtin_... functions. I understand that such code is not portable, because not all compilers support the "__builtin_..." functions or syntax. I was wondering whether there is a way to write the code in plain C so that the compiler recognizes the pattern and substitutes its own internal "__builtin_..."-like implementation.
Below is a description of a small experiment I did, but my question is:
- Are there any tips, best-known methods, or guidelines for writing portable C code so that the compiler can detect the pattern (compiler bugs aside) and make full use of the target CPU architecture?
For example, take reversing the bytes in a dword (so that the first byte becomes the last one, the last one becomes the first one, and so on). The x86_64 architecture has a dedicated assembly instruction for it: bswap. I tried 4 different options:
#include <stdint.h>
#include <stdlib.h>

typedef union _helper_s
{
    uint32_t val;
    uint8_t bytes[4];
} helper_u;

/* Option 1: swap the bytes by hand through a union (unrolled). */
uint32_t reverse(uint32_t d)
{
    helper_u b;
    uint8_t temp;
    b.val = d;
    temp = b.bytes[0];
    b.bytes[0] = b.bytes[3];
    b.bytes[3] = temp;
    temp = b.bytes[1];
    b.bytes[1] = b.bytes[2];
    b.bytes[2] = temp;
    return b.val;
}

/* Option 2: the same swap through a union, written as a loop. */
uint32_t reverse1(uint32_t d)
{
    helper_u b;
    uint8_t temp;
    b.val = d;
    for (size_t i = 0; i < sizeof(uint32_t) / 2; i++)
    {
        temp = b.bytes[i];
        b.bytes[i] = b.bytes[sizeof(uint32_t) - i - 1];
        b.bytes[sizeof(uint32_t) - i - 1] = temp;
    }
    return b.val;
}

/* Option 3: shifts and masks only. */
uint32_t reverse2(uint32_t d)
{
    return (d << 24) | (d >> 24) | ((d & 0xFF00) << 8) | ((d & 0xFF0000) >> 8);
}

/* Option 4: the compiler builtin (not portable). */
uint32_t reverse3(uint32_t d)
{
    return __builtin_bswap32(d);
}
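To make the intended behavior concrete, here is a minimal sketch that checks two of the portable variants against a known value and against each other. The names reverse_shift, reverse_loop and reverse_self_check are hypothetical and exist only for this illustration; the function bodies mirror reverse2 and reverse1 above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Shift/mask reversal, same expression as reverse2 (hypothetical name). */
static uint32_t reverse_shift(uint32_t d)
{
    return (d << 24) | (d >> 24) | ((d & 0xFF00u) << 8) | ((d & 0xFF0000u) >> 8);
}

/* Byte-wise reversal through a union, same idea as reverse1 (hypothetical name). */
static uint32_t reverse_loop(uint32_t d)
{
    union { uint32_t val; uint8_t bytes[4]; } b;
    b.val = d;
    for (size_t i = 0; i < sizeof b.bytes / 2; i++)
    {
        uint8_t temp = b.bytes[i];
        b.bytes[i] = b.bytes[sizeof b.bytes - i - 1];
        b.bytes[sizeof b.bytes - i - 1] = temp;
    }
    return b.val;
}

/* Aborts via assert if the variants disagree or give the wrong result. */
static void reverse_self_check(void)
{
    assert(reverse_shift(0x11223344u) == 0x44332211u);
    assert(reverse_loop(0x11223344u) == 0x44332211u);
    assert(reverse_shift(0xDEADBEEFu) == reverse_loop(0xDEADBEEFu));
}
```

Calling reverse_self_check() once at startup, or from a unit test, is enough to confirm that the variants are interchangeable before comparing the generated assembly.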
All the options provide the same functionality. I compiled them with different compilers at different optimization levels, and the results were mixed:
GCC did great! For both the -O3 and -Os optimization levels it gave the same result for all the functions:
reverse:
        mov     eax, edi
        bswap   eax
        ret
reverse1:
        mov     eax, edi
        bswap   eax
        ret
reverse2:
        mov     eax, edi
        bswap   eax
        ret
reverse3:
        mov     eax, edi
        bswap   eax
        ret
Clang disappointed me a little. With -O3 it gave the same result as GCC; with -Os, however, it completely lost track in reverse1. It didn't recognize the pattern and produced a far less optimal binary:
reverse1:                               # @reverse1
        lea     rax, [rsp - 8]
        mov     dword ptr [rax], edi
        mov     ecx, 3
.LBB1_1:                                # =>This Inner Loop Header: Depth=1
        mov     sil, byte ptr [rax]
        mov     dl, byte ptr [rsp + rcx - 8]
        mov     byte ptr [rax], dl
        mov     byte ptr [rsp + rcx - 8], sil
        dec     rcx
        inc     rax
        cmp     rcx, 1
        jne     .LBB1_1
        mov     eax, dword ptr [rsp - 8]
        ret
Actually, the only difference between reverse and reverse1 is that reverse is the "loop unrolled" version of reverse1, so I assume that with -Os the compiler didn't even try to unroll the for loop or to anticipate its purpose.
With ICC, things went even worse: it was unable to recognize the pattern in the reverse and reverse1 functions at either the -O3 or the -Os optimization level.
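Given that only the shift/mask form (reverse2) and the builtin were recognized by all three compilers in this experiment, one common way to confine the non-portable part is a small wrapper selected by the preprocessor. The following is a minimal sketch, not a definitive solution: the names bswap32_portable and bswap32_self_check are hypothetical, and it assumes that any compiler defining __GNUC__ or __clang__ is recent enough to provide __builtin_bswap32, and that _MSC_VER implies the MSVC runtime with _byteswap_ulong in <stdlib.h>.

```c
#include <assert.h>
#include <stdint.h>
#if defined(_MSC_VER)
#include <stdlib.h>  /* declares _byteswap_ulong on MSVC */
#endif

/* Hypothetical helper name; not part of any library. */
static inline uint32_t bswap32_portable(uint32_t d)
{
#if defined(__GNUC__) || defined(__clang__)
    /* Assumption: the compiler is new enough to provide this builtin. */
    return __builtin_bswap32(d);
#elif defined(_MSC_VER)
    /* unsigned long is 32 bits wide on MSVC targets. */
    return (uint32_t)_byteswap_ulong(d);
#else
    /* Plain C fallback: the shift/mask form used in reverse2. */
    return (d << 24) | (d >> 24) | ((d & 0xFF00u) << 8) | ((d & 0xFF0000u) >> 8);
#endif
}

/* Aborts via assert if the selected implementation misbehaves. */
static void bswap32_self_check(void)
{
    assert(bswap32_portable(0x11223344u) == 0x44332211u);
    assert(bswap32_portable(bswap32_portable(0xDEADBEEFu)) == 0xDEADBEEFu);
}
```

The rest of the code then calls only bswap32_portable, so the compiler-specific branches live in one place and an unknown compiler still gets correct, if possibly slower, plain C.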
P.S.
I often hear people say that code has to be written so that even a junior programmer can easily understand it, and that modern compilers are "smart" enough to take care of the optimizations. Now I have evidence that this is not true (or at least not always true).