Following up on the other answer, there are better reduction functions than folding the top bits over the low bits.
Starting with the original problem: the generated code is actually not that bad.
bool func0(uint64_t new_ruid, uint64_t old_euid, uint64_t old_ruid,
           uint64_t new_euid, uint64_t old_suid) {
return
(new_ruid != old_euid && new_ruid != old_ruid) ||
(new_euid != old_euid && new_euid != old_ruid && new_euid != old_suid);
}
_Z5func0mmmmm: # @_Z5func0mmmmm
cmpq %rsi, %rdi
je .LBB0_2
movb $1, %al
cmpq %rdx, %rdi
je .LBB0_2
retq
.LBB0_2:
cmpq %rsi, %rcx
setne %al
cmpq %rdx, %rcx
setne %dl
andb %al, %dl
cmpq %r8, %rcx
setne %al
andb %dl, %al
retq
Only the first portion contains conditional branches; the latter part is compiled to branchless setne / and sequences.
Thus we have three candidates for the reduction:
uint32_t reduct1(uint64_t a, uint64_t b) {
a ^= b;
return (a >> 32) | (a & 0xffffffff);
}
movq %rdi, %rax
xorq %rsi, %rax
movq %rax, %rcx
shrq $32, %rcx
orl %ecx, %eax
uint8_t reduct2(uint64_t a, uint64_t b) {
return __builtin_popcountl(a ^ b);
}
xorq %rsi, %rdi
popcntq %rdi, %rax
uint8_t reduct3(uint64_t a, uint64_t b) {
return a != b ? 1u : 0u;
}
cmpq %rsi, %rdi
setne %al
The first variant is quite bloated compared to the other two, and it still suffers from the multiplicative-overflow problem.
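To make the overflow concrete, here is a small sketch (the values are picked purely for illustration): both pairs below differ, so both reductions are non-zero, yet multiplying the results of the first variant still wraps to zero in 32-bit arithmetic and would be misread as "all equal".

#include <cstdint>
#include <cstdio>

// reduct1 from above
static uint32_t reduct1(uint64_t a, uint64_t b) {
    a ^= b;
    return (a >> 32) | (a & 0xffffffff);
}

int main() {
    uint32_t r1 = reduct1(0x8000000000000000ull, 0); // 0x80000000, inputs differ
    uint32_t r2 = reduct1(2, 0);                     // 0x00000002, inputs differ
    std::printf("%u\n", r1 * r2);                    // prints 0: the product overflows
    return 0;
}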
The popcount variant returns a value in the range 1-64 whenever the inputs differ, so the multiplicative overflow is solved: the factors are small and non-zero, and their products stay far below the 32-bit overflow threshold.
However, with the third variant the multiplications are strength-reduced to logical ANDs, which are far cheaper than actual multiplications -- and furthermore it compares the operands non-destructively, which probably beats the popcount version (which OTOH has more opportunities for instruction-level parallelism).
bool func(uint64_t new_ruid, uint64_t old_euid, uint64_t old_ruid,
          uint64_t new_euid, uint64_t old_suid) {
auto a = reduct3(new_ruid, old_euid);
auto b = reduct3(new_ruid, old_ruid);
auto c = reduct3(new_euid, old_euid);
auto d = reduct3(new_euid, old_ruid);
auto e = reduct3(new_euid, old_suid);
return (a * b) || (c * d * e);
}
cmpq %rsi, %rdi
setne %r9b
cmpq %rdx, %rdi
setne %dil
cmpq %rsi, %rcx
setne %sil
cmpq %rdx, %rcx
setne %al
cmpq %r8, %rcx
setne %cl
andb %r9b, %dil
andb %sil, %al
andb %cl, %al
orb %dil, %al
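As a quick sanity check (a throwaway harness, not part of the answer; it assumes func0 and func from above live in the same translation unit), the two versions can be compared exhaustively over a handful of small values:

#include <cassert>
#include <cstdint>

int main() {
    const uint64_t vals[] = {0, 1, 2, 3};
    // brute-force all 4^5 argument combinations and check that both versions agree
    for (uint64_t a : vals) for (uint64_t b : vals) for (uint64_t c : vals)
        for (uint64_t d : vals) for (uint64_t e : vals)
            assert(func0(a, b, c, d, e) == func(a, b, c, d, e));
    return 0;
}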
The problem can also be solved in the SSE2 domain, with careful selection of the parameters:
XMM0 = new_ruid | new_euid
XMM1 = old_euid | old_euid
XMM2 = old_ruid | old_ruid
XMM3 = old_ruid | old_suid <- new_ruid is compared to old_ruid a second time (only five comparisons are needed for six lanes)
XMM1 = _mm_cmpeq_epi64(XMM1, XMM0);
XMM2 = _mm_cmpeq_epi64(XMM2, XMM0);
XMM3 = _mm_cmpeq_epi64(XMM3, XMM0);
XMM1 = _mm_or_si128(XMM1, XMM2);
XMM1 = _mm_or_si128(XMM1, XMM3);
On SSE2 there's only ==, not !=, so we need to apply De Morgan a few times, then combine the left and right halves and movemask or movq the result out. (I don't know if it's allowed to use SSE2 in the Linux kernel, so this part may not fit the requirement.)
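For completeness, here is a minimal, untested sketch of the whole thing written with intrinsics (func_simd is just an illustrative name; strictly speaking _mm_cmpeq_epi64 is SSE4.1 / pcmpeqq, so with plain SSE2 the 64-bit equality would have to be emulated via _mm_cmpeq_epi32 plus a shuffle and an AND):

#include <cstdint>
#include <smmintrin.h> // _mm_cmpeq_epi64 is SSE4.1; everything else is SSE2

bool func_simd(uint64_t new_ruid, uint64_t old_euid, uint64_t old_ruid,
               uint64_t new_euid, uint64_t old_suid) {
    // lane 0 holds new_ruid, lane 1 holds new_euid
    __m128i xmm0 = _mm_set_epi64x(new_euid, new_ruid);
    __m128i xmm1 = _mm_set_epi64x(old_euid, old_euid);
    __m128i xmm2 = _mm_set_epi64x(old_ruid, old_ruid);
    __m128i xmm3 = _mm_set_epi64x(old_suid, old_ruid); // new_ruid vs. old_ruid a second time

    __m128i eq = _mm_or_si128(_mm_or_si128(_mm_cmpeq_epi64(xmm1, xmm0),
                                           _mm_cmpeq_epi64(xmm2, xmm0)),
                              _mm_cmpeq_epi64(xmm3, xmm0));
    // lane 0: new_ruid equals at least one of {old_euid, old_ruid}
    // lane 1: new_euid equals at least one of {old_euid, old_ruid, old_suid}
    // By De Morgan the original expression is !(lane0 && lane1), i.e. it is
    // false only when both 64-bit lanes are all-ones.
    return _mm_movemask_epi8(eq) != 0xffff;
}

The movemask is 0xffff exactly when both lanes are all-ones, which is the only case where both halves of the original expression are false.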