Currently, from research and various attempts, I'm pretty sure that the only solution to this problem is to use assembly. I'm posting this question to show an existing problem, and maybe get attention from compiler developers, or get some hits from searches about similar problems.
If anything changes in the future, I will accept it as an answer.
This is a closely related question for MSVC.
On x86-64 machines, it is faster to use div/idiv with a 32-bit operand than with a 64-bit operand. When the dividend is 64-bit, the divisor is 32-bit, and you know that the quotient will fit in 32 bits, you don't have to use the 64-bit div/idiv. You can split the 64-bit dividend across two 32-bit registers, and even with that overhead, a 32-bit div on the register pair is faster than a 64-bit div on a full 64-bit register.
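Concretely, div with a 32-bit operand divides the 64-bit value edx:eax by that operand and leaves the quotient in eax and the remainder in edx, so the "split" is just a shift and a truncation. A rough C picture of those semantics (div32_sketch is only an illustrative name, not something a compiler emits):
#include <stdint.h>
/* What "div r/m32" computes: edx:eax divided by the 32-bit operand.
   The quotient must fit in 32 bits, otherwise the CPU raises #DE. */
uint32_t div32_sketch(uint64_t a, uint32_t b) {
    uint32_t hi = (uint32_t)(a >> 32); /* would be loaded into edx */
    uint32_t lo = (uint32_t)a;         /* would be loaded into eax */
    return (uint32_t)((((uint64_t)hi << 32) | lo) / b); /* value left in eax */
}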
The compiler will produce a 64-bit div for this function, and that is correct, because with a 32-bit div, a hardware exception (#DE) is raised if the quotient does not fit in 32 bits (for example, div_c(1ull << 40, 1) has a quotient that needs 41 bits).
uint32_t div_c(uint64_t a, uint32_t b) {
    return a / b;
}
However, if the quotient is known to fit in 32 bits, a full 64-bit division is unnecessary. I used __builtin_unreachable to give the compiler that information, but it doesn't make a difference.
uint32_t div_c_ur(uint64_t a, uint32_t b) {
    uint64_t q = a / b;
    /* promise the compiler that the quotient always fits in 32 bits */
    if (q >= 1ull << 32) __builtin_unreachable();
    return q;
}
For both div_c and div_c_ur, the output from gcc is:
mov rax, rdi
mov esi, esi
xor edx, edx
div rsi
ret
clang does an interesting optimization: it checks at run time whether the dividend fits in 32 bits and uses a 32-bit div if it does, but it still falls back to a 64-bit div when the high half of the dividend is nonzero.
mov rax, rdi
mov ecx, esi
mov rdx, rdi
shr rdx, 32
je .LBB0_1
xor edx, edx
div rcx
ret
.LBB0_1:
xor edx, edx
div ecx
ret
I had to write it directly in assembly to achieve what I want; I couldn't find any other way to do this.
__attribute__((naked, sysv_abi))
uint32_t div_asm(uint64_t, uint32_t) {__asm__(
    "mov eax, edi\n\t"  /* low half of the dividend -> eax */
    "mov rdx, rdi\n\t"
    "shr rdx, 32\n\t"   /* high half of the dividend -> edx */
    "div esi\n\t"       /* edx:eax / esi, quotient in eax */
    "ret\n\t"
);}
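For completeness, the same division can be written with GNU extended inline asm instead of a naked function, leaving the calling convention and register allocation to the compiler. This is only a sketch of an equivalent (div_inline_asm is an arbitrary name), and it is of course still assembly, so it doesn't change the point of the question:
#include <stdint.h>
uint32_t div_inline_asm(uint64_t a, uint32_t b) {
    uint32_t q, r;
    /* edx:eax / b -> quotient in eax, remainder in edx. The "r" constraint
       keeps the divisor in a register, so no size suffix is needed and the
       template assembles under both the AT&T and Intel dialects. */
    __asm__("div %[b]"
            : "=a"(q), "=d"(r)
            : [b] "r"(b), "a"((uint32_t)a), "d"((uint32_t)(a >> 32)));
    (void)r; /* the remainder is not needed here */
    return q;
}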
Was it worth it? At least perf reports 49.47% overhead for div_c versus 24.88% for div_asm, so on my computer (Tiger Lake) div r32 is about twice as fast as div r64.
This is the benchmark code.
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
__attribute__((noinline))
uint32_t div_c(uint64_t a, uint32_t b) {
    uint64_t q = a / b;
    /* same hint as div_c_ur above: the quotient fits in 32 bits */
    if (q >= 1ull << 32) __builtin_unreachable();
    return q;
}
__attribute__((noinline, naked, sysv_abi))
uint32_t div_asm(uint64_t, uint32_t) {__asm__(
    "mov eax, edi\n\t"
    "mov rdx, rdi\n\t"
    "shr rdx, 32\n\t"
    "div esi\n\t"
    "ret\n\t"
);}
static uint64_t rdtscp() {
    uint32_t _; /* receives IA32_TSC_AUX (the processor id); unused */
    return __builtin_ia32_rdtscp(&_);
}
int main() {
#define n 500000000ll
    uint64_t c;
    c = rdtscp();
    for (int i = 1; i <= n; ++i) {
        volatile uint32_t _ = div_c(i + n * n, i + n);
    }
    printf("  c%15" PRIu64 "\n", rdtscp() - c);
    c = rdtscp();
    for (int i = 1; i <= n; ++i) {
        volatile uint32_t _ = div_asm(i + n * n, i + n);
    }
    printf("asm%15" PRIu64 "\n", rdtscp() - c);
}
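For reference, this is built with something like gcc -O2 -masm=intel bench.c (bench.c being whatever the file is named); -masm=intel is needed because the inline assembly in div_asm is written in Intel syntax, while gcc's inline asm defaults to AT&T syntax.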