... or rather, why does omitting the `static_cast` slow down my function?
Consider the function below, which performs integer division:
```cpp
int Divide(int x, int y) {
    int ret = 0, i = 32;
    long j = static_cast<long>(y) << i;
    while (x >= y) {
        while (x < j) --i, j >>= 1;
        ret += 1 << i, x -= j;
    }
    return ret;
}
```
This performs reasonably well, as one might expect.
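A quick sanity check, for example (the test values are my own, chosen arbitrarily; drop this alongside the function above):

```cpp
#include <cassert>

int Divide(int x, int y);  // the function defined above

int main() {
    assert(Divide(100, 7) == 14);    // 100 / 7 == 14
    assert(Divide(1024, 2) == 512);  // exact power-of-two case
    return 0;
}
```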
However, if we remove the `static_cast` on line 3, like so:
```cpp
int Divide(int x, int y) {
    int ret = 0, i = 32;
    long j = y << i;
    while (x >= y) {
        while (x < j) --i, j >>= 1;
        ret += 1 << i, x -= j;
    }
    return ret;
}
```
This version performs noticeably slower, sometimes several hundred times slower (I haven't measured rigorously, but that shouldn't be far off), for pathological inputs where `x` is large and `y` is small.
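For reference, here's roughly how I compared the two (a sketch; the inputs are just one pathological pair, and `DivideNoCast` is my name here for the no-cast version so both can coexist in one file):

```cpp
#include <chrono>
#include <cstdio>

int Divide(int x, int y);        // the static_cast version above
int DivideNoCast(int x, int y);  // the no-cast version above, renamed

int main() {
    const int x = 100000000, y = 3;  // pathological: large x, small y
    auto time_one = [&](const char* name, int (*f)(int, int)) {
        auto t0 = std::chrono::steady_clock::now();
        int q = f(x, y);
        auto t1 = std::chrono::steady_clock::now();
        long long us = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
        std::printf("%s: q = %d, %lld us\n", name, q, us);
    };
    time_one("with cast", Divide);
    time_one("without cast", DivideNoCast);
    return 0;
}
```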
I was curious why, so I dug into the assembly. However, apart from the cast-related difference on line 3, I get the exact same output. Here's the line 3 output for reference (source):
With `static_cast`:
```asm
movsxd rax, dword ptr [rbp - 8]
mov ecx, dword ptr [rbp - 16]
shl rax, cl
mov qword ptr [rbp - 24], rax
```
Without `static_cast`:
```asm
mov eax, dword ptr [rbp - 8]
mov ecx, dword ptr [rbp - 16]
shl eax, cl
cdqe
mov qword ptr [rbp - 24], rax
```
The rest is identical.
I'm really curious where the overhead is occurring.
EDIT: I've tested a bit further, and it looks like the `while` loop is where most of the time is spent, not the initialization of `j`. The additional `cdqe` instruction doesn't seem significant enough to account for the total increase in wall time.
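How I checked, roughly: an instrumented copy of the function with iteration counters (the counters and names here are my additions, purely for measurement):

```cpp
long long g_outer = 0, g_inner = 0;  // iteration counters, added only for measurement

int DivideCounted(int x, int y) {
    int ret = 0, i = 32;
    long j = static_cast<long>(y) << i;  // swap in `y << i` to instrument the no-cast version
    while (x >= y) {
        ++g_outer;
        while (x < j) { ++g_inner; --i; j >>= 1; }
        ret += 1 << i;
        x -= j;
    }
    return ret;
}
```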
Some disclaimers, since I've been getting a lot of comments peripheral to the actual question:
- I'm aware that shifting an `int` by 32 or more bits is UB.
- I'm assuming only positive inputs.
- `long` is 8 bytes on my platform, so it doesn't overflow.
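For what it's worth, here's the same algorithm with those assumptions made explicit via fixed-width types (a sketch; `Divide64` is just my name for it, and I'd expect it to behave like the `static_cast` version):

```cpp
#include <cstdint>

int Divide64(int x, int y) {  // positive inputs assumed, as stated above
    int ret = 0, i = 32;
    int64_t j = static_cast<int64_t>(y) << i;  // 64-bit shift, well-defined for i < 64
    while (x >= y) {
        while (x < j) { --i; j >>= 1; }
        ret += 1 << i;  // after the inner loop, i <= 30 for positive 32-bit inputs
        x -= static_cast<int>(j);
    }
    return ret;
}
```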
I'd like to know what might be causing the increased runtime; the comments criticizing these points don't actually address that.