6

... or rather, why does omitting the static_cast slow down my function?

Consider the function below, which performs integer division:

int Divide(int x, int y) {
  int ret = 0, i = 32;
  long j = static_cast<long>(y) << i;
  while (x >= y) {
    while (x < j) --i, j >>= 1;
    ret += 1 << i, x -= j;
  }
  return ret;
}

This performs reasonably well, as one might expect. However, if we remove the static_cast on line 3, like so:

int Divide(int x, int y) {
  int ret = 0, i = 32;
  long j = y << i;
  while (x >= y) {
    while (x < j) --i, j >>= 1;
    ret += 1 << i, x -= j;
  }
  return ret;
}

This version performs noticeably slower, sometimes several hundred times slower (I haven't measured rigorously, but that shouldn't be far off), for pathological inputs where x is large and y is small. I was curious why, so I tried digging into the assembly code. However, apart from the casting difference on line 3, I get exactly the same output. Here's the line 3 output for reference (source):

With static_cast:

movsxd  rax, dword ptr [rbp - 8]
mov     ecx, dword ptr [rbp - 16]
shl     rax, cl
mov     qword ptr [rbp - 24], rax

Without static_cast:

mov     eax, dword ptr [rbp - 8]
mov     ecx, dword ptr [rbp - 16]
shl     eax, cl
cdqe
mov     qword ptr [rbp - 24], rax

The rest is identical.

I'm really curious where the overhead is occurring.

EDIT: I've tested a bit further, and it looks like the while loop is where most of the time is spent, not when y is initialized. The additional cdqe instruction doesn't seem to be significant enough to warrant the total increase in wall time.

Some disclaimers, since I've been getting a lot of comments peripheral to the actual question:

  1. I'm aware that shifting an int by 32 bits or more is UB.
  2. I'm assuming only positive inputs.
  3. long is 8 bytes long on my platform, so it doesn't overflow.

I'd like to know what might be causing the increased runtime, which the comments criticizing the above don't actually address.

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
Min
  • 63
  • 5
  • 6
    The `int` type is typically 32 bits. Attempting to shift a 32-bit value by 32 bits (or more) leads to *undefined behavior*. So any speculation about your program is moot as long as it has undefined behavior. – Some programmer dude Mar 17 '22 at 08:49
  • On my machine, `long` and `int` are same types in practice (32b). When compiling, did you allow optimisation flags (O2 for example). – Damien Mar 17 '22 at 08:49
  • There could also be problems if `y` happens to be negative. When doing bitwise operations, you should always work on unsigned types. See e.g. [this shift operator reference](https://en.cppreference.com/w/cpp/language/operator_arithmetic#Bitwise_shift_operators) for more information. – Some programmer dude Mar 17 '22 at 08:53
  • 1
    are you aware of what the comma operator does here `ret += 1 << i, x -= j;` ? – 463035818_is_not_an_ai Mar 17 '22 at 08:56
  • 1
    Lastly please don't overuse the comma operator, it tend to make code harder to read and understand. And in your case it might not even work as you expect it to. If you need to do separate operations in sequence, use plain statements. – Some programmer dude Mar 17 '22 at 08:56
  • My fault for not clarifying, but a) I am aware of the UB regarding shifting an int past 32, b) long is 8 bytes on my platform, and c) I am well aware of the downsides of using the comma operator—was just trying to condense the code vertically :P – Min Mar 17 '22 at 09:04
  • 2
    @Min: Personally I like statements like `while (x < j) --i, j >>= 1;` in arithmetic code. Cuts down on braces which can make things clearer. – Bathsheba Mar 17 '22 at 09:09
  • How did you benchmark this? Compile in debug mode with clang 13.0.1 as shown in your Godbolt link? And call from what timed region? A factor of "several hundred" is completely implausible unless one version optimized away completely with optimization enabled (e.g. result unused, and/or compile-time-visible UB leading to no code. Or called with constant args with optimization enabled). Much more likely to be from warm-up effects and sloppy benchmarking ([Idiomatic way of performance evaluation?](https://stackoverflow.com/q/60291987)). – Peter Cordes Mar 17 '22 at 14:23
  • The asm difference you show in the debug builds is just 1 extra cycle of latency outside of any loops (separately sign-extending the shift result, instead of folding the sign-extension into a load). Even for that snippet alone, there's already store/reload latency of about 5 cycles on a typical CPU, plus 1 for the shift (https://agner.org/optimize/), so we're talking about 7 vs. 6 cycles or so, obviously not factors of 100. (Latency will be the bottleneck because this is a debug build reusing the same variables.) – Peter Cordes Mar 17 '22 at 14:30
  • 1
    Or wait, I guess the loop has data-dependent run-time, so the difference between `j = (long)(y << 0)` (i.e. just sign-extend `j` unmodified) vs. shifting `y` to the high half could be very significant in terms of how many iterations the loop runs. – Peter Cordes Mar 17 '22 at 14:35
  • Thanks for the enlightening comments, Peter. Could you perhaps expand a bit more on the final point? In particular, I'd appreciate if you could elaborate how the difference would incur additional cost in runtime. I'm seeing especially long wall time for the sort of cases that you describe (long loops; e.g. when `x` is 1527274318 and `y` is 1), with optimization set to -O2. If you could post it as an answer, I'd like to accept it! – Min Mar 17 '22 at 15:05

3 Answers

9

Widening after the shift reduces your loop to naive repeated subtraction

It's not the run-time of cdqe or movsxd vs. mov that's relevant, it's the different starting values for your loop, resulting in a different iteration count, especially for pathological cases.

Clang without optimization compiled your source exactly the way it was written, doing the shift on an int and then sign-extending the result to long. The shift-count UB is invisible to the compiler with optimization disabled because, for consistent debugging, it assumes variable values can change between statements; so the behaviour depends on what the target machine does with a shift count equal to the operand size.

When compiling for x86-64, that results in long j = (long)(y<<0);, i.e. long j = y;, rather than having those bits at the top of a 64-bit value.

x86 scalar shifts like shl eax, cl mask the count with &31 (except with 64-bit operand size) so the shift used a count of 32 % 32 == 0. AArch64 would I think saturate the shift count, i.e. let you shift out all the bits.
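As an illustration (this is just a model with a made-up name `shl32_x86`, not compiler output), the x86 masking behaviour can be expressed in well-defined C++ by masking the count explicitly:

```cpp
#include <cstdint>

// Model of x86 SHL with 32-bit operand size: the hardware masks the shift
// count with & 31, so a "shift by 32" becomes a shift by 0 and the value
// passes through unchanged.
uint32_t shl32_x86(uint32_t v, unsigned count) {
  return v << (count & 31);  // count & 31 < 32, so this shift is well-defined
}
```

So `shl32_x86(y, 32)` just returns `y`, which is exactly how `j` ends up equal to `y` in the no-cast debug build.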

Notice that it does a 32-bit operand-size shl eax, cl and then sign-extends the result with cdqe, instead of doing a sign-extending reload of y and then a 64-bit operand-size shl rax,cl.


Your loop has a data-dependent iteration count

If you single-step with a debugger, you could see the local variable values accurately. (That's the main benefit of an un-optimized debug build, which is not what you should be benchmarking.) And you can count iterations.

  while (x >= y) {
    while (x < j) --i, j >>= 1;
    ret += 1 << i, x -= j;
  }

With j = y, if we enter the outer loop at all, then the inner loop condition is always false.
So it never runs, j stays constant the whole time, and i stays constant at 32.

1<<32 again compiles to a variable-count shift with 32-bit operand-size, because 1 has type int. (1LL has type long long, and can safely be left-shifted by 32). On x86-64, this is just a slow way to do ret += 1;.
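A quick sketch of that type difference (`OneShifted` is just an illustrative helper): the shift becomes well-defined once the left operand is 64-bit.

```cpp
#include <cstdint>

// 1 has type int, so 1 << 32 is UB; 1LL has type long long, so 1LL << 32
// is a well-defined 64-bit shift.
int64_t OneShifted(int count) {
  return 1LL << count;  // valid for 0 <= count <= 62
}
```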

x -= j; is of course just x -= y;, so we're counting how many subtractions to make x < y.

It's well-known that division by repeated subtraction is extremely slow for large quotients, since the run time scales linearly with the quotient.
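Concretely, the no-cast version degenerates into something like this sketch (hypothetical name), where the iteration count equals the quotient:

```cpp
// What the loop reduces to when j == y and i == 32 throughout: the x86
// shift masks 1 << 32 down to 1 << 0, so each outer iteration just adds 1
// and subtracts y once. The loop therefore runs x / y times: roughly 1.5
// billion iterations for x = 1527274318, y = 1, versus ~31 iterations for
// the shift-based version.
int DivideBySubtraction(int x, int y) {
  int ret = 0;
  while (x >= y) {
    ret += 1;  // was ret += 1 << i with i stuck at 32, masked to 1 << 0
    x -= y;    // was x -= j with j stuck at y
  }
  return ret;
}
```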

You do happen to get the right result, though. Yay.


BTW, long is only 32-bit on some targets like Windows x64 and 32-bit platforms; use long long or int64_t if you want a type twice the width of int. And maybe static_assert to make sure int isn't that wide.
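A portable sketch of the intended function along those lines (assuming positive inputs, as the question does):

```cpp
#include <cstdint>

int Divide(int x, int y) {
  static_assert(sizeof(int) <= 4, "assumes int is at most 32 bits");
  int ret = 0, i = 32;
  // Widen y *before* shifting; int64_t is 64-bit regardless of how wide
  // the platform's `long` is, so the shift by 32 is well-defined.
  int64_t j = static_cast<int64_t>(y) << i;
  while (x >= y) {
    while (x < j) --i, j >>= 1;
    ret += 1 << i, x -= j;
  }
  return ret;
}
```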


With optimization enabled, I think the same things would still hold true: clang looks like it's compiling to similar asm just without the store/reload. So it's effectively / de-facto defining the behaviour of 1<<32 to just compile to an x86 shift instruction.

But I didn't test, that's just from a quick look at the asm https://godbolt.org/z/M33vqGj5P and noting things like mov r8d, 1 ; shl r8d, cl (32-bit operand-size) ; add eax, r8d

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
2

I'm keeping this answer up for now as the comments are useful.

Bathsheba
  • 231,907
  • 34
  • 361
  • 483
  • Copy-pasting from my comment below: Oh, I definitely know this is incorrect, and potentially may cause overflow (the compiler warns me as much). But as far as correctness goes, the function works when compiled on Clang 13.1.6 (which I'm using). I was curious where the time overhead was coming from. – Min Mar 17 '22 at 08:54
  • 1
    Stack Overflow sometimes gets a bad press. Occasionally a good page such as this one on `cdqe` more than makes up for things: https://stackoverflow.com/questions/54618685/what-is-the-meaning-use-of-the-movzx-cdqe-instructions-in-this-code-output-by-a – Bathsheba Mar 17 '22 at 08:56
  • Thanks, this was the answer I was looking for! Re: bad press, I have myself to blame for posting code with UB and not clarifying my intentions clearly :) – Min Mar 17 '22 at 08:59
  • 1
    UB compiler reasoning. "There can't possibly be any UB in the code" reasons the compiler. But look at `y << 32`. Well that must mean that `y` is not a 32 bit signed type. So it must be a 64 bit signed type since I don't have anything in between at my disposal. – Bathsheba Mar 17 '22 at 09:00
  • 1
    I've tested a bit further, and it looks like the while loop is where most of the time is spent, not when `y` is initialized. The additional `cdqe` instruction doesn't seem to be significant enough to warrant the total increase in wall time. – Min Mar 17 '22 at 13:42
  • 1
    A debug build means the compiler can't see the UB at compile time. It forgets about variable values between statements, and even spills them to memory for consistent debugging. [Why does clang produce inefficient asm with -O0 (for this simple floating point sum)?](https://stackoverflow.com/q/53366394). If the version @Min benchmarked is really what they linked on Godbolt, and showed asm snippets from, the UB is completely irrelevant. x86 SHL is well defined for all input values: with 32-bit operand-size it masks the shift count with `&31`, so the UB version compiles to `j = (long)(y << 0)` – Peter Cordes Mar 17 '22 at 14:36
  • 1
    That may affect how many iterations the loop runs, and probably also correctness of the final result... But I'd hardly say clang "made a mess", it just sign-extended the 32-bit result of the shift, doing exactly what the source said to do. With `static_cast`, it sign extends the input and then shifts that. (The sign-extension is unnecessary for a shift count of 32, all the extra bits created by sign-extension get shifted out. But it's not optimizing across statements so it doesn't know the count.) Optimized builds from clang (https://godbolt.org/z/hzE7aErY7) both keep those semantics. – Peter Cordes Mar 17 '22 at 14:38
  • 3
    @Bathsheba: Posted an answer: the compiler didn't "get confused", it just did what it was told, resulting in a repeated-subtraction loop which happens to produce the right answer. – Peter Cordes Mar 17 '22 at 15:19
2
int Divide(int x, int y) {
  int ret = 0, i = 32;
  long j = y << i;

On most systems, the size of int is 32 bits or less. Left-shifting a signed integer by a number of bits equal to or greater than its width results in undefined behaviour. Don't do this. Since the program is broken, it's irrelevant whether it's slower or faster.

Sidenote: Left-shifting a signed 32-bit integer by 31 or fewer bits may also be undefined if the shift changes the leftmost bit due to arithmetic overflow.

eerorika
  • 232,697
  • 12
  • 197
  • 326
  • Oh, I definitely know this is incorrect, and potentially may cause overflow (the compiler warns me as much). But as far as correctness goes, the function works when compiled on Clang 13.1.6 (which I'm using). I was curious where the time overhead was coming from. – Min Mar 17 '22 at 08:53
  • 4
    @Min That's the danger of UB. Program may *appear* to "work" when it actually doesn't work. You just got unlucky. – eerorika Mar 17 '22 at 08:54
  • Precisely. When UB doesn't blow up, that's unlucky. – sehe Mar 24 '22 at 15:07