
I have written the following very simple code which I am experimenting with in godbolt's compiler explorer:

#include <cstdint>

uint64_t func(uint64_t num, uint64_t den)
{
    return num / den;
}

GCC produces the following output, which I would expect:

func(unsigned long, unsigned long):
        mov     rax, rdi
        xor     edx, edx
        div     rsi
        ret

However Clang 13.0.0 produces the following, involving shifts and a jump even:

func(unsigned long, unsigned long):                              # @func(unsigned long, unsigned long)
        mov     rax, rdi
        mov     rcx, rdi
        or      rcx, rsi
        shr     rcx, 32
        je      .LBB0_1
        xor     edx, edx
        div     rsi
        ret
.LBB0_1:
        xor     edx, edx
        div     esi
        ret

When using uint32_t, clang's output is once again "simple" and what I would expect.

It seems this might be some sort of optimization, since clang 10.0.1 produces the same output as GCC, however I cannot understand what is happening. Why is clang producing this longer assembly?

Gary Allen

2 Answers


The assembly checks whether either num or den is at least 2**32: it ORs the two operands together, shifts the result right by 32 bits, and tests whether what remains is 0. If so, both values fit in 32 bits. Depending on the outcome, either a 64-bit division (div rsi) or a 32-bit division (div esi) is performed.

Presumably this code is generated because the compiler writers decided that the cost of the extra check and potential branch is outweighed by the savings from avoiding an unnecessarily wide 64-bit division.
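In C++ terms, the branch clang emits corresponds roughly to the following hand-written sketch (this is my illustration of the transformation, not clang's actual source):

```cpp
#include <cstdint>

// Rough C++ equivalent of the branch clang emits: if both operands
// fit in 32 bits, a narrower (and on pre-Ice Lake Intel CPUs, faster)
// division can be used without changing the result.
uint64_t func(uint64_t num, uint64_t den)
{
    if (((num | den) >> 32) == 0) {
        // Both values fit in 32 bits: the 32-bit quotient equals
        // the 64-bit one, so the cheaper `div esi` path is safe.
        return static_cast<uint32_t>(num) / static_cast<uint32_t>(den);
    }
    // At least one operand needs the full 64 bits: `div rsi`.
    return num / den;
}
```

Both paths compute the same quotient; the check only decides which division width the hardware performs.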

Botje
  • Ah, I see it so clearly now haha. Thanks a lot! – Gary Allen Dec 07 '21 at 10:03
  • On another note, any clue why MSVC is using the stack so much for their equivalent? https://godbolt.org/z/Gnnzq84TW @Botje – Gary Allen Dec 07 '21 at 10:05
  • 2
    @GaryAllen: you forgot to look at MSVC's warning output: `ignoring unknown option '-O3'`. MSVC only supports `-O2` and `-Ox` (and 0 and 1). – Peter Cordes Dec 07 '21 at 10:07
  • 2
    And yes, this is optimizing for the common case of small integers in wide variables, because Intel CPUs before Ice Lake are a lot slower at `[i]div r64` than `r32`, even for the same numeric inputs. [uops for integer DIV instruction](https://stackoverflow.com/q/63153788) (AMD, and Ice Lake and later, don't have that problem so there's no benefit, that's why clang doesn't do this for `-mtune=znver1` https://godbolt.org/z/YnWa74bbf, and shouldn't for `-mtune=icelake-client` but unfortunately still does; in those cases it's pure downside because the hardware takes care of it.) @GaryAllen – Peter Cordes Dec 07 '21 at 10:11
  • 1
    @bartop: MSVC does understand `-O2`, in general accepting `-` not just DOS-style `/`. It's just that `3` isn't a valid optimization level, and unlike GCC and clang, it doesn't treat higher integers as max optimization. (e.g. clang `-O9` is currently the same as `-O3`.) – Peter Cordes Dec 07 '21 at 10:14

If I understand correctly, it just checks whether either of the operands is larger than 32 bits, and uses a different div for values that fit in 32 bits than for larger ones.

Karol T.