Why is x/10 optimized with an unnecessary shift when x has a restricted range?

Question

I have this function

long long int divideBy10(long long int a){
    return a / 10;
}

it's compiled to:

        mov     rax, rdi
        movabs  rcx, 7378697629483820647
        imul    rcx
        mov     rax, rdx
        shr     rax, 63
        sar     rdx, 2
        add     rax, rdx
        ret

if I add __builtin_assume(a > 0);

it's compiled to

        mov     rax, rdi
        movabs  rcx, -3689348814741910323
        mul     rcx
        mov     rax, rdx
        shr     rax, 3
        ret

The code is more efficient because it doesn't have to worry about negative signs. Now if I add __builtin_assume(a < 10000); I would've expected it to be compiled to one multiplication without shifts. but that's not the case.

I thought maybe the compiler only tracks if the number is positive or negative but

long long int noBranch(long long int a){
    __builtin_assume(a < 400);
    if( a < 500){
        return a;
    }
    return 0;
}

compiles to just a move instruction so the compiler is able to track these bounds.

Why are compilers not able to optimize the shift away? (this happens both in GCC and Clang).

This is the code, that I expected to be produced:

long long int d10(long long int a){
    __builtin_assume(a > 0);
    __builtin_assume(a < 10'000);
    __uint128_t e = a;
    e *= 0x199999999999999A;
    return e >> 64;

}

d10(long long):                                # @d10(long long)
        mov     rax, rdi
        movabs  rcx, 1844674407370955162
        mul     rcx
        mov     rax, rdx
        ret

Edit: I'm aware that you need a shift to be able to cover the whole range of long long int. I'm asking why can't compiler optimize the shift away if the range is restricted.

See also [Labor of division](https://ridiculousfish.com/blog/posts/labor-of-division-episode-i.html) for the exact mathematical reason. — Jester, Jun 10 '20 at 17:13
@Jester I assume that's the case, but why? for the specific case of 10 that's not true. note the __builtin_assume(a < 10'000); that bounds the number from above. — Slei, Jun 10 '20 at 17:14
May I ask which specific multiplicand works in that general case of (0, 10000) if you don't shift? — chris, Jun 10 '20 at 17:19
@chrs Good advice. Slei: do this "paper and pencil" style (even if using a calculator or octave or whatever as an arithmetic aid). It should illuminate that it's not so simple :) — Kuba hasn't forgotten Monica, Jun 10 '20 at 17:23
Yes, that does seem to work for the given range. Seems you are right, it could be used. — Jester, Jun 10 '20 at 17:31
I reopened and changed the title to make it clear that this question is not the same as how the optimization works in general. It was pretty easy to read the title and jump to that conclusion. — chris, Jun 10 '20 at 17:44
Consider converting your examples to `unsigned long`. Your original example has 2 shifts, including the sign-bit fixup (`shr rax, 63`) for negative integers, because signed division is harder. That *does* go away with a range limit, so I guess you're asking why GCC doesn't use a simpler multiplicative inverse for limited-range? With a different constant as well. Presumably just because it doesn't bother to look for that special case. I doubt value-range info is available in most code so most code wouldn't get any faster. But feel free to file a missed-optimization bug for gcc and clang. — Peter Cordes, Jun 10 '20 at 17:47
Super Micro peephole ops like this are generally beyond what day to day compiler authors are going to worry about. If you truly need that extra fractional cycle back from the shift, then by all means code it yourself :) — Michael Dorgan, Jun 10 '20 at 18:10
"why GCC doesn't use a simpler multiplicative inverse for limited-range?" Thanks, that's exactly my question . I guess you're right compiler developers would be more able to answer it. But wouldn't it be considered rude file a bug over it? — Slei, Jun 10 '20 at 18:14
@Slei: "less than perfect optimisation" isn't considered a bug, it's just the price you pay for other conveniences (portability, etc); so a bug report wouldn't be appropriate. The compiler developer might have a way to submit a "suggestion for optimizer improvement". — Brendan, Jun 10 '20 at 19:20
@Brendan Quite on the opposite; “missed optimisation” is a perfectly cromulent category of bug reports for compilers. — fuz, Jun 11 '20 at 11:03
For future reference, the faster div10 that's only exact for limited input ranges is described here: [Divide by 10 using bit shifts?](https://stackoverflow.com/a/5558614). It's only a little bit faster, and can be written by hand, so not really worth the compile-time to look for on every division. — Peter Cordes, Mar 29 '21 at 16:07
Re: the wasted `mov` instead of doing `mov rax, imm64` / `mul rdi`: [Is an extra move somehow faster when doing division-by-multiplication?](https://stackoverflow.com/q/61470328) — Peter Cordes, Mar 29 '21 at 16:15

Why is x/10 optimized with an unnecessary shift when x has a restricted range?

0 Answers0