2

Can someone help me understand benefit of using asm block for unsigned long long multiplication in terms of performance. It is related to competitive programming optimization. I guess it makes multiplication faster, but I can't understand code actually.

const int md = 998244353;
inline int mul(int a, int b)
{
#if !defined(_WIN32) || defined(_WIN64)
    return (int) ((long long) a * b % md);
#endif
    unsigned long long x = (long long) a * b;
    unsigned xh = (unsigned) (x >> 32), xl = (unsigned) x, d, m;
    asm(
            "divl %4; \n\t"
            : "=a" (d), "=d" (m)
            : "d" (xh), "a" (xl), "r" (md)
    );
    return m;
}
Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
  • 2
    Regardless of how it works, this code is most probably outdated when using a modern compiler, as they should generate a better or comparable output just from the naive version. – Bartek Banachewicz Oct 03 '18 at 20:38
  • 1
    Pro Tip: Compilers are written by a lot of *really* smart people. If you give them obvious code they do a really good job of giving you the best possible machine code. That means understandable code often turns into performant code. – NathanOliver Oct 03 '18 at 20:40
  • What compiler are you using? VS2017 won't compile it for various reasons. – Paul Sanders Oct 03 '18 at 20:44
  • 1
    FWIW, [GCC 8.2 is able to boil it down to just `imul` calls, while Clang still uses `idiv`](https://godbolt.org/z/CaTjkM). – Bartek Banachewicz Oct 03 '18 at 20:44
  • actually I want to understand how this code works ? – John McLane Oct 03 '18 at 20:51
  • 2
    @BartekBanachewicz: That's because you disabled optimization! See https://godbolt.org/z/qtQkxc for much better code from clang and gcc `-O3`. All modern compilers know how to use a multiplicative inverse for division by a compile-time constant, but gcc is the only one that does so even at `-O0`, IIRC. [Why does GCC use multiplication by a strange number in implementing integer division?](https://stackoverflow.com/q/41183935) – Peter Cordes Oct 03 '18 at 20:54
  • 3
    @NathanOliver: If only that would be true! Compilers generate good code usually, but almost never the best possible. – geza Oct 03 '18 at 20:57
  • This code is actually a win for 32-bit (where 64x64 => 128 multiplication isn't available so compilers use actual division), but loses badly on 64-bit. It should really use `__builtin_constant_p` to only use inline asm if either input is not a compile-time constant after inlining and constant-propagation. – Peter Cordes Oct 03 '18 at 21:00
  • To answer the specific question about what that code does. The non-MSVC code uses extended inline assembly that utilizes the `div` instruction to compute the remainder portion of the equation.The div instruction places the remainder of 64-bit division of EDX:EAX by a 32-bit register and places the quotient in EAX and remainder in EDX. This code assumes that the division won't overflow. – Michael Petch Oct 03 '18 at 21:01
  • @MichaelPetch: Why not post that as an answer? – Fred Larson Oct 03 '18 at 21:03

1 Answers1

5

This code is actually a speedup for 32-bit (where 64x64 => 128 multiplication isn't available so compilers use actual division, but loses badly on 64-bit where compilers do use a multiplicative inverse to avoid slow hardware division entirely. Why does GCC use multiplication by a strange number in implementing integer division?

Also, it should really use __builtin_constant_p to only use inline asm if either input is not a compile-time constant after inlining and constant-propagation.


But anyway, x86's div instruction does EDX:EAX / (src) => quotient(EAX) and divisor(EDX). See When and why do we sign extend and use cdq with mul/div?

The "a" and "d" constraints ask for the low and high halves of the 64-bit product in EAX and EDX respectively as inputs.

From the Godbolt compiler explorer:

const int md = 998244353;
int mul(int a, int b)
{
#ifdef __x86_64__ // FIXME: just use the asm if defined(i386) to exclude all others
    return (int) ((long long) a * b % md);
#else
    if(__builtin_constant_p(a) && __builtin_constant_p(b))
        return (int) ((long long) a * b % md);
      // clang's builtin_constant_p is evaled before inlining, derp

    unsigned long long x = (long long) a * b;
    unsigned xh = (unsigned) (x >> 32), xl = (unsigned) x, d, m;
    asm(
            "divl %4; \n\t"
            : "=a" (d), "=d" (m)
            : "d" (xh), "a" (xl), "r" (md)
    );
    return m;
#endif
}

int main() {
    return mul(1,2);
}

compiles as follows with gcc8.2 -O3 -m32:

mul(int, int):
    mov     eax, DWORD PTR [esp+8]
    mov     ecx, 998244353
    imul    DWORD PTR [esp+4]     # one-operand imul does EDX:EAX = EAX * src
    divl ecx;                     # EDX:EAX / ecx => EAX and EDX

    mov     eax, edx              # return the remainder
    ret

main:
    mov     eax, 2     # builtin_constant_p used the pure C, allowing constant-propagation
    ret

Notice that div is unsigned division, so this is doesn't match the C. The C is doing signed multiplication and signed division. This should probably use idiv, or cast the inputs to unsigned. Or maybe they really do want weird results with negative inputs for some strange reason.

So why can't compilers emit this on their own, without inline asm? Because if the quotient overflows the destination register (al/ax/eax/rax), it faults with a #DE (Divide Exception) instead of silently truncating like all other integer instructions.

A 64-bit / 32-bit => 32-bit division is only safe if you know the divisor is large enough for the possible dividends. (But even if it is, gcc still doesn't know look for this optimization. e.g. a * 7ULL / 9 can't possibly cause a #DE if done with a single mul and div, if a was a 32-bit type. But gcc will still emit a call to a libgcc helper function.)

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847