3

In the following example running a 32-bit ELF on a 64-bit architecture is faster and I don't understand why. I have tried with two examples one using a division the other one with a multiplication. The performance is as expected, however, the performance for the division is surprizing.

We see on the assembly that the compiler is calling _alldiv which emulates a 64-bit division on a 32-bit architecture, so it must be slower than simply using the assembly instruction idiv. So I don't understand the results I got:

My setup is: Windows 10 x64, Visual Studio 2019

To time the code I use Measure-Command { .\out.exe }:

  • Multiplication
    • 32-bit ELF: 3360 ms
    • 64-bit ELF: 1469 ms
  • Division
    • 32-bit ELF: 7383 ms
    • 64-bit ELF: 8567 ms

Code

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <limits.h>
#include <Windows.h>

volatile int64_t m = 32;
volatile int64_t n = 12;
volatile int64_t result;

int main(void)
{
    for (size_t i = 0; i < (1 << 30); i++)
    {
#       ifdef DIVISION
        result = m / n;
#       else 
        result = m * n;
#       endif
        m += 1;
        n += 3;
    }
}

64-bit disassembly (division)

    for (size_t i = 0; i < (1 << 30); i++)
00007FF60DA81000  mov         r8d,40000000h  
00007FF60DA81006  nop         word ptr [rax+rax]  
    {
        result = m / n;
00007FF60DA81010  mov         rcx,qword ptr [n (07FF60DA83038h)]  
00007FF60DA81017  mov         rax,qword ptr [m (07FF60DA83040h)]  
00007FF60DA8101E  cqo  
00007FF60DA81020  idiv        rax,rcx  
00007FF60DA81023  mov         qword ptr [result (07FF60DA83648h)],rax  
        m += 1;
00007FF60DA8102A  mov         rax,qword ptr [m (07FF60DA83040h)]  
00007FF60DA81031  inc         rax  
00007FF60DA81034  mov         qword ptr [m (07FF60DA83040h)],rax  
        n += 3;
00007FF60DA8103B  mov         rax,qword ptr [n (07FF60DA83038h)]  
00007FF60DA81042  add         rax,3  
00007FF60DA81046  mov         qword ptr [n (07FF60DA83038h)],rax  
00007FF60DA8104D  sub         r8,1  
00007FF60DA81051  jne         main+10h (07FF60DA81010h)  
    }
}

32-bit disassembly (division)

    for (size_t i = 0; i < (1 << 30); i++)
00A41002  mov         edi,40000000h  
00A41007  nop         word ptr [eax+eax]  
    {
        result = m / n;
00A41010  mov         edx,dword ptr [n (0A43018h)]  
00A41016  mov         eax,dword ptr ds:[00A4301Ch]  
00A4101B  mov         esi,dword ptr [m (0A43020h)]  
00A41021  mov         ecx,dword ptr ds:[0A43024h]  
00A41027  push        eax  
00A41028  push        edx  
00A41029  push        ecx  
00A4102A  push        esi  
00A4102B  call        _alldiv (0A41CD0h)  
00A41030  mov         dword ptr [result (0A433A0h)],eax  
00A41035  mov         dword ptr ds:[0A433A4h],edx  
        m += 1;
00A4103B  mov         eax,dword ptr [m (0A43020h)]  
00A41040  mov         ecx,dword ptr ds:[0A43024h]  
00A41046  add         eax,1  
00A41049  mov         dword ptr [m (0A43020h)],eax  
00A4104E  adc         ecx,0  
00A41051  mov         dword ptr ds:[0A43024h],ecx  
        n += 3;
00A41057  mov         eax,dword ptr [n (0A43018h)]  
00A4105C  mov         ecx,dword ptr ds:[0A4301Ch]  
00A41062  add         eax,3  
00A41065  mov         dword ptr [n (0A43018h)],eax  
00A4106A  adc         ecx,0  
00A4106D  mov         dword ptr ds:[0A4301Ch],ecx  
00A41073  sub         edi,1  
00A41076  jne         main+10h (0A41010h)  
    }
}

Edit

To investigate further as Chris Dodd, I have slightly modified my code as follow:

volatile int64_t m = 32000000000;
volatile int64_t n = 12000000000;
volatile int64_t result;

This time I have these results:

  • Division
    • 32-bit ELF: 22407 ms
    • 64-bit ELF: 17812 ms
nowox
  • 25,978
  • 39
  • 143
  • 293
  • Be curious of the dump of _alldiv to see what it is doing? Could be a cached table of divides for small integer. idiv is a known pig on x86 so perhaps that is faster? – Michael Dorgan Aug 07 '19 at 17:49
  • 2
    @MichaelDorgan It seems `__alldiv` is: http://www.jbox.dk/sanos/source/lib/lldiv.asm.html – nowox Aug 07 '19 at 17:53
  • Double check that optimizations were turned on for this build? – Michael Dorgan Aug 07 '19 at 20:24
  • The only way i see this being faster is if the integer multiplier was smart enough to change those divides into fixed point muls. I am stumped. – Michael Dorgan Aug 07 '19 at 20:29
  • 1
    MSVC can make ELF executables? WSL can't even run 32-bit Linux/ELF executables (only x86-64), so how did you even run this? Also, why introduce a store/reload memory dependency by making both input variables `volatile` as well as incrementing them? – Peter Cordes Aug 08 '19 at 04:07
  • Possible duplicate of [What is the reasoning behind x64 having such a different performance result from x86?](//stackoverflow.com/q/44556614) where it mostly boils down to 32-bit code checking and using one 32-bit `div` instead of one 64-bit `div` for those input numbers. But that question has many levels of libraries and JIT to wade through before getting to that issue. – Peter Cordes Aug 08 '19 at 04:13
  • 1
    @nowox: Slightly OT note: In case you don't know it already, you should have a look at Agner Fog's homepage: https://agner.org/optimize/. It contains a lot of valueable resources regarding this topic... – andreee Aug 08 '19 at 07:23
  • 1
    Why are you using the term "[ELF](https://en.wikipedia.org/wiki/Executable_and_Linkable_Format)"? That's the name of an executable format used on Linux and Unix. Is that just a typo for EXE / executable? Windows uses the PE/COFF executable format for normal Windows `.exe` files like you're probably making, not ELF. – Peter Cordes Aug 08 '19 at 07:28

1 Answers1

5

If you look at instruction timings for x86 processors, it turns out that on recent Intel processors, a 64-bit divide is 3-4x as expensive as a 32-bit divide -- and if you look at the internals of alldiv (link in the comments above), for your values which will always fit in 32 bits, it will use a single 32-bit divide...

Chris Dodd
  • 119,907
  • 13
  • 134
  • 226
  • 1
    [Clang has the same capability](https://stackoverflow.com/q/54438477/995714) and will check if the operands fit in 32 bits to use the 32-bit div – phuclv Aug 08 '19 at 03:58
  • Near duplicate for 64-bit `div` being much slower than 32-bit `div` on Intel CPUs: [Trial-division code runs 2x faster as 32-bit on Windows than 64-bit on Linux](//stackoverflow.com/q/29983453). – Peter Cordes Aug 08 '19 at 04:12