1

The following code is x64 compiled by g++ 5.4.0. The left is the compiled output. The stuff on the right is what I expected it to look like. Granted the stuf on the right may not be syntactically correct. This code is supposed to essentially do :

if(i % 3 == 0) do_stuff

400512 would take u to the space after the if block

mov    -0x4(%rbp),%ecx              mov -0x4(%rbp), %eax
mov    $0x55555556,%edx             idvq $0x3
mov    %ecx,%eax                    cmp %edx, 0x0
imul   %edx                         jnz 400512 <main+0x3c>
mov    %ecx,%eax
sar    $0x1f,%eax
sub    %eax,%edx
mov    %edx,%eax
mov    %eax,%edx
add    %edx,%edx
add    %eax,%edx
mov    %ecx,%eax
sub    %edx,%eax
test   %eax,%eax
jne    400512 <main+0x3c>

My question for those who are a lot smarter then me: Why the heck do g++ have so much more to calculate modulus, and can someone explain to me what it is doing.

Jester
  • 56,577
  • 4
  • 81
  • 125
  • I may have found the answer where the risc operations take about 1/2 cycles it looks like the idvq may take 10 or so, still investigating – Thomas Watters Dec 05 '16 at 23:45
  • It's optimized for speed, not size ;) Division is slow. See [Labor of division](http://ridiculousfish.com/blog/posts/labor-of-division-episode-i.html) – Jester Dec 05 '16 at 23:45
  • ah thx for the reference i will certainly take a look – Thomas Watters Dec 05 '16 at 23:47
  • 1
    Of course you could just use a counter and get even better code probably ;) – Jester Dec 05 '16 at 23:48
  • lol that is where this all started actually, boss is claiming his answer to an interview question is faster, i was looking at the assembly code at the two when i noticed that the instruction for modulus was a lot more then i thought – Thomas Watters Dec 05 '16 at 23:50
  • Jester that reference was perfect thanks – Thomas Watters Dec 06 '16 at 00:07
  • The code looks on the left looks like it could be improved, do you have optimization turned on? – Ross Ridge Dec 06 '16 at 00:07
  • no optimization was turned off or the entire function gets optimized out – Thomas Watters Dec 06 '16 at 00:09
  • Well, you get better code with optimization turned on: https://godbolt.org/g/h4jRBi – Ross Ridge Dec 06 '16 at 00:13
  • @RossRidge I tried to further shorten it... posted it as answer. ;) – Ped7g Dec 06 '16 at 12:58
  • 1
    gcc at `-O0` still uses its modular multiplicative inverse generator. Some other compilers at `-O0` will emit the naive `idiv` (you're correct, there is a syntax error: IDIV doesn't have an immediate form. But same difference). See http://stackoverflow.com/a/33284629/224132: gcc `-O0` doesn't mean no optimization. 64-bit IDIV is very slow. See [my answer on the recent Collatz-conjecture asm question](http://stackoverflow.com/questions/40354978/why-is-this-c-code-faster-than-my-hand-written-assembly-for-testing-the-collat/40355466#40355466) – Peter Cordes Dec 07 '16 at 05:43

1 Answers1

0

For my curiosity, the Ross Ridge's comment modified for unsigned int produces even shorter code:

bar(unsigned int):
    mov     eax, edi
    mov     edx, -1431655765
    mul     edx
    shr     edx
    lea     eax, [rdx+rdx*2]
    cmp     edi, eax
    je      .L4
    rep ret
.L4:
    jmp     foo()

But back to the int version... it's neat ... for a compiler, but it still feels somewhat...

Wait, does it do the correct division of 3 and then multiplies it back and then checks if the equal number was calculated? Hmmm... But we want just mod 3 == 0 test, right?

So then this should do too:

bar(int):
    mov     eax, edi
    mov     edx, 1431655766
    imul    edx
; don't +1 of div result for negative edi
;    mov     eax, edi  
;    sar     eax, 31
;    sub     edx, eax
; instead do +3 for all results
    lea     eax, [rdx+rdx*2+3]
    sub     eax, edi   ; eax = 0, 1, 2 or 3 (!)
    jpe     .L4        ; so parity-even condition will resolve it (!)
    rep ret
.L4:
    jmp     foo()

Unfortunately this is special case "mod 3" optimization only (and I'm not sure whether this really improves anything)... :) But I couldn't resist to "golf" it a bit.

Ped7g
  • 16,236
  • 3
  • 26
  • 63
  • 1
    Your first snippet is still for `int`. With an `unsigned int` argument, it substitutes `mul` for `imul` and elides the `sub`. Perhaps you copied and pasted the wrong overload? – Cody Gray - on strike Dec 06 '16 at 14:52
  • @CodyGray yeah, thank you, I sort of got confused halfway into editing this, and mixed up two separate train of thoughts... should make sense now. – Ped7g Dec 06 '16 at 15:10
  • Now I wonder, why I bothered with `sub`, the `cmp` would update PF too, right? – Ped7g Dec 06 '16 at 15:15
  • Yeah, a `cmp` would update flags in exactly the same way as a `sub`. But you aren't doing a `cmp` here anyway, so there's no performance win in swapping a `sub` for a `cmp`. The only advantage would be if you needed to preserve the destination register. (Actually, you can't use a `cmp` here, because you need the result to be in `eax` so you can return it!) – Cody Gray - on strike Dec 06 '16 at 16:58