
I wrote a simple multiplication function in C, and another in assembly code, using GCC's "asm" keyword.

I measured the execution time of each, and although the times are pretty close, the C function is a little faster than the assembly one.

I would like to know why, since I expected the asm one to be faster. Is it because of the extra "call" (I'm not sure that's the right word) to GCC's "asm" keyword?

Here is the C function:

int multiply (int a, int b){return a*b;}

And here is the asm one in the C file:

int asmMultiply(int a, int b){  
    asm ("imull %1,%0;"
             : "+r" (a)           
             : "r" (b)
    );
    return a;
}

My main, where I take the times:

#include <stdio.h>
#include <time.h>

int main(){
    int n = 50000;
    clock_t asmClock = clock();
    while(n>0){
        asmMultiply(4,5);
        n--;
    }

    asmClock = clock() - asmClock;
    double asmTime = ((double)asmClock)/CLOCKS_PER_SEC;

    clock_t cClock = clock();
    n = 50000;
    while(n>0){
        multiply(4,5);
        n--;
    }
    cClock = clock() - cClock;
    double cTime = ((double)cClock)/CLOCKS_PER_SEC;

    printf("Asm time: %f\n",asmTime);
    printf("C code time: %f\n",cTime);
    return 0;
}

Thanks!

Weirdo
  • Show how you measured the elapsed time in your program. – EsmaeelE Feb 24 '19 at 07:03
  • Maybe similar https://stackoverflow.com/questions/9601427/is-inline-assembly-language-slower-than-native-c-code – EsmaeelE Feb 24 '19 at 07:08
  • *"Is it because of the extra "call" (i don't know what word to use) to the GCC's "asm" keyword?"* - no, it's because your asm is slow. Don't compete with the compiler over trivial code; it will beat you in 99% of cases with perfect machine code (often optimized so well that it may confuse you and **look** slow, if your machine knowledge is not at the required level of expertise and you have naive assumptions about how modern x86 works). Bump your machine knowledge (assuming beginner level from your question's wording and content, get to "expert" or "master"), use some moderately complex C source, and then you can win. – Ped7g Feb 24 '19 at 07:45
  • You can go to e.g. https://godbolt.org/ and paste in your code (set optimization to -O3). You'll soon see that the compiler generates "better" code than you. That's normal these days - long gone are the days where it was easy to beat the compiler. – Support Ukraine Feb 24 '19 at 07:46
  • BTW in 32b you can get similar performance to C with `... { asm ("imull %1,%0;" : "+r" (a) : "m" (b) ); return a; }` .. in 64b with `... { asm ("imull %1,%0;" : "+r" (a) : "r" (b) ); return a; }` and I'm not sure if there's a simple way to unify these two, nor am I 100% sure I got the constraints correct and that there's no UB lingering somewhere, waiting to bite back when the source gets more complex. (this is just to show that you may eventually reach a similar level of machine code with inline assembly, but it's more to show you how painful it is than a "solution" or "advice" on how to do it). – Ped7g Feb 24 '19 at 08:00
  • related: [C++ code for testing the Collatz conjecture faster than hand-written assembly - why?](//stackoverflow.com/a/40356449). If either `a` or `b` are compile-time constants after inlining, the multiply might actually be done with an LEA. – Peter Cordes Feb 24 '19 at 18:41
  • @Ped7g: give the compiler a choice of register, memory, or immediate with `asm("imul %1, %0" : "+r"(a) : "rme"(b) );` GCC is good at this, but clang will usually choose memory if it's an option, even if that means spilling a register var first :/ But no, this still doesn't give you equal performance to C if either input was a constant that could be done with one `LEA` or shift, like `9` or `5`, or a power of 2. Or folded into an add as part of an LEA. (But potentially similar, sure). **https://gcc.gnu.org/wiki/DontUseInlineAsm** – Peter Cordes Feb 24 '19 at 18:43
  • @Weirdo: after this edit, stand-alone versions of both functions should compile identically with optimization enabled. **How did you time them to find that your asm version was still slower**? Did you disable optimization? That might explain some extra cost, because the compiler isn't even trying to make fast code. – Peter Cordes Feb 24 '19 at 20:52
  • 1) asm isn't automatically faster than compiled C; you have to outperform the compiler for that to be true. 2) there are a lot of gotchas in trying to time benchmarks, and you have not provided enough information to show really anything about your question: what the compiler produced in each case, how you timed it, whether the measurement was the problem rather than the code under test, etc. 3) examine the compiler output for each of your cases; the answer should be right there, no timing tests required. 4) try real asm. – old_timer Feb 24 '19 at 21:13
  • You keep editing your question with different benchmarking code, but you still haven't shown any actual time results, compiler version / options, or hardware info. C doesn't exist in a vacuum, the compiler version / options and hardware all matter. Your updated code still doesn't do anything to stop `multiply()` from optimizing away completely. See [CppCon 2015: Chandler Carruth "Tuning C++: Benchmarks, and CPUs, and Compilers! Oh My!"](https://www.youtube.com/watch?v=nXaxk27zwlk) – Peter Cordes Feb 25 '19 at 02:01

2 Answers


The assembly function is doing more work than the C function — it's initializing `mult`, then doing the multiplication and assigning the result to `mult`, and then pushing the value from `mult` into the return location.

Compilers are good at optimizing; you won't easily beat them on basic arithmetic.

If you really want improvement, use `static inline int multiply(int a, int b) { return a * b; }`. Or just write `a * b` (or the equivalent) in the calling code instead of `int x = multiply(a, b);`.
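As an illustration of that suggestion (mine, not code from the answer; assuming GCC or Clang with -O2): with the definition visible in the same translation unit, the call inlines and a constant-argument call folds to a constant at compile time.

#include <stdio.h>

/* Definition visible to the caller, so the compiler can inline it. */
static inline int multiply(int a, int b) { return a * b; }

int main(void) {
    int x = multiply(4, 5);   /* at -O2 this typically folds to the constant 20; no imul is emitted */
    printf("%d\n", x);
    return 0;
}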

Jonathan Leffler
  • As the constraint for `mult` says `=`, the compiler actually knows that the zero is discarded without use, so it'll remove that `mult=0;` assignment. Where it does lose is the strict requirements for source registers and the overall "blackboxing" of the simple `imul` instruction, making it impossible for the optimizer to figure out what is going on, while it knows really well how to build "optimal" machine code for trivial multiplication (put it against a more complex calculation/source and the result will quickly drop from "optimal" to "very good" levels, giving a human some chance, given enough effort). – Ped7g Feb 24 '19 at 08:07
  • Note that the OP chose `ebx` for the source operand, requiring the compiler to save it before use. – fuz Feb 24 '19 at 12:35

This attempt to microbenchmark is too naive in almost every way possible for you to get any meaningful results.

Even if you fixed the surface problems (so the code didn't optimize away), there are major deep problems before you can conclude anything about when your asm would be better than `*`.

(Hint: probably never. Compilers already know how to optimally multiply integers, and understand the semantics of that operation. Forcing it to use imul instead of auto-vectorizing or doing other optimizations is going to be a loss.)


Both timed regions are empty because both multiplies can optimize away. (The asm is not asm volatile, and you don't use the result.) You're only measuring noise and/or CPU frequency ramp-up to max turbo before the clock() overhead.
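As an illustration of that point (a common trick, not code from the original answer): feed the result into an empty asm statement so the compiler has to actually produce it. Note that a constant call like multiply(4,5) would still fold to 20 at compile time, so a runtime-varying operand is used here.

#include <stdio.h>
#include <time.h>

/* Empty asm with the value as an input: the compiler must have the value in a
   register at this point, and "volatile" keeps the statement from being deleted. */
static inline void sink(int value) {
    asm volatile("" : : "r"(value));
}

int multiply(int a, int b) { return a * b; }

int main(void) {
    clock_t start = clock();
    for (int n = 0; n < 50000; n++)
        sink(multiply(n, 5));      /* runtime operand, so the result can't be constant-folded */
    printf("C time: %f\n", (double)(clock() - start) / CLOCKS_PER_SEC);
    return 0;
}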

And even if they weren't, a single imul instruction is basically unmeasurable next to the overhead of a function call like clock(). Maybe if you serialized with lfence to force the CPU to wait for imul to retire, before rdtsc... See RDTSCP in NASM always returns the same value
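A sketch of that kind of fenced TSC read (my illustration, not from the answer; it assumes x86 with the GCC/Clang intrinsics header, and on AMD CPUs lfence is only guaranteed to be dispatch-serializing when the relevant MSR bit is set):

#include <stdint.h>
#include <x86intrin.h>          /* __rdtsc(), _mm_lfence() on GCC/Clang, x86 only */

/* Read the TSC with lfence on both sides: earlier instructions must finish
   before the read, and later ones can't start executing before it. */
static inline uint64_t rdtsc_fenced(void) {
    _mm_lfence();
    uint64_t t = __rdtsc();
    _mm_lfence();
    return t;
}

Even then, one imul is lost in the measurement overhead; you'd time a long chain of them and divide by the count.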

Or you compiled with optimization disabled, which is pointless.


You basically can't measure a C `*` operator vs. inline asm without some kind of context involving a loop. And then it will be for that context, dependent on what optimizations you defeated by using inline asm. (And on what, if anything, you did to stop the compiler from optimizing away the work for the pure C version.)

Measuring only one number for a single x86 instruction doesn't tell you much about it. You need to measure latency, throughput, and front-end uop cost to properly characterize its cost. Modern x86 CPUs are superscalar, out-of-order, and pipelined, so the combined cost of 2 instructions depends on whether they're dependent on each other, and on other surrounding context. See How many CPU cycles are needed for each assembly instruction?
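As a sketch of what measuring latency vs. throughput separately can look like (my illustration, assuming x86-64 GCC/Clang, with approximate cycle numbers for recent Intel/AMD cores):

#include <stdint.h>

/* Latency: each imul waits for the previous result, so the loop runs at
   roughly one imul latency (~3 cycles) per iteration. */
uint64_t imul_latency_chain(uint64_t x, uint64_t m, long iters) {
    for (long i = 0; i < iters; i++)
        asm("imul %1, %0" : "+r"(x) : "r"(m));
    return x;
}

/* Throughput: three independent chains overlap in the out-of-order pipeline,
   so the cost per imul approaches its throughput (~1 per cycle). */
uint64_t imul_throughput(uint64_t a, uint64_t m, long iters) {
    uint64_t x = a, y = a + 1, z = a + 2;
    for (long i = 0; i < iters; i++) {
        asm("imul %1, %0" : "+r"(x) : "r"(m));
        asm("imul %1, %0" : "+r"(y) : "r"(m));
        asm("imul %1, %0" : "+r"(z) : "r"(m));
    }
    return x ^ y ^ z;
}

Timing each loop over a large iteration count and dividing gives two very different per-instruction numbers, which is exactly why a single figure is not a meaningful "cost".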


The stand-alone definitions of the functions compile identically, after your change to let the compiler pick registers, and your asm could inline somewhat efficiently, but it's still optimization-defeating. gcc knows that 5*4 = 20 at compile time, so if you did use the result, multiply(4,5) could optimize to an immediate 20. But gcc doesn't know what the asm does, so it just has to feed it the inputs at least once. (non-volatile means it can CSE the result if you used asmMultiply(4,5) in a loop, though.)

So among other things, inline asm defeats constant propagation. This matters even if only one of the inputs is a constant and the other is a runtime variable. Many small integer multipliers can be implemented with one or two LEA instructions or a shift (with lower latency than the 3c for imul on modern x86).
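For example (my illustration, not code from the question), gcc -O2 for x86-64 typically compiles a multiply by 5 into a single LEA rather than an imul:

int times5(int a) { return a * 5; }

        # gcc -O2, x86-64 (typical output):
        leal    (%rdi,%rdi,4), %eax     # a + 4*a = 5*a, one single-uop LEA
        ret

The inline-asm version is pinned to imul (3-cycle latency) no matter what the operands turn out to be.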

https://gcc.gnu.org/wiki/DontUseInlineAsm

The only use-case I could imagine asm helping is if a compiler used 2x LEA instructions in a situation that's actually front-end bound, where imul $constant, %[src], %[dst] would let it copy-and-multiply with 1 uop instead of 2. But your asm removes the possibility of using immediates (you only allowed register constraints), and GNU C inline asm doesn't let you use a different template for an immediate vs. a register arg. Maybe if you used multi-alternative constraints and a matching register constraint for the register-only part? But no, you'd still have to have something like asm("%2, %1, %0" :...) and that can't work for reg,reg.

You could use `if(__builtin_constant_p(a)) { asm using imul-immediate } else { return a*b; }`, which would work with GCC to let you defeat LEA. Or just require a constant multiplier anyway, since you'd only ever want to use this for a specific gcc version to work around a specific missed-optimization. (i.e. it's so niche that in practice you wouldn't ever do this.)
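A rough sketch of that first idea, checking whether the multiplier is a compile-time constant (my illustration, assuming GCC with optimization enabled so the branch containing the immediate-only asm is removed when it doesn't apply; as the answer says, it's so niche you wouldn't actually do this):

static inline int mul_maybe_imm(int a, int b) {
    if (__builtin_constant_p(b)) {
        int dst;
        /* "n" requires b to be an immediate, so this emits
           imul $b, %reg_a, %reg_dst: copy-and-multiply in one instruction. */
        asm("imul %2, %1, %0" : "=r"(dst) : "r"(a), "n"(b));
        return dst;
    }
    return a * b;   /* non-constant multiplier: let the compiler choose */
}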


Your code on the Godbolt compiler explorer, with clang7.0 -O3 for the x86-64 System V calling convention:

# clang7.0 -O3   (The functions both inline and optimize away)
main:                                   # @main
    push    rbx
    sub     rsp, 16
    call    clock
    mov     rbx, rax                 # save the return value
    call    clock
    sub     rax, rbx                 # end - start time
    cvtsi2sd        xmm0, rax
    divsd   xmm0, qword ptr [rip + .LCPI2_0]
    movsd   qword ptr [rsp + 8], xmm0 # 8-byte Spill


    call    clock
    mov     rbx, rax
    call    clock
    sub     rax, rbx             # same block again for the 2nd group.

    xorps   xmm0, xmm0
    cvtsi2sd        xmm0, rax
    divsd   xmm0, qword ptr [rip + .LCPI2_0]
    movsd   qword ptr [rsp], xmm0   # 8-byte Spill
    mov     edi, offset .L.str
    mov     al, 1
    movsd   xmm0, qword ptr [rsp + 8] # 8-byte Reload
    call    printf
    mov     edi, offset .L.str.1
    mov     al, 1
    movsd   xmm0, qword ptr [rsp]   # 8-byte Reload
    call    printf
    xor     eax, eax
    add     rsp, 16
    pop     rbx
    ret

TL;DR: if you want to understand inline asm performance at this fine-grained level of detail, you need to understand how compilers optimize in the first place.

Peter Cordes