
I have written some C++ code to time C++ and inline assembler code. I was initially just having fun with it, but then I noticed that every time I ran my program, I got different results. Sometimes the C++ was faster, sometimes the inline assembler code was faster, and sometimes they were all the same.

What is going on here?

Here's the code with the program output:

#define TRIALS 1000000
#include <iostream>
#include <chrono>
using namespace std;
typedef std::chrono::high_resolution_clock Clock;
int main()
{
  auto t1 = Clock::now();
  auto t2 = Clock::now();
  int X3=17;
  int X2=17;
  int X4=17;
  int X=17;



  int sum=0;
  int avg=0;
  cout << "=================================" << endl;
  cout << "| var*=10;                      |" << endl;
  cout << "=================================" << endl;

  for( int i=0; i<TRIALS; i++ )
    {
      X3=17;
      t1 = Clock::now();  
      X3*=10;
      t2 = Clock::now();
      sum+=chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count();
    }
  avg=sum/TRIALS;
  cout << "| Product:  " << X3<< "  "<< avg << " nanoseconds |" << endl;
  cout << "=================================" << endl;
  cout << endl << endl;

  avg=sum=0;
  cout << "=================================" << endl;
  cout << "| use inline assembler with shl |" << endl;
  cout << "=================================" << endl;

  for( int i=0; i<TRIALS; i++ )
    {
      X=17;
      t1 = Clock::now();
      asm /*volatile*/ (
            "movl %0, %%eax;" // X->ax
            "shll %%eax;"// ax*=2
            "movl %%eax, %%ebx;" // ax->bx
            "shll %%eax;" // ax*=2
            "shll %%eax;" // ax*=2
            "add %%ebx, %%eax;" // bx+ax->ax
            : "=a" (X)
            : "a" (X)
            : "%ebx"
            );
      t2 = Clock::now();
      sum+=chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count();
    }
  avg=sum/TRIALS;
  cout << "| Product:  " << X << "  "<< avg << " nanoseconds |" << endl;
  cout << "=================================" << endl;
  cout << endl << endl;
  avg=sum=0;

  cout << "=================================" << endl;
  cout << "| var=var*10                    |" << endl;
  cout << "=================================" << endl;

  for( int i=0; i<TRIALS; i++ )
    {
      X2=17;
      t1 = Clock::now();
      X2=X2*10;
      t2 = Clock::now();
      sum+=chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count();
    }
  avg=sum/TRIALS;
  cout << "| Product:  " << X2<< "  "<< avg << " nanoseconds |" << endl;
  cout << "=================================" << endl;
  cout << endl << endl;

  avg=sum=0;


  cout << "=================================" << endl;
  cout << "| use inline assembler with mul |" << endl;
  cout << "=================================" << endl;
  for( int i=0; i<TRIALS; i++ )
    {
      X4=17;
      t1 = Clock::now();
      asm  (
    "movl %0, %%eax;" // X->ax
    "movl $0x0A, %%ebx;" // 10->bx
    "mul %%ebx;" // 10*ax->ax
    : "=a" (X4)
    : "a" (X4)
    : "%ebx"
    );
      t2 = Clock::now();
      sum+=chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count();
    }
  avg=sum/TRIALS;
  cout << "| Product:  " << X4<< "  "<< avg << " nanoseconds |" << endl;
  cout << "=================================" << endl;
  cout << endl;

  return(0);
}

PROGRAM OUTPUT #1:

=================================
| var*=10;                      |
=================================
| Product:  170  50 nanoseconds |
=================================


=================================
| use inline assembler with shl |
=================================
| Product:  170  50 nanoseconds |
=================================


=================================
| var=var*10                    |
=================================
| Product:  170  50 nanoseconds |
=================================


=================================
| use inline assembler with mul |
=================================
| Product:  170  50 nanoseconds |
=================================

OUTPUT #2:

=================================
| var*=10;                      |
=================================
| Product:  170  62 nanoseconds |
=================================


=================================
| use inline assembler with shl |
=================================
| Product:  170  57 nanoseconds |
=================================


=================================
| var=var*10                    |
=================================
| Product:  170  59 nanoseconds |
=================================


=================================
| use inline assembler with mul |
=================================
| Product:  170  58 nanoseconds |
=================================
  • 1) always only benchmark code built with optimizations enabled. 2) don't write inline asm, your compiler can almost always do it better (if the optimizer is enabled). 3) fluctuations in the nanosecond/millisecond range could easily be caused by the environment the program runs in rather than the program itself. – Jesper Juhl Feb 28 '19 at 17:45
  • You're going to get different results every time you run the code, as your computer is in a different state every time you run it. – NathanOliver Feb 28 '19 at 17:46
  • @NathanOliver is correct. Even a very basic process can make a 12 ns difference in other processes. – JSON Feb 28 '19 at 17:54
  • That's a pretty useless benchmark if just the overhead from the time measurement exceeds the cost of the payload by several orders of magnitude. Btw, it's the syscall behind `Clock::now()` you profiled mostly. – Ext3h Feb 28 '19 at 17:55
  • Sounds like a duplicate of https://stackoverflow.com/questions/53252050/why-does-the-call-latency-on-clock-gettimeclock-realtime-vary-so-much and https://stackoverflow.com/questions/13772567/get-cpu-cycle-count. – Hadi Brais Feb 28 '19 at 18:41
  • Thanks folks - I thought the environment might be the culprit. Hadi: It's not really a duplicate of those q's because I'm asking if folks know why there was such fluctuation in the amounts of time: in any given run of the program, any one of the four methods could be the fastest (or the same as the others) or the slowest. Jesper: I wouldn't normally write inline assembler - I was just having some fun and trying to learn how it's done. – Michael Pellegrino Feb 28 '19 at 19:05
  • @JesperJuhl "Always only benchmark code built with optimizations enabled." Why is it bad to benchmark if the code doesn't have optimizations? – Rietty Feb 28 '19 at 19:24
  • @Rietty Because debug builds of your standard library usually have tons of extra sanity-checking code enabled that is useful for debugging but can slow it down by orders of magnitude. Also, your own code will be significantly slower in unoptimized builds, and benchmarking that is pointless since, often, most of the code slowing you down will be outright removed in optimized builds anyway. – Jesper Juhl Feb 28 '19 at 19:51
  • Got it, thank you for the knowledge! – Rietty Feb 28 '19 at 19:53
  • Your hand-written asm is amusingly horrible. (That's normal for beginners.) The only options worth considering are `asm("imul $10, %1, %0" : "=r"(output) : "rm" (input))` (3 cycle latency on Intel or Ryzen), or an LEA + add to do `*5 *2` like `lea (%rax, %rax,4), %eax` / `add %eax, %eax`. (2 cycle latency on Intel, 2 uops. Or 3 cycle latency on AMD, where LEA with a scaled index has extra latency even if there are only 2 components.) See https://agner.org/optimize/. Of course inline asm defeats constant-propagation optimizations, and compilers know the LEA trick already. – Peter Cordes Mar 02 '19 at 06:31
  • @Peter Cordes - clarification: The question isn't about how the code segments should be programmed, it's about WHY the segments of code are sometimes faster than the others and sometimes slower - your reply (which amusingly lacks people skills) does not answer the question. – Michael Pellegrino Mar 03 '19 at 12:12
  • That's why I posted it as a comment, not an answer, since Ralf has already posted one. I thought you might be interested to know about the good options for multiplying in asm. (Avoid one-operand `mul` / `imul` unless you need the high-half output.) – Peter Cordes Mar 03 '19 at 19:53
  • @Peter Cordes - Thanks Peter. The actual routines are just something to measure - they don't do anything useful. – Michael Pellegrino Mar 03 '19 at 20:34

1 Answer


These are tips rather than 'just' a solution:

1) crank up TRIALS by orders of magnitude to actually measure something in the seconds range

2) repeat the measurement several times (n=100 or more) and take the average (and the error of the average = rms/sqrt(n) if you care about statistics)

3) actually measure something you want to measure: at least put only the code you are interested in into the TRIALS loop, i.e.:

t1 = Clock::now();  
for( int i=0; i<TRIALS; i++ )
    {
     ... only code relevant for your calculation here ...
    }
t2 = Clock::now();
sum = chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count();

4) and finally, consider the godbolt compiler explorer service https://godbolt.org/ where you can check the assembler output of your code for various optimizer settings. For code as simple as yours (I tried), it just emits (using -O3): `mov eax,170`. So you see: the compiler is smart, and you can't easily beat it with inline assembler. This is all the more true for non-trivial examples.

Ralf Ulrich