Why is my assembly code much slower than the C implementation?

Question

I am learning assembly, so I wrote a routine that returns the square root of its input if the input is non-negative, and 0 otherwise.

I have implemented the routine in both assembly and C, and I would like to understand why my C routines compiled with -O2 are much faster than my assembly routine. The disassembled code for the C routines looks slightly more complex than my assembly routine, so I don't understand where I am going wrong.

The assembly routine (srt.asm):

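global srt
section .text
srt:
    pxor    xmm1, xmm1      ; xmm1 = 0.0
    comisd  xmm0, xmm1      ; compare x with 0.0
    jbe     P               ; x <= 0 (or NaN): return 0
    sqrtsd  xmm0, xmm0      ; x > 0: return sqrt(x)
    retq
P:
    pxor    xmm0, xmm0      ; return 0.0
    retq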

I am compiling the above as

nasm -g -felf64 srt.asm

The C routines (srtc.c):

#include <stdio.h>
#include <math.h>
#include <time.h>
extern double srt(double);

double srt1(double x)
{
    return sqrt( (x > 0) * x );
}

double srt2(double x)
{
    if( x > 0) return sqrt(x);
    return 0;
}


int main(void)
{
    double v = 0;
    clock_t start;
    clock_t end;
    double niter = 2e8;


    start = clock();
    v = 0;
    for( double i = 0; i < niter; i++ ) {
        v += srt(i);
    }
    end = clock();
    printf("time taken srt = %f v=%g\n", (double) (end - start)/CLOCKS_PER_SEC,v);

    start = clock();
    v = 0;
    for( double i = 0; i < niter; i++ ) {
        v += srt1(i);
    }
    end = clock();
    printf("time taken srt1 = %f v=%g\n", (double) (end - start)/CLOCKS_PER_SEC,v);

    start = clock();
    v = 0;
    for( double i = 0; i < niter; i++ ) {
        v += srt2(i);
    }
    end = clock();
    printf("time taken srt2 = %f v=%g\n", (double) (end - start)/CLOCKS_PER_SEC,v);

    return 0;
}

The above is compiled as

gcc -g -O2 srt.o -o srtc srtc.c -lm

The output of the program is

time taken srt = 0.484375 v=1.88562e+12
time taken srt1 = 0.312500 v=1.88562e+12
time taken srt2 = 0.312500 v=1.88562e+12

So my assembly routine is significantly slower.

The disassembled C code is

Disassembly of section .text:

0000000000000000 <srt1>:
   0:   f3 0f 1e fa             endbr64 
   4:   66 0f ef c9             pxor   xmm1,xmm1
   8:   66 0f 2f c1             comisd xmm0,xmm1
   c:   77 04                   ja     12 <srt1+0x12>
   e:   f2 0f 59 c1             mulsd  xmm0,xmm1
  12:   66 0f 2e c8             ucomisd xmm1,xmm0
  16:   66 0f 28 d0             movapd xmm2,xmm0
  1a:   f2 0f 51 d2             sqrtsd xmm2,xmm2
  1e:   77 05                   ja     25 <srt1+0x25>
  20:   66 0f 28 c2             movapd xmm0,xmm2
  24:   c3                      ret    
  25:   48 83 ec 18             sub    rsp,0x18
  29:   f2 0f 11 54 24 08       movsd  QWORD PTR [rsp+0x8],xmm2
  2f:   e8 00 00 00 00          call   34 <srt1+0x34>
  34:   f2 0f 10 54 24 08       movsd  xmm2,QWORD PTR [rsp+0x8]
  3a:   48 83 c4 18             add    rsp,0x18
  3e:   66 0f 28 c2             movapd xmm0,xmm2
  42:   c3                      ret    
  43:   66 66 2e 0f 1f 84 00    data16 nop WORD PTR cs:[rax+rax*1+0x0]
  4a:   00 00 00 00 
  4e:   66 90                   xchg   ax,ax

0000000000000050 <srt2>:
  50:   f3 0f 1e fa             endbr64 
  54:   66 0f ef c9             pxor   xmm1,xmm1
  58:   66 0f 2f c1             comisd xmm0,xmm1
  5c:   66 0f 28 d1             movapd xmm2,xmm1
  60:   77 0e                   ja     70 <srt2+0x20>
  62:   66 0f 28 c2             movapd xmm0,xmm2
  66:   c3                      ret    
  67:   66 0f 1f 84 00 00 00    nop    WORD PTR [rax+rax*1+0x0]
  6e:   00 00 
  70:   66 0f 2e c8             ucomisd xmm1,xmm0
  74:   66 0f 28 d0             movapd xmm2,xmm0
  78:   f2 0f 51 d2             sqrtsd xmm2,xmm2
  7c:   76 e4                   jbe    62 <srt2+0x12>
  7e:   48 83 ec 18             sub    rsp,0x18
  82:   f2 0f 11 54 24 08       movsd  QWORD PTR [rsp+0x8],xmm2
  88:   e8 00 00 00 00          call   8d <srt2+0x3d>
  8d:   f2 0f 10 54 24 08       movsd  xmm2,QWORD PTR [rsp+0x8]
  93:   48 83 c4 18             add    rsp,0x18
  97:   66 0f 28 c2             movapd xmm0,xmm2
  9b:   c3                      ret

What exact CPU model did you test on? Did you do any warm-up runs of anything else, before this C program started, to get CPU frequency up to speed? ([Idiomatic way of performance evaluation?](https://stackoverflow.com/q/60291987)) — Peter Cordes, Jun 25 '21 at 22:41
And BTW, the reason for GCC's extra code when inlining `sqrt()` that conditionally calls into the libm function is because you didn't use `-fno-math-errno`. See [How to force GCC to assume that a floating-point expression is non-negative?](https://stackoverflow.com/q/57673825) - you should generally always use that — Peter Cordes, Jun 25 '21 at 22:42
I am running my code on WSL. /proc/cpuinfo shows the model name as i5-8365u. I ran my code multiple times and got similar results each time. I also ran the assembly version last, instead of running it first. The numbers were similar in all cases. — Arin Chaudhuri, Jun 25 '21 at 22:44
I tested on my i7-6700k Skylake (same microarchitecture as your Kaby Lake); I can repro the performance effect even with warm-up runs, so that doesn't seem to be it. Probably after inlining into the loop, GCC can optimize away some work; have a look at the actual asm for `main` since you didn't use `__attribute__((noinline,noclone))` on your C functions. — Peter Cordes, Jun 25 '21 at 22:45
Oh right, simply being a non-inline function is the problem. x86-64 System V doesn't have any call-preserved XMM registers, so the add dependency chain through `v` includes a store/reload for `srt()`, but not when srt1 or srt2 inline. — Peter Cordes, Jun 25 '21 at 22:49
After disabling inlining, I get similar performance. Thanks. If you make your comment an answer I will accept it. — Arin Chaudhuri, Jun 25 '21 at 23:00

score 1 · Accepted Answer · answered Jul 01 '21 at 17:36

Peter Cordes's comment explains what is happening here: srt1 and srt2 are inlined into the loops in main, while srt, which lives in a separate object file, is not. Quoting Peter Cordes:

Oh right, simply being a non-inline function is the problem. x86-64 System V doesn't have any call-preserved XMM registers, so the add dependency chain through `v` includes a store/reload for `srt()`, but not when srt1 or srt2 inline.

Once inlining of srt1 and srt2 is disabled, all three versions perform similarly.
