0

While benchmarking two functions I noticed that the execution times do not make sense. In one function I am returning the argument, while calculating the sqrt of the argument in the other.

Isolated Source:

#include <math.h>
#include <stdio.h>
#include <time.h>

float impl(unsigned long n) { return n; }

float impl_sqrt(unsigned long n) { return sqrt(n); }

void bench(float (*fun)(unsigned long), size_t iter, unsigned long n) {
    struct timespec begin, end;
    clock_gettime(CLOCK_MONOTONIC, &begin);

    for (size_t i = 0; i < iter; i++) {
        (*fun)(n);
    }

    clock_gettime(CLOCK_MONOTONIC, &end);
    double tdiff =
        (end.tv_sec - begin.tv_sec) + 1e-9 * (end.tv_nsec - begin.tv_nsec);
    printf("Time %f seconds\n", tdiff);
}

int main(void) {
    const size_t iter = 100000000;
    fputs("Constant\n", stderr);
    bench(&impl, iter, 271);
    fputs("Sqrt\n", stderr);
    bench(&impl_sqrt, iter, 271);
}

Compiler: Apple clang version 14.0.3 (clang-1403.0.22.14.1) Target: arm64-apple-darwin22.1.0 Thread model: posix

aarch64 Apple M1 Pro

On macOS with clang -O0

$ ./a.out
Constant
Time 0.216895 seconds
Sqrt
Time 0.216063 seconds

On macOS with clang -O0 -fno-builtin-sqrt

$ ./a.out
Constant
Time 0.216224 seconds
Sqrt
Time 0.253717 seconds

How is it possible that returning a constant is slower than calculating its sqrt? Looking at the disassembled source I cannot see any optimisation done by the compiler that would justify the performance difference.

Executing the program with time stopping yields unexpected results. How is this optimisation done, and where is it done?

Here parts of the disassembled binary from ghidra:

int _bench(code *param_1,ulong param_2,undefined8 param_3)

{
  int iVar1;
  ulong local_50;
  long local_48;
  long local_40;
  long local_38;
  long local_30;
  undefined8 local_28;
  ulong local_20;
  code *local_18;
  
  local_28 = param_3;
  local_20 = param_2;
  local_18 = param_1;
  _clock_gettime(6,&local_38);
  for (local_50 = 0; local_50 < local_20; local_50 = local_50 + 1) {
    (*local_18)(local_28);
  }
  _clock_gettime(6,&local_48);
  NEON_fmadd(0x3e112e0be826d695,(double)(local_40 - local_30),(double)(local_48 - local_38));
  iVar1 = _printf("Time %f seconds\n");
  return iVar1;
}
                             **************************************************************
                             *                          FUNCTION                          *
                             **************************************************************
                             undefined _bench()
             undefined         w0:1           <RETURN>
             undefined8        Stack[-0x10]:8 local_10                                XREF[2]:     100003e08(W), 
                                                                                                   100003ebc(*)  
             undefined8        Stack[-0x18]:8 local_18                                XREF[2]:     100003e10(W), 
                                                                                                   100003e48(R)  
             undefined8        Stack[-0x20]:8 local_20                                XREF[2]:     100003e14(W), 
                                                                                                   100003e34(R)  
             undefined8        Stack[-0x28]:8 local_28                                XREF[2]:     100003e18(W), 
                                                                                                   100003e4c(R)  
             undefined8        Stack[-0x30]:8 local_30                                XREF[1]:     100003e88(R)  
             undefined8        Stack[-0x38]:8 local_38                                XREF[1]:     100003e78(R)  
             undefined8        Stack[-0x40]:8 local_40                                XREF[1]:     100003e84(R)  
             undefined8        Stack[-0x48]:8 local_48                                XREF[1]:     100003e74(R)  
             undefined8        Stack[-0x50]:8 local_50                                XREF[4]:     100003e28(W), 
                                                                                                   100003e30(R), 
                                                                                                   100003e58(R), 
                                                                                                   100003e60(W)  
             undefined8        Stack[-0x58]:8 local_58                                XREF[2]:     100003ea0(W), 
                                                                                                   100003ea4(R)  
             undefined8        Stack[-0x60]:8 local_60                                XREF[1]:     100003eac(W)  
                             _bench                                          XREF[3]:     Entry Point(*), 
                                                                                          entry:100003f14(c), 
                                                                                          entry:100003f3c(c)  
       100003e04 ff 83 01 d1     sub        sp,sp,#0x60
       100003e08 fd 7b 05 a9     stp        x29,x30,[sp, #local_10]
       100003e0c fd 43 01 91     add        x29,sp,#0x50
       100003e10 a0 83 1f f8     stur       x0,[x29, #local_18]
       100003e14 a1 03 1f f8     stur       x1,[x29, #local_20]
       100003e18 a2 83 1e f8     stur       x2,[x29, #local_28]
       100003e1c c0 00 80 52     mov        w0,#0x6
       100003e20 e1 a3 00 91     add        x1,sp,#0x28
       100003e24 4b 00 00 94     bl         _clock_gettime                                   undefined _clock_gettime()
       100003e28 ff 0b 00 f9     str        xzr,[sp, #local_50]
       100003e2c 01 00 00 14     b          LAB_100003e30
                             LAB_100003e30                                   XREF[2]:     100003e2c(j), 100003e64(j)  
       100003e30 e8 0b 40 f9     ldr        x8,[sp, #local_50]
       100003e34 a9 03 5f f8     ldur       x9,[x29, #local_20]
       100003e38 08 01 09 eb     subs       x8,x8,x9
       100003e3c e8 37 9f 1a     cset       w8,cs
       100003e40 48 01 00 37     tbnz       w8,#0x0,LAB_100003e68
       100003e44 01 00 00 14     b          LAB_100003e48
                             LAB_100003e48                                   XREF[1]:     100003e44(j)  
       100003e48 a8 83 5f f8     ldur       x8,[x29, #local_18]
       100003e4c a0 83 5e f8     ldur       x0,[x29, #local_28]
       100003e50 00 01 3f d6     blr        x8
       100003e54 01 00 00 14     b          LAB_100003e58
                             LAB_100003e58                                   XREF[1]:     100003e54(j)  
       100003e58 e8 0b 40 f9     ldr        x8,[sp, #local_50]
       100003e5c 08 05 00 91     add        x8,x8,#0x1
       100003e60 e8 0b 00 f9     str        x8,[sp, #local_50]
       100003e64 f3 ff ff 17     b          LAB_100003e30
                             LAB_100003e68                                   XREF[1]:     100003e40(j)  
       100003e68 c0 00 80 52     mov        w0,#0x6
       100003e6c e1 63 00 91     add        x1,sp,#0x18
       100003e70 38 00 00 94     bl         _clock_gettime                                   undefined _clock_gettime()
       100003e74 e8 0f 40 f9     ldr        x8,[sp, #local_48]
       100003e78 e9 17 40 f9     ldr        x9,[sp, #local_38]
       100003e7c 08 01 09 eb     subs       x8,x8,x9
       100003e80 02 01 62 9e     scvtf      d2,x8
       100003e84 e8 13 40 f9     ldr        x8,[sp, #local_40]
       100003e88 e9 1b 40 f9     ldr        x9,[sp, #local_30]
       100003e8c 08 01 09 eb     subs       x8,x8,x9
       100003e90 01 01 62 9e     scvtf      d1,x8
       100003e94 08 00 00 90     adrp       x8,0x100003000
       100003e98 00 c1 47 fd     ldr        d0,[x8, #0xf80]=>DAT_100003f80                   = 3E112E0BE826D695h
       100003e9c 00 08 41 1f     fmadd      d0,d0,d1,d2
       100003ea0 e0 07 00 fd     str        d0,[sp, #local_58]
       100003ea4 e0 07 40 fd     ldr        d0,[sp, #local_58]
       100003ea8 e8 03 00 91     mov        x8,sp
       100003eac 00 01 00 fd     str        d0,[x8]=>local_60
       100003eb0 00 00 00 90     adrp       x0,0x100003000
       100003eb4 00 20 3e 91     add        x0=>s_Time_%f_seconds_100003f88,x0,#0xf88        = "Time %f seconds\n"
       100003eb8 2c 00 00 94     bl         _printf                                          int _printf(char * param_1, ...)
       100003ebc fd 7b 45 a9     ldp        x29=>local_10,x30,[sp, #0x50]
       100003ec0 ff 83 01 91     add        sp,sp,#0x60
       100003ec4 c0 03 5f d6     ret


float _impl(ulong param_1)

{
  return (float)(unkuint9)param_1;
}
                             //
                             // __text 
                             // __TEXT
                             // ram:100003dbc-ram:100003f4f
                             //
                             **************************************************************
                             *                          FUNCTION                          *
                             **************************************************************
                             undefined _impl()
             undefined         w0:1           <RETURN>
             undefined8        Stack[-0x8]:8  local_8                                 XREF[2]:     100003dc0(W), 
                                                                                                   100003dc4(R)  
                             _impl                                           XREF[2]:     Entry Point(*), 
                                                                                          entry:100003f08(*)  
       100003dbc ff 43 00 d1     sub        sp,sp,#0x10
       100003dc0 e0 07 00 f9     str        x0,[sp, #local_8]
       100003dc4 e0 07 40 fd     ldr        d0,[sp, #local_8]
       100003dc8 08 00 66 9e     fmov       x8,d0
       100003dcc 00 01 23 9e     ucvtf      s0,x8
       100003dd0 ff 43 00 91     add        sp,sp,#0x10
       100003dd4 c0 03 5f d6     ret

float _impl_sqrt(undefined8 param_1)

{
  double dVar1;
  
  dVar1 = (double)NEON_ucvtf(param_1);
  return (float)SQRT(dVar1);
}
                             **************************************************************
                             *                          FUNCTION                          *
                             **************************************************************
                             undefined _impl_sqrt()
             undefined         w0:1           <RETURN>
             undefined8        Stack[-0x8]:8  local_8                                 XREF[2]:     100003df4(W), 
                                                                                                   100003df8(R)  
                             _impl_sqrt                                      XREF[2]:     Entry Point(*), 
                                                                                          entry:100003f44(*)  
       100003df0 ff 43 00 d1     sub        sp,sp,#0x10
       100003df4 e0 07 00 f9     str        x0,[sp, #local_8]
       100003df8 e0 07 40 fd     ldr        d0,[sp, #local_8]
       100003dfc 00 d8 61 7e     ucvtf      d0,d0
       100003e00 00 c0 61 1e     fsqrt      d0,d0
       100003e04 00 40 62 1e     fcvt       s0,d0
       100003e08 ff 43 00 91     add        sp,sp,#0x10
       100003e0c c0 03 5f d6     ret


undefined8 entry(void)

{
  undefined *puVar1;
  
  puVar1 = PTR____stderrp_100004000;
  _fputs("Constant\n",*(FILE **)PTR____stderrp_100004000);
  _bench(_impl,100000000);
  _fputs("Sqrt\n",*(FILE **)puVar1);
  _bench(_impl_sqrt,100000000,0x10f);
  return 0;
}

hmelder
  • 13
  • 2
  • 1
    _"How is it possible that returning a constant is faster than calculating its sqrt?"_ - are you aware of just-how _expensive_ `sqrt` is to compute? Take any handheld calculator with a Sqrt button: if you were to use that button on an arbitrary value you'll notice the calculator completely freeze-up for a good 1-2 seconds. Read this: https://scicomp.stackexchange.com/questions/2168/what-is-the-computational-cost-of-sqrtx-in-standard-libraries – Dai Jul 11 '23 at 13:35
  • 3
    *How is it possible that returning a constant is faster than calculating its sqrt?* Hmmm. How would it ever be possible to *not be faster* if you do basically nothing compared to do some calculations? – Gerhardh Jul 11 '23 at 13:38
  • > are you aware of just-how expensive sqrt is to compute? Yes, but it would be interesting how it is optimised away in *O0*, as it could impact further benchmarking. – hmelder Jul 11 '23 at 13:39
  • @hmelder Benchmarking built-in libraries is a fool's errand, especially with optimizations disabled. You're wasting your time. – Dai Jul 11 '23 at 13:40
  • I suggest you compare the library `sqrt()` with another way of computing a √ instead of using a dummy function that loses precision from the passed vlaue. – Weather Vane Jul 11 '23 at 13:45
  • I suspect the same thing. Maybe the time measurement is just to inaccurate. – hmelder Jul 11 '23 at 13:46
  • I have annotated the decompiled binary. – hmelder Jul 11 '23 at 13:47
  • 3
    `-O0` makes the loop-counter increment a bottleneck ([load/add/store](https://stackoverflow.com/questions/53366394/why-does-clang-produce-inefficient-asm-with-o0-for-this-simple-floating-point) including [store-forwarding latency](https://stackoverflow.com/questions/49189685/adding-a-redundant-assignment-speeds-up-code-when-compiled-without-optimization)), a worse bottleneck than square-root throughput on modern CPUs, so they run at the same speed at `-O0` thanks to out-of-order exec. Unless you make the square-root function even less efficient by making it actually call the libm function. – Peter Cordes Jul 11 '23 at 13:47
  • Re: your update: that's not disassembly, that's decompilation back into C. The only thing that tells us which we didn't already know from `-O0` and the original source is that you're on an ARM CPU. Actually you did already mention that in text. – Peter Cordes Jul 11 '23 at 13:50
  • 1
    Is your actual question why `sqrt(n)` is *not* slower in the first test? Because it should be obvious that doing more work like `return sqrt(n)` instead of just `return n` can be slower, that would be the normal expectation. Your actual question title is the opposite, like you expected `return n` (which isn't a constant BTW) to be slower. Benchmarking at `-O0` is usually a mistake; store-forwarding latency often becomes a bottleneck. – Peter Cordes Jul 11 '23 at 13:53
  • @hmelder - This is my favorite "Usain Bolt walking" analogy - using `-O0` is asking the compiler **not** to make the code run fast. Then you check how fast it runs (walks)... – BoP Jul 11 '23 at 13:53
  • @PeterCordes I expected sqrt to be slower then n – hmelder Jul 11 '23 at 13:57
  • @PeterCordes I mean the clang --version output stated that I was running on aarch64. But forget to mention it explicitly. – hmelder Jul 11 '23 at 13:59
  • *I expected sqrt to be slower then n* - Good; that's what your benchmark shows when you make `sqrt` extra expensive by disabling inlining of it. So like I said, is your actual question why it's *not* slower in the first case? You haven't showed a test that found it being actually faster than just returning `n`. (0.216895 and 0.216063 are pretty much equal, within measurement noise and startup overhead like CPU frequency warm-up since you're testing it first within the same program without other warm-up; [Idiomatic way of performance evaluation?](https://stackoverflow.com/q/60291987)) – Peter Cordes Jul 11 '23 at 14:04
  • I assume the `_impl_sqrt` assembly listing is for the `-fno-builtin-sqrt` compilation? Can you also post the `_impl_sqrt` assembly listing for the default `-fbuiltin-sqrt` compilation? – Ian Abbott Jul 11 '23 at 14:05
  • @IanAbbott My bad. Updated the post. – hmelder Jul 11 '23 at 14:10
  • 1
    @PeterCordes https://stackoverflow.com/q/60291987 is useful. Thank you very much! – hmelder Jul 11 '23 at 14:16
  • 1
    Unrelated: `(*fun)(n)` and `fun(n)` are equivalent. – pmg Jul 11 '23 at 14:17
  • 1
    @Dai It depends if *sqrt* is computed in hardware or software. If it is computed in hardware it is about the same cost in time as hardware division. – Ian Abbott Jul 11 '23 at 14:30

2 Answers2

1

How is it possible that returning a constant is faster than calculating its sqrt?

Note that impl_sqrt() does more than "calculating the sqrt of the argument".

float impl_sqrt(unsigned long n) {
  return sqrt(n);
}

The above converts the unsigned long to double, calls sqrt(), then converts the double to float before returning.

The below only needs to convert the unsigned long to float.

float impl(unsigned long n) { return n; }

To skip the double to float conversion (and perhaps perform a faster square root and identify optimization potential),

float impl_sqrt_alt1(unsigned long n) {
  return sqrtf(n);
}

Of course this is without high optimization that may be able to see only a single value is used.

chux - Reinstate Monica
  • 143,097
  • 13
  • 135
  • 256
0

How is it possible that returning a constant is slower than calculating its sqrt?

The timings posted show that impl is actually faster than impl_sqrt, since it takes less time to perform 108 iterations when you tell the compiler to make no assumptions about the sqrt function. The linker however does expand the call to the sqrt() function as a single instruction.

If you let the compiler use its builtin knowledge of the sqrt() function, the call can be elided since the result is not used and the function is known to have no side effects.

The minuscule difference in timings for the first set of tests is well below the precision of this simplistic measurement: clock_gettime(CLOCK_MONOTONIC, &begin); returns an fairly good approximation of the wall clock time, so measuring the elapsed time also measures any other task the mac performs during the test, some of which may affect one test more than the other. Remember that macOS typically has more than 1000 threads active at any given time.

You should try and measure shorter tests and perform many measurements to find the smallest elapsed time and keep at most 2 significant digits.

Note also that benchmarking unoptimized code is not necessarily meaningful, nor representative of the performance of optimized code.

chqrlie
  • 131,814
  • 10
  • 121
  • 189