
I did a small benchmark on my M1 MacBook and got a strange result: the Intel binary runs faster than the arm64 binary. What is wrong with my experiment?

$ arch
arm64
$ make
arch -x86_64 cc fib.c -o x86.out
arch -arm64 cc fib.c -o arm64.out
file *.out
arm64.out: Mach-O 64-bit executable arm64
x86.out:   Mach-O 64-bit executable x86_64
time ./x86.out
Fibonacci 45 is 1134903170
        8.32 real         7.54 user         0.01 sys
time ./arm64.out
Fibonacci 45 is 1134903170
       10.21 real         9.77 user         0.01 sys

$ cat fib.c
#include <stdio.h>
int fibonacci(int n);
int main() {
  int n = 45;
  // Print n and the nth Fibonacci number
  printf("Fibonacci %i is %i\n", n, fibonacci(n));
  return 0;
}
int fibonacci(int n) {
  if (n == 0 || n == 1) {
    return n;
  } else {
    return fibonacci(n-1) + fibonacci(n-2);
  }
}

Per the comments, I have enabled -O3, but the Intel binary still runs slightly faster.

$ make
arch -x86_64 cc fib.c -o x86.out -O3
arch -arm64 cc fib.c -o arm64.out -O3
file *.out
arm64.out: Mach-O 64-bit executable arm64
x86.out:   Mach-O 64-bit executable x86_64
time ./x86.out
Fibonacci 45 is 1134903170
        3.46 real         3.33 user         0.00 sys
time ./arm64.out
Fibonacci 45 is 1134903170
        3.57 real         3.45 user         0.00 sys
anonaka
  • Looks like you don't have any optimization turned on. – Retired Ninja Jul 07 '21 at 03:00
  • x86-64 -> ARM64 binary translation is probably doing optimization, but `arch -arm64 cc` is making [anti-optimized / debug-mode](https://stackoverflow.com/questions/53366394/why-does-clang-produce-inefficient-asm-with-o0-for-this-simple-floating-point) native machine code that runs directly, without any later steps to fix that. – Peter Cordes Jul 07 '21 at 03:11
  • [edit] your question with your new info on `-O3` builds. That time difference barely looks statistically significant, just 3.46 real vs. 3.57 real. Is it repeatable even with warm-up runs to get the CPU clock speed and caches ramped up before you run? ([Idiomatic way of performance evaluation?](https://stackoverflow.com/q/60291987)) e.g. run `for i in {1..10}; do time ./a.out; done` and look at the fastest couple of runs for each binary. If there's a real effect, it's not big enough for the binary->binary optimizer to have replaced the dumb double-recursion... – Peter Cordes Jul 07 '21 at 03:16
  • This raises the question of how to see what ARM64 machine code the CPU is *actually* running when you execute the x86_64 binary. i.e. the results of dynamic translation. AFAIK MacOS uses a mostly(?) ahead-of-time translation, spending significant CPU time doing an optimizing translation once and caching it for reuse. – Peter Cordes Jul 07 '21 at 04:20
  • Also, is that still just the same single set of `-O3` runs you did before, without any of the warm-up / suggestions I mentioned? Running them in the opposite order can be useful. If your CPU is thermally limited, clock speed might drop off a tiny bit some time during testing, making the first one run faster in time if not cycles. So maybe monitor CPU frequency while this runs, or do a long warm-up run so your CPU reaches a steady-state. – Peter Cordes Jul 07 '21 at 04:29
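For reference, a minimal version of the warm-up and repeated-timing approach suggested in these comments might look like this (a sketch; assumes bash or zsh and the -O3 binaries built above):

$ ./arm64.out > /dev/null                       # warm-up run so the CPU clock and caches ramp up
$ for i in {1..10}; do time ./arm64.out; done   # compare the fastest runs of each binary,
$ for i in {1..10}; do time ./x86.out; done     # not just the first one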

1 Answer


Rosetta has been remarkably efficient. In many benchmarks, the M1 ran Intel code faster than any Intel Mac could manage.

There isn't documentation directly addressing the question, but in general Rosetta seems to be very good at translating Intel code to Arm. So it is quite possible that the translated version of the Intel binary is on par with the native Arm version, especially considering the relatively small difference in measured time. A margin that slim may not be measurable accurately.
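A margin that small is easier to judge with a statistical benchmark runner such as hyperfine (a sketch; assumes hyperfine is installed, e.g. via Homebrew):

$ hyperfine --warmup 3 './x86.out' './arm64.out'   # runs each binary repeatedly and reports mean ± stddev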

James Risner
  • That's comparing the same Intel binary on an M1+Rosetta vs. an Intel CPU. This question is keeping the CPU constant, and comparing Intel binary+Rosetta vs. native AArch64 binary produced by Xcode `cc -O3` (without `-march=native` or anything, in case that matters). Yes, it shows that Rosetta is efficient in general, but `cc` is clang (LLVM) which is also a good compiler. I wonder if this recursive Fibonacci is less efficient natively because of some calling-convention differences, like extra stuff in the stack frame required by Apple's native AArch64 convention? – Peter Cordes Oct 16 '22 at 02:38
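One way to dig into that would be to compare the assembly clang emits for each target and look at the prologue/epilogue of `fibonacci` (a sketch; the .s file names are arbitrary):

$ arch -arm64  cc -O3 -S fib.c -o fib_arm64.s    # native AArch64 asm
$ arch -x86_64 cc -O3 -S fib.c -o fib_x86_64.s   # x86-64 asm that Rosetta will translate
$ otool -tv arm64.out                            # or disassemble the linked arm64 binary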