Is there a resource on how many cycles SIMD instructions take on Apple M1/M2, like https://uops.info/table.html or Agner Fog's tables for x86? I wish I could give a bigger bounty, but that's all the rep I have.

I've never programmed on an ARM machine. I took a look at sse2neon:
https://github.com/DLTcollab/sse2neon/blob/7bd15eac51e36bf7426052f8515358cb665d8c04/sse2neon.h

The first thing I looked up was setzero. I doubted that dup was the way to go, so I tried nanobench and saw that xor was faster, and that sub wasn't the same either.

Is there something I can look up to get a rough idea? My target is the M2.

#include <arm_neon.h>
#define ANKERL_NANOBENCH_IMPLEMENT
#include "nanobench.h"

// Broadcast an immediate zero; this is what sse2neon uses for _mm_setzero_si128.
int32x4_t setzeroA()
{
    return vdupq_n_s32(0);
}
// Zero via v - v. Note: v is deliberately left uninitialized, which is
// undefined behavior, so the compiler is free to drop the instruction.
int32x4_t setzeroB()
{
    int32x4_t v;
    return vsubq_s32(v, v); // was vsubq_u32, which doesn't match int32x4_t
}
// Zero via v ^ v, the classic x86 xor idiom. Same uninitialized-read caveat.
uint8x16_t setzeroC()
{
    uint8x16_t v;
    return veorq_u8(v, v);
}

int main() {
    ankerl::nanobench::Bench().run("Set", [&] {
        auto v = setzeroA();
        ankerl::nanobench::doNotOptimizeAway(v);
    });
    ankerl::nanobench::Bench().run("sub", [&] {
        auto v = setzeroB();
        ankerl::nanobench::doNotOptimizeAway(v);
    });
    ankerl::nanobench::Bench().run("xor", [&] {
        auto v = setzeroC();
        ankerl::nanobench::doNotOptimizeAway(v);
    });
}
  • Microbenchmarking is hard, and those intrinsics probably all compile to the same asm. (If not, your compiler did a bad job.) If you want to test zeroing idioms, you'll probably have to write in asm, or make sure your `doNotOptimizeAway` also makes the compiler think the value is clobbered, so it has to re-materialize a zero. It could still do that with a `mov` from another register, though, hoisting the actual zeroing out of a loop if that's cheaper. – Peter Cordes Dec 07 '22 at 04:22
  • @PeterCordes I used clang, on the M2. I haven't tried to install gcc on it. They all give different numbers, so yeah, I guess it did a bad job. But I'm not sure what the rules are when you use intrinsics; I always thought the compiler doesn't optimize them. I've never read the nanobench source, but I like the printout it gives. It seems accurate. – Stan Dec 07 '22 at 04:33
  • https://godbolt.org/z/MzEhdzfca shows clang 15 using `movi v0.2d, #0000000000000000` for `setzeroA` and `setzeroC`, but no instructions at all for `setzeroB`; apparently it doesn't like that instance of read-uninitialized. Intrinsics aren't assembly language: just like your compiler might not use an `add` instruction for `4 + 5`, it might get the result of your intrinsic a different way if it understands what it does. (clang does for most intrinsics.) – Peter Cordes Dec 07 '22 at 04:43
  • @PeterCordes You're right. I repeated the test, and it appears whichever one runs last is fastest. I wonder if it goes from efficient core -> larger core -> higher clock. You're also right about B being a no-op. I fixed my code to use u8 everywhere, and it changes nothing (it's the same instructions). – Stan Dec 07 '22 at 14:55
  • That or idle vs. max frequency on the same core, if it doesn't run long enough to jump up to max frequency quickly. [Idiomatic way of performance evaluation?](https://stackoverflow.com/q/60291987) Or both effects. – Peter Cordes Dec 07 '22 at 14:57
  • @PeterCordes Have you used nanobench before? It suggests using pyperf to tune the system, and that gives me consistent results on my Linux x64 box. I completely forgot about the pitfalls and have no idea what's sufficient to warm up a Mac. I'll try to figure out what you wrote, like how much CPU time is needed. – Stan Dec 07 '22 at 15:26
  • No, I haven't used nanobench. I assume it's intended to run many different short benchmarks as part of a single program, so each individual test can assume the CPU is already warmed up to max frequency. That would explain the results, at least. So yeah, just run a dummy loop for a few hundred million iterations at the start of your program, not touching much or any memory unless your benchmark will. Maybe one that uses some NEON instructions, in case there are any separate warm-up effects for those (a sketch of such a warm-up follows this thread). – Peter Cordes Dec 07 '22 at 15:37
  • @PeterCordes It's fun. It includes an error rate to show how much variance there was during the test. When it starts, it tells you if turbo is on and other things that can cause the results to vary. It's a single-header library; the homepage has a hello-world-style example: https://github.com/martinus/nanobench I was looking through /usr/bin and noticed dappprof. Running with that gave me the same numbers across all my tests, no warm-up necessary. – Stan Dec 07 '22 at 15:47
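
To act on the advice in this thread, two things need handling before timing anything: get the core to max frequency first, and force the compiler to re-materialize the zero on every iteration instead of hoisting a single `movi` out of the loop. Below is a minimal sketch of both, assuming clang on AArch64; the empty inline-asm barrier and the iteration count are illustrative choices of mine, not nanobench APIs.

#include <arm_neon.h>

// Empty inline asm with a "+w" operand: tells the compiler the SIMD
// register holding v was modified, so a cached zero can't be reused.
static inline void clobber_vec(int32x4_t &v) {
    asm volatile("" : "+w"(v));
}

// Spin on a NEON dependency chain so the scheduler migrates the thread to
// a performance core and the clock ramps up before measurement starts.
static void warm_up() {
    int32x4_t acc = vdupq_n_s32(1);
    for (long i = 0; i < 300000000; ++i)
        acc = vaddq_s32(acc, acc);
    clobber_vec(acc); // keep the loop from being optimized away
}

Calling `warm_up()` once at the top of `main`, and `clobber_vec` on each result inside the lambdas, makes every zeroing idiom pay its full cost on every iteration, though as noted above the compiler could still `mov` from another zeroed register.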

1 Answer

These tables are from the M1, but I doubt anything major has changed with the M2.

Big: https://dougallj.github.io/applecpu/firestorm-simd.html

Little: https://dougallj.github.io/applecpu/icestorm-simd.html
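
A number from the LAT column of those tables can be sanity-checked with a serial dependency chain: each iteration consumes the previous result, so the loop runs at roughly one instruction latency per iteration. Here is a minimal sketch of that idea, my own rather than anything from those pages; the iteration count and the inline-asm keep-alive are illustrative.

#include <arm_neon.h>
#include <chrono>
#include <cstdio>

int main() {
    const long iters = 100000000;
    int32x4_t v = vdupq_n_s32(1);
    auto t0 = std::chrono::steady_clock::now();
    for (long i = 0; i < iters; ++i)
        v = vaddq_s32(v, v); // serial chain: each add waits for the last
    auto t1 = std::chrono::steady_clock::now();
    asm volatile("" : "+w"(v)); // keep v live so the loop isn't deleted
    double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
    std::printf("%.3f ns per dependent vaddq_s32\n", ns / iters);
}

Multiplying by the core's clock frequency converts that figure to cycles for comparison against the table.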

robthebloke
  • Yeah, there isn't as much variation in throughput and latency on the M1 as there is on Intel, so expect many ops to have similar latencies. The instruction you are looking for is FCMGT (which works on NEON SIMD registers, hence the argument being 'vector' in that documentation, as opposed to general-purpose integer registers). From the throughputs on both cores, it would appear that Icestorm has 4x64-bit execution ports for SIMD, vs 4x128-bit on Firestorm. Firestorm is about 3x quicker than Icestorm in general. – robthebloke Dec 07 '22 at 04:25
  • What's the difference between LAT and retire? Latency always seems to be larger or the same, and to me it means how long from issuing the instruction until I can use its result. I thought retire meant the same thing, but specifically for values retiring (as opposed to NOP or x86 `_mm_pause`, which don't produce a value). I don't understand what retire means on that page. – Stan Dec 08 '22 at 02:23
  • Retire count refers to the number of micro-ops the instruction is split into internally. LD2 with 4S (4x float) is a good example. The instruction loads 4x vec2 and converts them to SoA format: xxxx/yyyy. An alternative approach would be 2x vld1q_f32 + vzip1q_f32 + vzip2q_f32, and you'd have the same thing. LD2 has a retire count of 4 micro-ops internally, so you can be pretty sure it's implemented using those 4 instructions I've just mentioned (see the sketch after these comments). The only real advantage of the LD2 op is higher code density in your exe (less L1 instruction-cache use). – robthebloke Dec 08 '22 at 05:03
  • Latency is just the number of cycles the CPU requires before the result of the instruction is available. FDIV takes a lot longer to execute than FADD, for example. – robthebloke Dec 08 '22 at 05:05
  • (I should also mention that I'm a bit sceptical of the load/store latencies listed on that page). – robthebloke Dec 08 '22 at 05:13
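
To make the LD2 example from the comments concrete, here is a minimal sketch of the two equivalent de-interleaving loads, assuming AArch64 NEON intrinsics. One detail: splitting xyxy... pairs into xxxx/yyyy needs the vuzp (unzip) intrinsics; vzip interleaves in the opposite direction, from SoA back to AoS.

#include <arm_neon.h>

// One LD2 instruction: loads 8 floats (x0 y0 x1 y1 x2 y2 x3 y3) and
// de-interleaves them into val[0] = x0 x1 x2 x3, val[1] = y0 y1 y2 y3.
float32x4x2_t deinterleave_ld2(const float *p) {
    return vld2q_f32(p);
}

// The same result by hand: two plain loads plus two unzips, roughly the
// four micro-ops that LD2's retire count suggests it expands to internally.
float32x4x2_t deinterleave_manual(const float *p) {
    float32x4_t a = vld1q_f32(p);      // x0 y0 x1 y1
    float32x4_t b = vld1q_f32(p + 4);  // x2 y2 x3 y3
    float32x4x2_t r;
    r.val[0] = vuzp1q_f32(a, b);       // even elements: x0 x1 x2 x3
    r.val[1] = vuzp2q_f32(a, b);       // odd elements:  y0 y1 y2 y3
    return r;
}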