
I want to compare the speed of functions in C. This is my approach.

I measure the time of each implementation by running the calculation on the input array a fixed number of times and summing the time of each iteration, using the monotonic clock via clock_gettime.

    struct timespec startTime, endTime;
    double totalTime = 0.0;
    for (int i = 0; i < ITERATIONS; ++i) {
        clock_gettime(CLOCK_MONOTONIC, &startTime);

        doFun(input, output);
        clock_gettime(CLOCK_MONOTONIC, &endTime);

        double timeElapsed = (endTime.tv_sec - startTime.tv_sec)
                           + (endTime.tv_nsec - startTime.tv_nsec) / 1e9;

        totalTime += timeElapsed;
    }
    printf("Time:       %f \n", totalTime);
    -O0
    Native: 26.296245
    Quick:  34.653388
    SSE:    21.701504
    -O1
    Native: 11.036241
    Quick:  7.885706
    SSE:    2.043576
    -O2
    Native: 0.000776
    Quick:  0.000795
    SSE:    0.000761

I am really unsure if this is even the right approach to comparing the algorithm and its alternatives.

Especially at -O0 I don't understand how the algorithm is slower than the native function, although it should be up to 4 times faster, with the SSE variant ideally 4 times faster again.

At what optimization level should I compare the algorithms to have meaningful results and why do I not get the expected times at -O0?

  • https://stackoverflow.com/questions/2349776/how-can-i-benchmark-c-code-easily – Mitch Wheat Jul 06 '23 at 01:58
  • You may want `drand48` instead of `rand`. You may need to disassemble each function (yours and `sqrtf`) to see if the generated code is what you suppose (e.g. Carmack's version really does generate Carmack's version). What is the value of `ITERATIONS`? It should be enough to keep the total time below 1ms (the value of a timeslice). You should have an _outer_ loop to repeat the same test (with the same values) several times and take the _minimum_ time. That way, you eliminate the effects of timeslicing. – Craig Estey Jul 06 '23 at 01:59
  • With 1,200,000 entries, you're _not_ benchmarking the algorithm itself. You're also benchmarking the memory system. Memory fetches may overshadow your results. So, I'd _definitely_ cut down the number of iterations. And, as I mentioned, you want to ensure that you're _not_ timesliced away (e.g. your process starts, runs a bit, is put to sleep while another [unrelated] process is run, and then your process resumes). – Craig Estey Jul 06 '23 at 02:06
  • @CraigEstey: But don't cut the number of *total* iterations; instead use nested loops to do multiple passes over the same array so it can stay hot in L1d cache. (Perhaps using `asm("" ::: "memory")` in the outer loop to force the compiler to actually do the work each time, if it tries to do loop interchange.) – Peter Cordes Jul 06 '23 at 02:50
  • @bigTuna1337 - related: [Idiomatic way of performance evaluation?](https://stackoverflow.com/q/60291987) – Peter Cordes Jul 06 '23 at 03:28
  • @PeterCordes What you said is what I meant (sigh ;-) To clarify, I'd say do N trials of M iterations each and keep the minimum trial value. I've got a few answers with variations of my standard benchmark code. Here's the first one I found: [How to optimize "Find all possible triangles without duplicates" code](https://stackoverflow.com/a/64922786/5382650) Additional: [How to implement a timer for every second with zero nanosecond with liburing?](https://stackoverflow.com/a/61716588/5382650) – Craig Estey Jul 06 '23 at 03:31
  • Half of the text doesn't make sense after your edit, e.g. what native function? What SSE variant? You should roll it back. Also, benchmarking at `-O0` is usually meaningless, with more C statements making things slower due to store/reload overhead of everything being sort of like `volatile`. [How can I optimize these loops (with compiler optimization disabled)?](https://stackoverflow.com/q/32000917) / [Why does clang produce inefficient asm with -O0 (for this simple floating point sum)?](https://stackoverflow.com/q/53366394) It's normal that `-O0` performance is unrelated to anything useful. – Peter Cordes Jul 11 '23 at 19:56

1 Answer


Benchmarking like this isn't really all that helpful. Your benchmark results for -O2 are very close together because you're benchmarking your memory bandwidth, not your CPU performance. Benchmarking at -O0 and -O1 doesn't make a great deal of sense tbh. Also make sure you have -ffast-math enabled.

Anyhow, using boring old `sqrtf`:

#include <math.h>

void test_sqrt(float* __restrict a, const float* __restrict b) {
    for(int i = 0; i < 256; ++i) {
        a[i] = 1.0f / sqrtf(b[i]);
    }
}

The compiler unrolls the loop so that 32 floats are processed per iteration (I've only included the asm for 8 of those floats).

        // load
        vmovups ymm2, ymmword ptr [rsi + 4*rax]

        // 1/sqrt
        vrsqrtps        ymm3, ymm2 // latency 4, TP = 1

        // refine
        vmulps  ymm2, ymm2, ymm3 // latency 4, TP = 2
        vfmadd213ps     ymm2, ymm3, ymm0 // latency 4, TP = 2
        vmulps  ymm3, ymm3, ymm1 // latency 4, TP = 2
        vmulps  ymm2, ymm3, ymm2 // latency 4, TP = 2

        // output
        vmovups ymmword ptr [rdi + 4*rax], ymm2

Things to note:

  1. The compiler has removed the division, and is using rsqrt plus Newton-Raphson refinement of the result.
  2. The compiler has vectorised the code to be 8 wide (so an improvement over your SSE version).
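The refinement the compiler emits is a Newton-Raphson step on y ≈ 1/√x. As a scalar C sketch (a hypothetical helper, just to show the math behind the mulps/fma sequence below):

```c
// One Newton-Raphson refinement step for y ~= 1/sqrt(x):
//     y' = y * (1.5 - 0.5 * x * y * y)
// Each step roughly doubles the number of correct bits in y.
static float refine_rsqrt(float x, float y) {
    return y * (1.5f - 0.5f * x * y * y);
}
```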

The quake3 rsqrt (original version):

        vmovdqu ymm3, ymmword ptr [rsi + 4*rcx]
    
        vpsrad  ymm7, ymm3, 1 // latency 4, TP = 1
        vpsubd  ymm7, ymm0, ymm7  // latency 1, TP = 3
        vmulps  ymm3, ymm3, ymm1  // latency 4, TP = 2
        vmulps  ymm11, ymm7, ymm7  // latency 4, TP = 2
        vfnmadd213ps    ymm11, ymm3, ymm2   // latency 4, TP = 2
        vmulps  ymm3, ymm11, ymm7  // latency 4, TP = 2

        vmovups ymmword ptr [rdx + 4*rcx], ymm3
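For reference, that asm is vectorised from the well-known Quake 3 routine. A sketch of it in C, using memcpy for the type pun rather than the original's strict-aliasing-violating pointer cast:

```c
#include <stdint.h>
#include <string.h>

// Quake 3's fast inverse square root, with memcpy replacing the
// original's undefined-behaviour pointer cast.
float Q_rsqrt(float number) {
    float x2 = number * 0.5f;
    float y  = number;
    int32_t i;
    memcpy(&i, &y, sizeof i);        // reinterpret the float's bits as an int
    i = 0x5f3759df - (i >> 1);       // magic-constant initial guess
    memcpy(&y, &i, sizeof y);
    y = y * (1.5f - x2 * y * y);     // one Newton-Raphson step
    return y;
}
```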

Again, the compiler has vectorised this code to use 8 wide AVX ops.

From a performance POV the two versions are almost identical (the Quake3 version has one additional instruction).

The big difference, though, is accuracy: the Quake 3 version is not brilliant on that front.

If you want absolute performance, use vrsqrtps. It's one instruction, and it has more accuracy than the Quake3 routine.

#include <immintrin.h>

void test_rsqrt(float* __restrict a, const float* __restrict b) {
    // note: _mm256_load_ps/_mm256_store_ps require 32-byte aligned pointers
    for(int i = 0; i < 256; i += 8) {
        __m256 A = _mm256_load_ps(b + i);
        __m256 B = _mm256_rsqrt_ps(A);
        _mm256_store_ps(a + i, B);
    }
}
robthebloke
  • Related: [Is it still worth using the Quake fast inverse square root algorithm nowadays on x86-64?](https://stackoverflow.com/q/71608126) (no, obsoleted on x86 by `rsqrtps` as you mention, and I think some other ISAs also have a fast approximate reciprocal square root. Ironic because SSE2 makes it more efficient to do integer operations on FP bit-patterns than with legacy x87 (where store / reload is required), although compilers usually don't take advantage and will still `movd eax, xmm0` and back for integer ops.) – Peter Cordes Jul 06 '23 at 03:21
  • It only takes one Newton-Raphson iteration to refine the ~12-bit `vrsqrtp` result to about 24-bit, so that's all `-ffast-math` does. [Fast vectorized rsqrt and reciprocal with SSE/AVX depending on precision](https://stackoverflow.com/q/31555260) - (Basically full precision for `float`, but IIRC the max error is at least 1ulp, maybe higher, vs. 0.5 ulp for `vsqrtps`. But if you were also doing 1.0/sqrt(x), that would have two steps, two sources of rounding error.) – Peter Cordes Jul 06 '23 at 03:27
  • @bigTuna1337: What compiler are you using? clang auto-vectorizes at `-O2`. GCC auto-vectorizes at `-O3` (except for very simple and cheap cases at `-O2`) . Also what CPU? On Intel Skylake for example, max (exact) square root throughput is achievable with 128-bit vectors: the CPU can use both halves of the 256-bit SIMD square-root unit separately or something. (https://uops.info/) – Peter Cordes Jul 06 '23 at 12:12
  • @bigTuna1337: In that paper you linked from 2012, the newest CPU they used was a Sandybridge i7-2640M. That doesn't support FMA instructions (so a Newton iteration is more costly), but also has much worse throughput for `vsqrtps` and `vdivps` than newer CPUs like Skylake. They also might not have used `-march=native` to allow AVX. Or the compiler couldn't use 256-bit vectors anyway because without AVX2, the integer operations aren't available for that width. – Peter Cordes Jul 06 '23 at 12:19
  • @bigTuna1337: At `-O2` you're not benchmarking memory bandwidth, you're probably benchmarking nothing at all, with the work optimized away. A factor of 2700 speedup from `-O2` vs. `-O1` is not plausible without optimizing away huge amounts of work. Auto-vectorization can't explain that. The times probably don't scale with the problem size anymore either. See [Idiomatic way of performance evaluation?](https://stackoverflow.com/q/60291987) which mentions some tricks to make sure an array of results isn't optimized away, e.g. reading `arr[argc]`. – Peter Cordes Jul 06 '23 at 12:21
  • @bigTuna1337: If you haven't installed actual GCC on a Mac, the `gcc` command is actually Clang (check with `gcc --version`), which auto-vectorizes at `-O2`. Clang's `-O3` does sometimes make different choices, but it's not surprising in this simple case that it didn't find anything to do differently. – Peter Cordes Jul 06 '23 at 17:57
  • @bigTuna1337: Yes, `sqrtps` and `divps` were slower *relative to* other instructions, in throughput cost, on Sandybridge. e.g. 1 per 10 cycles for `vdivps xmm` or `vsqrtps xmm` on Sandybridge, and those both compete for the same execution unit. vs. still having 1/clock `vrsqrtps xmm` and fully pipelined vector-integer stuff. So yes, div and sqrt (and thus `1.0f/sqrtf`) were more expensive relative to any multi-instruction alternatives, which mostly stayed about the same speed on Sandybridge vs. Skylake, except for FMA being possible which helps for the Newton iteration if you do one. – Peter Cordes Jul 06 '23 at 18:01
  • @bigTuna1337: The BSc thesis you found claiming Quake's reciprocal square root is still useful is ignoring the existence of hardware `rsqrt` which makes it totally obsolete. If you didn't have that (e.g. in a portable program without specializations for different ISAs), yes it could be useful, but if you're manually vectorizing with SSE you definitely do have `rsqrtps`. I don't know of a good portable library or language that exposes fast approximate math functions with ISA-specific optimizations, so you either have to optimize yourself or are at the mercy of stuff like `gcc -ffast-math` – Peter Cordes Jul 06 '23 at 18:07
  • @bigTuna1337: You should update your question with the new `-O2` numbers so it's easier to compare them. Also, if you didn't shrink the array and introduce an outer loop like Craig and I discussed in comments on the question, your faster versions might be bottlenecking on memory copying, reading the old array and writing the new one. With just `vrsqrtps xmm` without a Newton iteration (similar precision to Q3 with a Newton iter, IIRC?), you can probably exceed L3 bandwidth, and certainly DRAM. And if you're getting page-faults from the first write to newly malloced memory, that's very slow. – Peter Cordes Jul 06 '23 at 18:10