When running the attached sample program, the function `tan` appears to be about twice as slow in context as it is in isolation. This is the output on my machine:

justtan(): ~16.062430 ns/iter
notan():   ~30.852820 ns/iter
withtan(): ~60.703100 ns/iter
empty():   ~0.355270 ns/iter

I would expect withtan() to be ~45ns or lower, given that it is a combination of justtan and notan.

I'm running macOS 11.5.2 with an Intel i7-4980HQ CPU. My `cc --version` is Apple clang version 13.0.0 (clang-1300.0.29.3). I've checked that the disassembly for `withtan` and `notan` is identical except for the call to `tan`, and that clang is autovectorizing the loops with VEX instructions. I've also checked in a debugger that the version of `tan` called at runtime uses VEX instructions as well, avoiding the SSE/AVX transition penalty.

I compiled and ran the program in a Linux VM and got a similar result (in the debugger, `tan` also uses AVX/VEX instructions). Additionally, I ran it through cachegrind and found essentially no L1 cache misses (0.00%) for any of the functions; interestingly, when running under cachegrind, all the times do add up correctly.

This is how I'm compiling and running the program:

cc -Wall -O3 -mavx2 -o main main.c && ./main

Here is main.c:

#include <stdint.h>
#include <stdio.h>
#include <time.h>
#include <math.h>

// ---------------------------------------------------------------------
// -------------------- benchmarking harness ---------------------------
int64_t ITERS = 100000000;

double black_box(double x) {
    asm("" : : "r"(&x) : "memory");
    return x;
}

uint64_t nanosec() {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

double bench(double (*f)()) {
    // Warmup
    for (int i = 0; i < ITERS / 10; i++) {
        black_box(f());
    }

    uint64_t start = nanosec();
    for (int i = 0; i < ITERS; i++) {
        black_box(f());
    }
    uint64_t end = nanosec();

    return (double)(end - start) / (double)ITERS;
}
// -------------------- end benchmarking harness -----------------------
// ---------------------------------------------------------------------

#define LEN 32
#define SUM_LEN 24

double VALS[LEN] = {
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
};

__attribute__ ((noinline))
double sum24(double* ptr) {
    double sum = 0.;
    for (int i = 0; i < SUM_LEN; i++) {
        sum += ptr[i];
    }
    return sum;
}

__attribute__ ((noinline))
double withtan() {
    double a = sum24(VALS);
    double b = sum24(VALS + 1);
    double c = sum24(VALS + 2);
    double d = sum24(VALS + 3);

    return tan(a + b + c + d);
}

__attribute__ ((noinline))
double notan() {
    double a = sum24(VALS);
    double b = sum24(VALS + 1);
    double c = sum24(VALS + 2);
    double d = sum24(VALS + 3);

    return a + b + c + d;
}

__attribute__ ((noinline))
double justtan() {
    return tan(black_box(96));
}

__attribute__ ((noinline))
double empty() {
    return 1.;
}

int main() {
    printf("justtan(): ~%f ns/iter\n", bench(justtan));
    printf("notan():   ~%f ns/iter\n", bench(notan));
    printf("withtan(): ~%f ns/iter\n", bench(withtan));
    printf("empty():   ~%f ns/iter\n", bench(empty));
}

Why is tan slower in context than when isolated?

noshwins
  • Start by getting rid of cache and other effects: duplicate the whole of your main block. Anything you do the first time is slower. – Victor Eijkhout Nov 29 '21 at 13:59
  • This is down to your measuring technique. My all-time favourite result due to measuring: https://www.nature.com/articles/nature.2011.9393 – Bathsheba Nov 29 '21 at 14:01
  • Not related but it is funny as vectorization makes the code slower – 0___________ Nov 29 '21 at 14:02
  • @Bathsheba: I take it you don't agree with the results? The article doesn't exactly refute them. – Robert Harvey Nov 29 '21 at 14:04
  • @VictorEijkhout I've flipped the order of tests run in the example now and the times still do not add up (updated code and output). I've also tested this previously with multiple executables and only one bench per executable, and the result was the same. I figure the warmup should take care of cache effects. – noshwins Nov 29 '21 at 14:05
  • @RobertHarvey: [Its refutation and the update of the paper are well known.](https://en.wikipedia.org/wiki/Faster-than-light_neutrino_anomaly) – Eric Postpischil Nov 29 '21 at 14:05
  • @EricPostpischil: I am happy to claim ignorance since, like most people, I'm not all-knowing or all-seeing. This isn't exactly on-topic anyway. What were we talking about? – Robert Harvey Nov 29 '21 at 14:07
  • @Bathsheba what is incorrect about the measurement? FWIW the disassembled machine code looks correct for all functions; it doesn't seem to be any kind of optimization that is removing code. – noshwins Nov 29 '21 at 14:08
  • Perhaps remove the neutrinos? – Robert Harvey Nov 29 '21 at 14:09
  • @RobertHarvey Good luck finding all of them. ;-) – Andrew Henle Nov 29 '21 at 14:13
  • @VictorEijkhout There is also no difference if I duplicate the body of the main block. – noshwins Nov 29 '21 at 14:33
  • Could be that the longer dependency chain created by sums -> tan is too long for out-of-order execution to overlap independent iterations as well as with either of the shorter two? Like in [Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths](https://stackoverflow.com/q/51986046) for the no-lfence version, with longer dep chains it has worse throughput. – Peter Cordes Nov 29 '21 at 15:27
  • And BTW, adding up costs doesn't work in general on superscalar OoO exec CPUs, over small scales like this. ([What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?](https://stackoverflow.com/q/51607391)). I think you're expecting `withtan()` to have the same throughput cost as `justtan()` plus `notan()`, but you don't say so, just seem to implicitly assume it. Usually putting two different things in a loop lets them overlap when they're independent, so often the combined cost is equal or lower. – Peter Cordes Nov 29 '21 at 15:35
  • `ts.tv_sec * 1000000000` risks 32-bit overflow. Use `ts.tv_sec * 1000000000ull` and enable all warnings. – chux - Reinstate Monica Nov 29 '21 at 15:45
  • @noshwins Why `asm("" : : "r"(&x) : "memory");`? – chux - Reinstate Monica Nov 29 '21 at 16:02
  • @PeterCordes I would expect that any dependency chain issue would make the combined `withtan` function faster rather than slower? That would give ~45ns or *less* rather than the 60 I'm seeing, given that the functions could interleave better? I've updated the question to clarify I expect ~45ns or less. To test this I commented out `b, c, d` so that only the calculation for `a` was left, and the numbers were ~15/7/30ns for justtan/notan/withtan. This leaves a ~8ns missing delta; which is shorter than the original question's 15ns, even though I would expect the function to interleave worse. – noshwins Nov 29 '21 at 16:06
  • @chux-ReinstateMonica I've updated the question to use -Wall and `1000000000ull`, without any difference to the result. The asm call is to avoid the optimizer from seeing through `black_box`, I took it from here: https://doc.rust-lang.org/stable/src/core/hint.rs.html#163 Checking the assembly, it seems to work as expected. – noshwins Nov 29 '21 at 16:09
  • @chux-ReinstateMonica: `asm("" : : "r"(&x) : "memory");` tells the compiler to pass the address of `x` to the assembly code (which forces the compiler to put an up-to-date instance of `x` in memory, in case it has been keeping it a register) and tells the compiler that the assembly code may change anything in memory. Currently, Clang is, I think, blind to the fact that the assembly code in `""` is empty and hence neither uses the address nor actually changes memory… – Eric Postpischil Nov 29 '21 at 16:13
  • @noshwins I would think `return (volatile double) x;` would work as well and be more portable than `asm("" : : "r"(&x) : "memory"); return x;`. – chux - Reinstate Monica Nov 29 '21 at 16:14
  • … So this prevents the compiler from optimizing away `black_box(f())`. Were it not for the `asm`, that would have no observable effect, and the compiler could remove it and the loop it is in. – Eric Postpischil Nov 29 '21 at 16:14
  • @chux-ReinstateMonica: Values cannot be volatile; only objects can be volatile. Qualifiers at the top level in casts have no effect. You could make the parameter volatile or stash it in a volatile object and then retrieve it. – Eric Postpischil Nov 29 '21 at 16:15
  • @EricPostpischil Perhaps `double black_box(volatile double x) { return x; }`? – chux - Reinstate Monica Nov 29 '21 at 16:18
  • @EricPostpischil: re [this comment](https://stackoverflow.com/questions/70155760/why-is-tan-slower-in-context-than-when-isolated?noredirect=1#comment124019522_70155760) - "which" is the wrong word. Materializing the pointed-to value in memory also requires the `"memory"` clobber; just the `"r"(&x)` alone doesn't do that. [How can I indicate that the memory \*pointed\* to by an inline ASM argument may be used?](https://stackoverflow.com/q/56432259). (The only diff between assigning to `volatile double sink = x;` is the reload can opt away if the `black_box()` ret val is unused.) – Peter Cordes Nov 29 '21 at 21:31
  • @chux-ReinstateMonica: That `asm` statement using a pointer in a register is portable to all ISAs (unlike for example `asm volatile("" : "+x" (x))` on x86 to avoid the store/reload while still forcing the compiler to materialize a value in an xmm register and forget about the value). Although it does require GNU C extensions. In this particular use-case, it may be important to have an `asm` statement with a `"memory"` clobber to force the compiler to forget about `VALS[]` and not hoist that sum out of the loop. So that effect on things other than `x` may be significant. – Peter Cordes Nov 29 '21 at 21:43

1 Answer


Incidentally, the same behaviour is also seen on an M1 MacBook, where the numbers are 4 vs 14 vs 28 vs 0.3 ns/iter.

Using `withtan = tan(black_box(96)) + a + b + c + d` takes 20 ns/iter, which hints to me that `tan(a+b+c+d)` creates a dependency the out-of-order (OoO) unit can't break, whereas `sum_a`, `sum_b`, `sum_c`, `sum_d`, and `tan(96)` are all independent tasks, which can be run out of order.

Part of the issue must also be that `tan` is long enough that the OoO unit can't peek into the next independent iteration.

Aki Suihkonen
  • This is practically what Peter Cordes said in comments. – Aki Suihkonen Nov 29 '21 at 15:57
  • Naturally I can't make predictions of Intel microarchitecture specifics from Arm M1, but there are not that many algorithms for OoO. – Aki Suihkonen Nov 29 '21 at 16:02
  • I removed `b`, `c`, and `d` to test this; if it was due to dependency I would expect that code with just `tan(a)` would have a strictly worse dependency chain. My output times were 15/7/30, and as 30-15-7=22, there is 8ns extra in `withtan` than each individual function time summed up. This is strictly better than the result in the original question with 4 variables that has an unaccounted 15ns. – noshwins Nov 29 '21 at 16:30
  • I think the numbers 15/7/30 are still consistent with the theory that `a+a+a+...` can OoO execute independent loops, `a+b+c+d...` can also OoO execute and so does `tan+tan+tan...`. But `tan(whatever)` does not, because `tan` is too long. That hypothesis can of course be tested by using a shorter function, such as sqrt() instead (with -ffast-math) -- there the timings do add up, suggesting that the next iteration of `a+b+...` can start while `sqrt` is still executing. – Aki Suihkonen Nov 29 '21 at 16:40
  • Actually with `sqrt(a+b+c+d)` timings 1.1 / 13.9 / 14.1 not only add up, but add "short" (if that's even a thing), with some expected performance improvement. – Aki Suihkonen Nov 29 '21 at 16:48
  • sqrt isn't just "a shorter function", it inlines to a single instruction (with `-fno-math-errno`). On M1, it's very heavily pipelined, I think. At least integer divide on M1 is apparently pipelined with 2c throughput (vs. 6c or 10c on Ice Lake). SKL/ICL `sqrtsd` is 4.5c throughput. So about one per nanosecond on a 4.5 GHz machine. Your 1.1 ns on your M1 (at some lower clock speed) shows it has better per-clock throughput for `double` sqrt, but presumably still bottlenecking on that rather than front-end call/return and store overhead. – Peter Cordes Nov 29 '21 at 21:56