4

This has me baffled/intrigued, why is this code

[see assembly]

void maxArray(double* x, double* y) {
    for (int i = 0; i < 65536; i++) {
        x[i] = ((y[i] > x[i]) ? y[i] : x[i]);
    }
}

...faster than this code?

[see assembly]

void maxArray(double* x, double* y) {
    for (int i = 0; i < 65536; i++) {
        if (y[i] > x[i]) x[i] = y[i];
    }
}

and for the record the resulting assembly in the first one is identical to the expanded version:

inline double fn(double a, double b) {
    if (a > b) {
        return a;
    } else {
        return b;
    }
}
void maxArray(double* x, double* y) {
    for (int i = 0; i < 65536; i++) {
        x[i] = fn(y[i], x[i]);
    }
}

[see assembly]

I get the difference. The first one is setting x[i] to a condition, and the middle is conditionally setting x[i]. Both have conditions though, so both have branches? Is it because the expanded if statement for the latter is optimized into the vector assembly max command, and the former is, for some reason not recognized as a max function?

gcc 10.3 x86_64 -Ofast -march=native

Marco Bonelli
  • 63,369
  • 21
  • 118
  • 128
Alasdair
  • 13,348
  • 18
  • 82
  • 138
  • You can see the exact compiler and flags if you click on any of the links. – Alasdair Apr 16 '21 at 10:43
  • Or you could include that information in the question so that clicking links is not necessary. Besides, the links did not load for me for some reason. – klutt Apr 16 '21 at 10:45
  • How many iterations - run over what time span? Was the array content always the same? It must be at least 1 second to be useful. – Weather Vane Apr 16 '21 at 10:47
  • 2
    Looks like this is critically dependent on the target CPU. You've used `-march=native`, which is whatever random CPU is used to host godbolt.org – MSalters Apr 16 '21 at 10:47
  • Did you also consider `x[i] = std::max(x[i], y[i])` in your performance evaluation? – ypnos Apr 16 '21 at 10:49
  • gcc x86_64 latest -Ofast -march=native – Alasdair Apr 16 '21 at 10:50
  • 1
    In the second variant, either the compiler didn't recognize the source code as an implementation of a maximum calculation, or it tried to more closely adhere to the source. In the first and third variant you have assignments in both branches, in the second variant you have an assignment in one branch only. – Bodo Apr 16 '21 at 11:07
  • 2
    I suspect the compiler is reluctant to generate a store when you didn't ask for one. Suppose for instance that the `x` array is in read-only memory, but that you know that its entries are all greater than the entries in `y`. The second code, read strictly, ought to work (and do nothing), because the test is never true and `x[i]` is never assigned to. But the first or third versions would crash. As such, the compiler should not optimize #2 into the equivalent of #1, even if it knew #1 would be faster. – Nate Eldredge Apr 16 '21 at 20:25
  • 3
    @NateEldredge: Yes, exactly, C compilers must not invent writes, for thread-safety if nothing else. (So even checking for at least one element to be updated in an aligned vector wouldn't make it ok to store them all). [Is it possible to use SIMD instruction for replace?](https://stackoverflow.com/a/48285594), and see also this ICC bug: [Crash with icc: can the compiler invent writes where none existed in the abstract machine?](https://stackoverflow.com/q/54524947) – Peter Cordes Apr 16 '21 at 21:09
  • @Alasdair: What CPU did you actually test this on? Does it have AVX-512 like the servers Godbolt runs on? Were you using the same GCC (trunk) nightly build on your machine that you linked to on Godbolt (or did they make the same asm)? (Although even without AVX-512, GCC makes similar asm using a `vmaskmovpd` masked store). Also, how much faster, on what array size? – Peter Cordes Apr 16 '21 at 21:23

1 Answers1

5

It should be clear from the generated assembly code. In the first case you get:

# x[i] = ((y[i] > x[i]) ? y[i] : x[i])

vmovupd ymm1, YMMWORD PTR [rsi+rax
vmaxpd  ymm0, ymm1, YMMWORD PTR [rdi+rax]
vmovupd YMMWORD PTR [rdi+rax], ymm0

While in the second case you get:

# if (y[i] > x[i]) x[i] = y[i];

vmovupd  ymm0, YMMWORD PTR [rsi+rax]
vcmppd   k1, ymm0, YMMWORD PTR [rdi+rax], 14
kortestb k1, k1
je       .L3
vmovupd  YMMWORD PTR [rdi+rax]{k1}, ymm0

As you can see in the first snippet, the compiler used the VMAXPD specialized instruction to compute the maximum of two double precision floating point values, without branching. In the second snippet though, there is a compare (VCMPPD) followed by a test (KORTESTB) and a branch (JE).

Marco Bonelli
  • 63,369
  • 21
  • 118
  • 128
  • Interesting that the compiler didn't get that. Clang produces different code for the all three, with the last (function version) being the most optimized. – Alasdair Apr 16 '21 at 10:55
  • @Alasdair yeah, when you are down to this level of optimization it's quite hard to get the right code, even different versions of the same compiler could change the generated code. – Marco Bonelli Apr 16 '21 at 10:56
  • @Alasdair A more correct name for optimizers would be "improvers". They don't find the optimal solution. – klutt Apr 16 '21 at 11:09
  • @klutt but thats... what optimization means. improvement. otherwise they'd be called optimalers :D – Shark Apr 16 '21 at 11:15
  • @Shark True, but many seem to think that it's "strange" as soon as the optimizer misses an opportunity to optimize ;) – klutt Apr 16 '21 at 11:17
  • it's likely related to one of the most upvoted questions on SO. https://stackoverflow.com/q/11227809/4989451 – tstanisl Apr 16 '21 at 13:08
  • 1
    @tstanisl no, that's unrelated, the problem there is the data, not the branch, which is always present, the code is always the same. Here instead the problem is that in one case the branch is there and in one case it is not, the code is different. Don't confuse the two! – Marco Bonelli Apr 16 '21 at 13:19
  • @MarcoBonelli: It's hardly that simple! Both ways are vectorized (with AVX512 if you use `-march=native` on Godbolt's server, could be totally different on the OP's actual machine); the branch only branches if *no* elements are to be stored from the whole vector. Note the masked store to `[rdi+rax]{k1}`. (Probably because the store can be slow with an all-zero mask, especially in a copy-on-write situation where it may need a microcode assist to not actually dirty the page.) – Peter Cordes Apr 16 '21 at 21:01
  • 3
    @Alasdair: it can't optimize to just `max` + vector store because it's not allowed to invent writes. [Crash with icc: can the compiler invent writes where none existed in the abstract machine?](https://stackoverflow.com/q/54524947) and [AVX-512 and Branching](https://stackoverflow.com/q/47481762) – Peter Cordes Apr 16 '21 at 21:28