
I want to apply a polynomial of small degree (2-5) to a vector whose length can be anywhere between 50 and 3000, and do this as efficiently as possible. For example, take the function f(x) = (1+x^2)^3 when x > 3, and 0 when x <= 3. Such a function would be executed 100k times on vectors of double elements; the size of each vector can be anything between 50 and 3000.

One idea would be to use Eigen with an Eigen::ArrayXd v, and then simply apply a functor:

v.unaryExpr([&](double x) { return x > 3 ? std::pow((1+x*x), 3.00) : 0.00; });

Trying with both GCC 9 and GCC 10, I saw that this loop is not being vectorized. I vectorized it manually, only to see that the gain was much smaller than I expected (1.5x). I also replaced the conditional with logical AND instructions, basically executing both branches and zeroing out the result when x <= 3. I presume that the gain came mostly from the lack of branch misprediction.
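The branchless approach was roughly like the following sketch (not the exact code, which isn't shown in the question; it assumes FMA is available alongside AVX2):

#include <immintrin.h>
#include <stddef.h>

// Branchless AVX2 version: compute (1+x^2)^3 unconditionally, then zero
// out lanes where x <= 3 by ANDing with the compare mask.
void apply_poly(double *arr, size_t n) {
    const __m256d one   = _mm256_set1_pd(1.0);
    const __m256d three = _mm256_set1_pd(3.0);
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m256d x    = _mm256_loadu_pd(arr + i);
        __m256d sq1  = _mm256_fmadd_pd(x, x, one);          // 1 + x*x (needs FMA)
        __m256d cube = _mm256_mul_pd(_mm256_mul_pd(sq1, sq1), sq1);
        __m256d mask = _mm256_cmp_pd(x, three, _CMP_GT_OQ); // all-ones where x > 3
        _mm256_storeu_pd(arr + i, _mm256_and_pd(cube, mask));
    }
    for (; i < n; i++) {                                    // scalar cleanup
        double x = arr[i], s = 1.0 + x * x;
        arr[i] = x > 3 ? s * s * s : 0.0;
    }
}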

Some considerations

There are multiple factors at play. First of all, there are RAW dependencies in my code (using intrinsics); I am not sure how this affects the computation. I wrote my code with AVX2, so I was expecting a 4x gain. I presume that this plays a role, but I cannot be sure, as the CPU has out-of-order processing. Another problem is that I am unsure whether the performance of the loop I am trying to write is bound by memory bandwidth.

Question

How can I determine whether memory bandwidth or pipeline hazards are affecting the implementation of this loop? Where can I learn techniques to better vectorize this loop? Are there good tools for this in Eigen, MSVC or Linux? I am using an AMD CPU as opposed to Intel.

Gabe
  • Under Linux, does `perf stat` work, and does `perf list` show you a bunch of perf counter events you could use on your AMD CPU? But anyway, the obvious problem is using `pow` instead of just cubing manually. Don't call `pow` for small integer exponents; your compiler might not turn it back into 2x `vmulps` (or `vmulpd`, since you say you were only expecting a 4x speedup with AVX?). Also make sure you enable FMA, not just AVX2, e.g. `-O3 -march=native`, and possibly `-ffast-math` to see if that helps. – Peter Cordes Aug 14 '20 at 07:52
  • @PeterCordes Thanks for the suggestion: I will try perf. I did the cubing manually. I did use -O3 -march=native. I'm looking for general ways to treat such problems. – Gabe Aug 14 '20 at 08:07

1 Answer


You can fix the GCC missed optimization with -fno-trapping-math, which should really be the default because -ftrapping-math doesn't even fully work. It auto-vectorizes just fine with that option: https://godbolt.org/z/zfKjjq.

#include <stdlib.h>

void foo(double *arr, size_t n) {
    for (size_t i=0 ; i<n ; i++){
        double &tmp = arr[i];                 // update in place
        double sqrp1 = 1.0 + tmp*tmp;         // unconditional square
        tmp = tmp>3 ? sqrp1*sqrp1*sqrp1 : 0;  // cube only on the taken side
    }
}
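For example (foo.cpp is a hypothetical file name holding the function above; the flags are the ones discussed here):

g++ -O3 -march=native -fno-trapping-math foo.cpp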

Without -fno-trapping-math, GCC avoids the multiplies in one side of the ternary because they could raise FP exceptions that the C++ abstract machine wouldn't.

You'd hope that writing it with the cubing outside the ternary would let GCC auto-vectorize, because then none of the FP math operations are conditional in the source. But it doesn't actually help: https://godbolt.org/z/c7Ms9G. GCC's default -ftrapping-math still decides to branch on the input to avoid all the FP computation, potentially not raising an overflow (to infinity) exception that the C++ abstract machine would have raised, or an invalid exception if the input was NaN. This is the kind of thing I meant about -ftrapping-math not working. (Related: How to force GCC to assume that a floating-point expression is non-negative?)
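That variant looks something like this (a sketch; the Godbolt link above has the actual code):

#include <stdlib.h>

void foo2(double *arr, size_t n) {
    for (size_t i=0 ; i<n ; i++){
        double tmp = arr[i];
        double sqrp1 = 1.0 + tmp*tmp;
        double cubed = sqrp1*sqrp1*sqrp1;  // all FP math is unconditional
        arr[i] = tmp>3 ? cubed : 0;        // the ternary only selects the result
    }
}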


Clang also has no problem: https://godbolt.org/z/KvM9fh. I'd suggest using clang -O3 -march=native -ffp-contract=fast to get FMAs across statements when FMA is available.

(In this case, -ffp-contract=on is sufficient to contract 1.0 + tmp*tmp within that one expression, but not across statements, if you need to avoid that for Kahan summation, for example. The clang default is apparently -ffp-contract=off, giving separate mulpd and addpd.)


Of course you'll want to avoid std::pow with a small integer exponent; compilers might not optimize that into just 2 multiplies, and might instead call the full pow library function.
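To make that concrete, a sketch of the two versions (poly_mul and poly_pow are hypothetical names):

#include <cmath>

inline double poly_mul(double x) {
    double s = 1.0 + x*x;
    return s*s*s;                       // just two extra multiplies
}

inline double poly_pow(double x) {
    return std::pow(1.0 + x*x, 3.0);    // may compile to a call to pow
}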

Peter Cordes
  • This is very interesting: clang seems to do unrolling with a fixed factor of 4. Does this technique have a name? I'm wondering if it helps with RAW pipeline hazards? – Gabe Aug 14 '20 at 09:18
  • @Gabe: Hiding latency by interleaving multiple dependency chains of work is called "software pipelining". It's not really necessary here; out-of-order execution by modern x86 CPUs can find that instruction-level parallelism across loop iterations in GCC's fully rolled up version, and have large enough ROB + scheduler to hide the critical path latency of FMA (4) + MUL (4) + MUL (4) + AND(1) = 13 cycles (plus load latency of maybe 6). (for Skylake). Remember that there's no loop-carried dependency (except `i`) so the work is independent. – Peter Cordes Aug 14 '20 at 17:35
  • But unrolling typically helps some for long-running loops, and reduces loop overhead. It can also help ramp up to full throughput slightly faster at startup or after stalls, getting more loads into the pipeline sooner. But where your array lengths are only 50 to 3000, better measure, especially if they're not always a multiple of 4. Clang uses scalar cleanup after the by-4 loop, so you can have up to 15 scalar iterations, because there's no rolled-up SIMD or 128-bit SIMD cleanup to get closer (see the sketch after these comments). – Peter Cordes Aug 14 '20 at 17:38
  • Thanks for the explanation. I think I will try to get better acquainted with AMD's uProf. I'm interested in knowing how far from optimum (max possible FLOPS) I am. – Gabe Aug 14 '20 at 20:31
  • Can you recommend a book/tutorial that offers a more gentle introduction into these topics, with examples and explanations? – Gabe Aug 14 '20 at 20:32
  • @Gabe: Agner Fog's optimization guides are very good; he has a C++ and an assembly guide. https://agner.org/optimize/. I haven't read them for a long time; mostly I just refer to his microarch PDF for low-level details on CPU internals. See also other performance info linked from https://stackoverflow.com/tags/x86/info – Peter Cordes Aug 14 '20 at 20:37
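For reference, the 128-bit cleanup idea from the comment above could look like this sketch (hypothetical helper name; it processes leftover pairs with SSE2 before falling back to scalar):

#include <immintrin.h>
#include <stddef.h>

// Handle leftover pairs with 128-bit SIMD so at most one element
// needs the scalar loop. `i` is where the wider SIMD loop stopped.
static size_t cleanup_sse(double *arr, size_t i, size_t n) {
    const __m128d one   = _mm_set1_pd(1.0);
    const __m128d three = _mm_set1_pd(3.0);
    for (; i + 2 <= n; i += 2) {
        __m128d x    = _mm_loadu_pd(arr + i);
        __m128d s    = _mm_add_pd(one, _mm_mul_pd(x, x));
        __m128d c    = _mm_mul_pd(_mm_mul_pd(s, s), s);
        __m128d mask = _mm_cmpgt_pd(x, three);   // all-ones where x > 3
        _mm_storeu_pd(arr + i, _mm_and_pd(c, mask));
    }
    return i;  // caller finishes at most one remaining element scalar
}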