
I have fairly large (degree 20-40), sometimes slowly converging, floating-point polynomials. I'd like to optimize their evaluation using SIMD (SSE2, AVX1, AVX-512). I need both float32 and float64 (double) solutions.

The coefficient values are constants known in advance; the value of X at which to evaluate the polynomial is passed as a function argument.

Important note: I have only a single input X per call to my function, so I can't do vertical optimization by computing the polynomial for 8-16 Xs at the same time. That means I need some horizontal optimization within the evaluation for a single X.

I created a related question that helps me compute the powers of X (e.g. X^1, X^2, ..., X^8) needed for the SIMD evaluation.

Obviously SIMD should only be used above some threshold of polynomial degree; for fairly small polynomials a straightforward generic (non-SIMD) method based on Horner's rule (or Estrin's scheme) can be used, like here. The SIMD width (128, 256 or 512 bits) should also be chosen based on the polynomial degree.
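
For reference, a minimal scalar Horner's-rule baseline of the kind meant by "straightforward generic method" might look like the sketch below. This is my own sketch, not the TIO reference code; like the SIMD code further down, it assumes the coefficients are stored constant-term-first.

template <size_t S, size_t Cnt>
inline float EvalPolyScalarF32(float x, std::array<float, Cnt> const & A) {
    // Horner's rule: ((A[S-1]*x + A[S-2])*x + ...)*x + A[0]
    float r = A[S - 1];
    for (size_t i = S - 1; i > 0; --i)
        r = r * x + A[i - 1];   // becomes an FMA with -mfma and FP contraction enabled
    return r;
}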

Below I implemented an AVX (256-bit, float32) variant using a kind of modified Horner's method suited for SIMD (multiplying by x^8 instead of x^1). Credits to @PeterCordes for the fast horizontal-sum tutorial. Click on the try-it-online link; it contains a larger program that also has a simple reference evaluation for comparison, plus time measurements:

Try it online!

#include <immintrin.h>
#include <array>
#include <cstddef>

// Processes coefficients A[I .. I+7] (constant-term-first order) as one 256-bit chunk,
// recursing from the highest chunk down and multiplying the running result by x^8 at
// each step. Assumes A is zero-padded to a multiple of 8 floats and 32-byte aligned
// (required by _mm256_load_ps).
template <size_t S, size_t I, typename MT = __m256, size_t Cnt>
inline MT EvalPoly8xF32Helper(MT xhi,
        std::array<float, Cnt> const & A, MT r = _mm256_undefined_ps()) {
    size_t constexpr K = 8;
    if constexpr(I + K >= S)
        r = _mm256_load_ps(&A[I]);   // highest chunk: just load it
    else {
        // r = r * x^8 + A[I..I+7]
        #ifdef __FMA__
            r = _mm256_fmadd_ps(r, xhi, _mm256_load_ps(&A[I]));
        #else
            r = _mm256_add_ps(_mm256_mul_ps(r, xhi),
                _mm256_load_ps(&A[I]));
        #endif
    }
    if constexpr(I < K)
        return r;
    else
        return EvalPoly8xF32Helper<S, I - K>(xhi, A, r);
}

// Horizontal sum of 4 floats (the shuffle pattern from Peter Cordes' answer mentioned above).
inline float _mm_fast_hsum_ps(__m128 v) {
    __m128 shuf = _mm_movehdup_ps(v);         // (v1, v1, v3, v3)
    __m128 sums = _mm_add_ps(v, shuf);        // (v0+v1, _, v2+v3, _)
    shuf        = _mm_movehl_ps(shuf, sums);  // bring (v2+v3) down to element 0
    sums        = _mm_add_ss(sums, shuf);     // v0+v1 + v2+v3
    return        _mm_cvtss_f32(sums);
}

// Evaluates a polynomial with S coefficients (degree S-1) at x.
template <size_t S, size_t Cnt>
inline float EvalPoly8xF32(
        float x, std::array<float, Cnt> const & A) {
    auto constexpr K = 8;
    auto const x2 = x * x, x4 = x2 * x2, x8 = x4 * x4, x3 = x2 * x;
    auto const powsx = _mm256_setr_ps(          // (1, x, x^2, ..., x^7)
        1, x, x2, x3, x4, x4 * x, x4 * x2, x4 * x3);
    auto r0 = EvalPoly8xF32Helper<S, (S - 1) / K * K>(
        _mm256_set1_ps(x8), A);                 // SIMD Horner in steps of x^8
    r0 = _mm256_mul_ps(r0, powsx);              // scale lane j by x^j
    return _mm_fast_hsum_ps(_mm_add_ps(         // add the two 128-bit halves, then hsum
        _mm256_castps256_ps128(r0), _mm256_extractf128_ps(r0, 1)));
}
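
For completeness, here is a usage sketch (my own illustration, not from the TIO program): the coefficient array has to be zero-padded up to a multiple of 8 floats and 32-byte aligned, because the helper uses aligned `_mm256_load_ps` loads. The names `coeffs` and `Eval` are hypothetical.

// Degree-32 polynomial: 33 real coefficients (constant term first),
// zero-padded to 40 floats and 32-byte aligned for the aligned loads.
alignas(32) static std::array<float, 40> const coeffs = {
    /* c0, c1, ..., c32, followed by zeros */
};

inline float Eval(float x) {
    return EvalPoly8xF32<33>(x, coeffs);
}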

As one can see, the SIMD version gives quite a large speedup compared to the simple reference implementation. For the AVX1-256-float32, degree-32 case it gives around a 4.5x speedup (for degree 16 it gives a 1.8x speedup, which is also good)! Obviously, even just using FMA instructions inside the reference implementation would already improve its speed noticeably.

My question is whether you can suggest a faster method of evaluating the polynomial, or some ready-made code or library, or any optimizations to my code.

The most common target CPU will be the Intel Xeon Gold 6230, which has AVX-512, so I need to optimize the code for it.
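
Since I also need a float64 solution and the target has AVX-512, here is a sketch of how the same scheme could carry over to 512-bit double precision. This is my own untested adaptation (compile with `-mavx512f`); it assumes the coefficient array is zero-padded to a multiple of 8 doubles and 64-byte aligned for `_mm512_load_pd`.

template <size_t S, size_t I, size_t Cnt>
inline __m512d EvalPoly8xF64Helper(__m512d xhi,
        std::array<double, Cnt> const & A, __m512d r = _mm512_undefined_pd()) {
    size_t constexpr K = 8;
    if constexpr(I + K >= S)
        r = _mm512_load_pd(&A[I]);   // highest chunk: just load it
    else
        r = _mm512_fmadd_pd(r, xhi, _mm512_load_pd(&A[I]));   // r = r * x^8 + A[I..I+7]
    if constexpr(I < K)
        return r;
    else
        return EvalPoly8xF64Helper<S, I - K>(xhi, A, r);
}

template <size_t S, size_t Cnt>
inline double EvalPoly8xF64(double x, std::array<double, Cnt> const & A) {
    size_t constexpr K = 8;
    double const x2 = x * x, x3 = x2 * x, x4 = x2 * x2, x8 = x4 * x4;
    __m512d const powsx = _mm512_setr_pd(    // (1, x, x^2, ..., x^7)
        1, x, x2, x3, x4, x4 * x, x4 * x2, x4 * x3);
    __m512d r = EvalPoly8xF64Helper<S, (S - 1) / K * K>(_mm512_set1_pd(x8), A);
    r = _mm512_mul_pd(r, powsx);             // scale lane j by x^j
    return _mm512_reduce_add_pd(r);          // horizontal sum (expands to shuffles + adds)
}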

  • You can try the highly optimizing ```ICC Intel``` compiler with the ```IPP library``` for CPU-level optimization, just saying. – Roshan M Jun 14 '21 at 05:27
  • If you just want to evaluate a single polynomial, are you sure this actually is performance critical? Also, for polynomials of that size you likely need to care about the order of evaluation, since you may get numerical cancellation effects. If the possible range of `X` is limited, you may try to approximate your polynomial with a lower degree polynomial. – chtz Jun 14 '21 at 07:24
  • @chtz I need to evaluate the same polynomial (same coefficients) for different `X`, and I have millions of such `X`. The catch is that these millions arrive at different times, i.e. I can't compute 10 Xs at a time; I only get a single X per call to my function. And it is a performance-critical place; this evaluation should be as fast as possible. I also can't approximate these polynomials with a lower-degree polynomial, because I need exactly the precision of this polynomial, without extra loss through approximation. These polynomials are already designed precisely for the given X range. – Arty Jun 14 '21 at 07:28
  • Are you able to factor your large polynomial into a product of 4 or 8 lower degree polynomials (maybe with additional constant offsets `p(x) = (p0(x)*p1(x)+c0)*(p2(x)*p3(x)+c1) + c2`)? Then you could evaluate those in parallel and multiply them together. – chtz Jun 15 '21 at 07:08
  • @chtz I can only factor out smaller polynomials if such factoring is possible for any polynomial. Can you suggest a link to such an algorithm, then? I build optimal polynomials for a given range, meaning they are not arbitrary: the optimal polynomial is unique and is the one whose maximal error on the range is minimal among all polynomials. We can relax the optimality requirement if one can build a higher-degree polynomial achieving both properties: 1) a small enough maximal error for the given range, 2) being easy to factor. Can you suggest an algorithm for finding such a polynomial? – Arty Jun 15 '21 at 07:52
  • It's not always possible to factor a polynomial using only real coefficients. It *is* always possible to factor into complex coefficients, for example [wolfram](https://www.wolframalpha.com/input/?i=factor+-0.4+%2B+1.25+*+x++%2B+14.9+x%5E2%2B0.75+x%5E3+%2B+1.1+x%5E4+%2B+14.9+x%5E5). This may not be helpful, although that example shows just two terms with complex coefficients. If enough of the terms are purely real, it could be a win. (But maybe only if you could JIT generate code based on factoring; you probably need a strategy that's specific to how many real vs. complex coeffs. @chtz) – Peter Cordes Jun 15 '21 at 08:50
  • @PeterCordes BTW, can you suggest any useful tutorial link about how I can JIT my code? If I have arbitrary C++ code as a function, I want to be able to compile it while the EXE program is running. I understand that I have to link the Clang library into my binary, but besides linking, what else is needed? Which libclang functions should I call to compile code from RAM? How do I then dynamically link the resulting function into my code in RAM? Are all these questions covered in some good online tutorial that you know of? – Arty Jun 15 '21 at 08:56
  • @PeterCordes I don't remember precisely, but my feeling is that ANY polynomial can be factored into real polynomials if you add a special constant to it, meaning that any polynomial can be represented as `A(x) = B(x) * C(x) + D` over the reals, for some D. I'm not sure about this fact, but it feels true to me. – Arty Jun 15 '21 at 08:59
  • Yes, *if* you took that approach at all, you'd want to use libclang to do it. You're generating one function that you'll reuse many times, so its full optimizer would be good (instead of faster optimizers that make worse machine code, like HotSpot or MS's C# JIT.) I haven't used libclang, but I'd guess that it can give you back a function pointer that you can call. You don't have to "link" anything, just dereference the function pointer that points into an executable page of memory containing the machine code. – Peter Cordes Jun 15 '21 at 08:59
  • @Arty: re: factoring into `A(x) = B(x) * C(x) + D` - perhaps, but that probably doesn't guarantee that `B` and `C` will be of similar degrees. Factoring a 16-degree polynomial into 14 x 2 or 15 x 1 wouldn't be useful. Perhaps worth looking into, though, since evaluating two or more polynomials in parallel (in separate SIMD elements of the same vector) would allow straightforward Horner's Rule / Estrin's Scheme, leaving the horizontal cleanup as shuffle / mul instead of shuffle / add. It's been a long time since undergrad math classes for me, so I'm sure there's lots of tricks I don't know. – Peter Cordes Jun 15 '21 at 09:04
  • @PeterCordes You can always factor a polynomial with real coefficients into real polynomials of degrees 1 or 2 (for every complex root, the conjugate is a root as well). I don't have an overview over numerically stable algorithms for that, though. – chtz Jun 15 '21 at 10:47
  • @chtz: Ah, yes, numerical stability might be a serious problem, e.g. defeating the purpose of using double and having so many terms. But if the factorized coefficients don't create catastrophic cancellation, doing many low-degree polynomials in parallel is very good with SIMD FMA, then you're left reducing those with a tree of multiply ops. – Peter Cordes Jun 15 '21 at 10:55
  • @chtz It would be great to find numerical methods for finding all roots of a polynomial, real and complex. I don't know of any. – Arty Jun 15 '21 at 11:01
  • @PeterCordes I'm reading uops.info; can you tell me what "UOPs" means in the table there? – Arty Jun 15 '21 at 11:01
  • micro-ops. See https://agner.org/optimize/ (microarch guide), https://www.realworldtech.com/sandy-bridge/, and maybe Intel's optimization manual (although Intel's guide is huge and sprawling, but there is a section about Skylake internals. Read Agner Fog's guide first, it covers the essential concepts to understand how x86 instructions are decoded to uops and go through the pipeline.) – Peter Cordes Jun 15 '21 at 11:06
  • @Arty I suggest to start reading here https://en.wikipedia.org/wiki/Root-finding_algorithms#Roots_of_polynomials -- I don't think there is one "optimal" algorithm for this, since performance will likely depend on the condition of the polynomials. – chtz Jun 15 '21 at 11:29
  • @PeterCordes If I factor my polynomial into quadratic polynomials, can you then suggest the fastest way to compute the result? A single quadratic is computed trivially, with just one addition and one FMA operation. But what next? I then get N numbers that I need to multiply together as fast as possible. What is the fastest way to multiply N numbers using any SIMD operations? Of course, there are probably different algorithms/code for different thresholds of N. Do you have any ready-made link to a tutorial (or code) about how to multiply N numbers as fast as possible? – Arty Jun 15 '21 at 11:36
  • @chtz BTW, thanks a lot for the polynomial-factoring suggestion. It helped me a lot and will make my code much faster, I hope. I have already started coding this solution of applying factoring to speed up the polynomial evaluation. Will post results here after some time. – Arty Jun 15 '21 at 12:17
  • @Arty: mulps performs exactly the same as addps, and is also commutative and (approximately for FP) associative. As I said, a tree of multiplies down to one vector, then horizontally reduce that one vector down to scalar, with exactly the same shuffles as for [Fastest way to do horizontal SSE vector sum (or other reduction)](https://stackoverflow.com/a/35270026). Exactly like if you needed the sum of an array. Unless you have like 8 full vectors of data, using wider vectors is just a tradeoff between more shuffles on the critical path vs. closer to saturating 2 mul/clock – Peter Cordes Jun 15 '21 at 13:16
  • @PeterCordes If I need to do N independent multiplications and N independent additions (not for this task), what is the best ordering of the instructions? Should I place all N multiplications and then all N additions after them? Or is it better to interleave them, `1 add 1 mul 1 add 1 mul ... Nth add Nth mul`? Also, in general, is it better performance-wise to place independent instructions of the same kind adjacent to each other, or is it better to mix/interleave different kinds of instructions? – Arty Jun 15 '21 at 17:18
  • All modern x86 CPUs have out-of-order exec, but you can make its life easier by "software pipelining" so independent work appears first. i.e. interleave separate dependency chains. As you can see from uops.info, on your CPU mul and add are literally identical in how the CPU handles them. In fact they both run on the exact same FMA unit. If you can preprocess your coefficients so you can FMA instead of add+mul, that would save latency and throughput. e.g. `(x + a ) * b` is `x*b + a*b` = `fma(x, b, a*b)` so you still have two constants, just different values. – Peter Cordes Jun 15 '21 at 17:38
  • @PeterCordes Actually I wanted to know that for some other task, in which `add` and `mul` are independent of each other. So in that case, will it help performance to write `mul mul mul add add add` or `mul add mul add mul add`, where all 6 operations are independent of each other? If all 6 are independent, is there any point in interleaving operations of different kinds, or is it better to group the same operations together? I know the case of 6 ops will be irrelevant, but what about N = 16 or 32 or 64 ops? I want to help the CPU, code-wise, to do out-of-order execution in the most performant way. – Arty Jun 15 '21 at 18:01
  • No, it won't, like I said the only thing that matters on SKX is their dependency pattern. The only thing that's different between mulps and addps are what they do to the data inside the FMA execution unit that runs them, at which point there's no interaction with other instructions. (However, earlier CPUs before Skylake, do have a separate SIMD FP-add unit on only one port, so in that case it's best to schedule the adds earlier, so multiply instructions can see that port already has uops queued, and get [scheduled to a different port](https://stackoverflow.com/q/40681331)). – Peter Cordes Jun 15 '21 at 18:31
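
Following up on the factoring discussion in the comments above (chtz's product-of-low-degree-factors idea and the multiply-tree reduction Peter Cordes describes), here is a rough sketch of evaluating 8 real quadratic factors in parallel and reducing them with a horizontal product. Everything here (the names, the coefficient layout, absorbing the leading coefficient elsewhere) is my own illustration, and the numerical stability of the factored form still has to be verified.

// Horizontal product of 4 floats: same shuffle pattern as _mm_fast_hsum_ps above,
// with multiplies instead of adds.
inline float _mm_fast_hprod_ps(__m128 v) {
    __m128 shuf  = _mm_movehdup_ps(v);         // (v1, v1, v3, v3)
    __m128 prods = _mm_mul_ps(v, shuf);        // (v0*v1, _, v2*v3, _)
    shuf         = _mm_movehl_ps(shuf, prods); // bring (v2*v3) down to element 0
    prods        = _mm_mul_ss(prods, shuf);    // v0*v1 * v2*v3
    return         _mm_cvtss_f32(prods);
}

// Evaluates the product of 8 monic quadratics (x^2 + b_i*x + c_i), i.e. a degree-16
// polynomial, assuming it factors with real coefficients. B holds the b_i, C the c_i.
inline float EvalFactoredPoly16F32(float x, __m256 B, __m256 C) {
    __m256 vx = _mm256_set1_ps(x);
    __m256 q  = _mm256_fmadd_ps(_mm256_add_ps(vx, B), vx, C);   // (x + b_i)*x + c_i
    __m128 lo = _mm256_castps256_ps128(q);
    __m128 hi = _mm256_extractf128_ps(q, 1);
    return _mm_fast_hprod_ps(_mm_mul_ps(lo, hi));   // multiply the 8 factors together
}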
