I have quite large (degree 20-40), sometimes slowly converging floating-point polynomials. I'd like to optimize their evaluation using SIMD (SSE2, AVX1, AVX-512). I need solutions for both 32-bit float and 64-bit double.
The coefficient values are constants given in advance, and the value of X at which to evaluate the polynomial is passed as a function argument.

Important note: I have just a single input X per function call, so I can't do vertical optimization by computing the polynomial for 8-16 Xs at the same time. That means I need some horizontal optimization within the evaluation for a single X.
I created a related question that helps me compute the powers of X (e.g. X^1, X^2, ..., X^8) that are needed for the SIMD evaluation.
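For reference, here is a minimal sketch of one way to build the vector of powers [1, x, x^2, ..., x^7] in a logarithmic number of vector multiplies; the helper name Powers8F32 is mine, and this is not necessarily the exact approach from the linked question:

#include <immintrin.h>

// Builds [1, x, x^2, ..., x^7] using 2 vector multiplies.
inline __m256 Powers8F32(float x) {
    float const x2 = x * x, x4 = x2 * x2;
    __m256 p = _mm256_setr_ps(1, x, 1, x, 1, x, 1, x);
    // Multiply each second pair by x^2: [1, x, x^2, x^3, 1, x, x^2, x^3]
    p = _mm256_mul_ps(p, _mm256_setr_ps(1, 1, x2, x2, 1, 1, x2, x2));
    // Multiply the upper half by x^4: [1, x, x^2, ..., x^7]
    return _mm256_mul_ps(p, _mm256_setr_ps(1, 1, 1, 1, x4, x4, x4, x4));
}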
Obviously, SIMD should be used only above some threshold of polynomial degree; for quite small polynomials, a straightforward generic (non-SIMD) method based on Horner's (or Estrin's) scheme can be used, like here. The SIMD width (128, 256, or 512 bits) should also be chosen based on the polynomial degree.
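For completeness, this is the kind of scalar Horner baseline I mean (my own sketch, using std::fma so the compiler can emit FMA instructions):

#include <array>
#include <cmath>
#include <cstddef>

// Plain Horner's method: r = (...(A[N-1] * x + A[N-2]) * x + ...) + A[0],
// where A[i] is the coefficient of x^i.
template <size_t N>
inline float HornerF32(float x, std::array<float, N> const & A) {
    float r = A[N - 1];
    for (size_t i = N - 1; i-- > 0;)
        r = std::fma(r, x, A[i]);  // fused multiply-add, single rounding
    return r;
}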
Below I implemented the AVX-256 float32 variant using a modified Horner's method suited to SIMD (multiplying by x^8 instead of x^1). Credits to @PeterCordes for the fast horizontal-sum tutorial. Click on the try-it-online link; it contains a larger version of the code that also has a simple reference evaluation for comparison and time measurements:
#include <immintrin.h>
#include <array>
#include <cstddef>

// Processes the coefficients in blocks of 8, from the highest block down:
// r = r * x^8 + A[I..I+7]. Lane j of the result accumulates the
// coefficients of the powers x^(8k+j).
template <size_t S, size_t I, typename MT = __m256, size_t Cnt>
inline MT EvalPoly8xF32Helper(MT xhi,
        std::array<float, Cnt> const & A, MT r = _mm256_undefined_ps()) {
    size_t constexpr K = 8;
    if constexpr(I + K >= S)
        r = _mm256_loadu_ps(&A[I]);  // loadu: std::array is not 32-byte aligned
    else {
#ifdef __FMA__
        r = _mm256_fmadd_ps(r, xhi, _mm256_loadu_ps(&A[I]));
#else
        r = _mm256_add_ps(_mm256_mul_ps(r, xhi),
            _mm256_loadu_ps(&A[I]));
#endif
    }
    if constexpr(I < K)
        return r;
    else
        return EvalPoly8xF32Helper<S, I - K>(xhi, A, r);
}
// Fast horizontal sum of 4 floats (per @PeterCordes' hsum tutorial).
inline float _mm_fast_hsum_ps(__m128 v) {
    __m128 shuf = _mm_movehdup_ps(v);   // [v1, v1, v3, v3]
    __m128 sums = _mm_add_ps(v, shuf);  // [v0+v1, ., v2+v3, .]
    shuf = _mm_movehl_ps(shuf, sums);   // bring v2+v3 down to lane 0
    sums = _mm_add_ss(sums, shuf);      // (v0+v1) + (v2+v3)
    return _mm_cvtss_f32(sums);
}
// Evaluates sum of A[i] * x^i for i in [0, S) for a single x: the helper
// produces per-lane partial sums, which are then scaled by [1, x, ..., x^7]
// and horizontally summed.
template <size_t S, size_t Cnt>
inline float EvalPoly8xF32(
        float x, std::array<float, Cnt> const & A) {
    auto constexpr K = 8;
    auto const x2 = x * x, x4 = x2 * x2, x8 = x4 * x4, x3 = x2 * x;
    auto const powsx = _mm256_setr_ps(
        1, x, x2, x3, x4, x4 * x, x4 * x2, x4 * x3);
    auto r0 = EvalPoly8xF32Helper<S, (S - 1) / K * K>(
        _mm256_set1_ps(x8), A);
    r0 = _mm256_mul_ps(r0, powsx);
    return _mm_fast_hsum_ps(_mm_add_ps(
        _mm256_castps256_ps128(r0), _mm256_extractf128_ps(r0, 1)));
}
As one can see, the SIMD version gives quite a large speedup compared to the simple reference implementation. For the AVX1-256 float32, degree-32 case it gives around a 4.5x speedup (for degree 16 it gives a 1.8x speedup, which is also good)! Obviously, even just using FMA instructions inside the reference implementation would already noticeably improve computation speed.
My question is whether you can suggest a faster method of evaluating the polynomial, or even some ready-made code or library, or any optimizations to my code.
The most common target CPU that will be used is the Intel Xeon Gold 6230, which has AVX-512, so I need to optimize the code for it.
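For reference, here is a rough sketch of how I imagine the same block-Horner scheme widened to AVX-512 float32 (step x^16). It is untuned, the name EvalPoly16xF32 is mine, it uses a runtime loop instead of compile-time recursion for brevity, and it again assumes the coefficient array is zero-padded so the 16-wide loads stay in bounds:

#include <immintrin.h>
#include <array>
#include <cstddef>

template <size_t S, size_t Cnt>
inline float EvalPoly16xF32(float x, std::array<float, Cnt> const & A) {
    constexpr size_t K = 16;
    float p[K];                     // p[j] = x^j
    p[0] = 1;
    for (size_t j = 1; j < K; ++j)
        p[j] = p[j - 1] * x;
    __m512 const xhi = _mm512_set1_ps(p[K - 1] * x);  // x^16
    size_t i = (S - 1) / K * K;                       // highest block start
    __m512 r = _mm512_loadu_ps(&A[i]);
    while (i >= K) {                // Horner over blocks of 16 coefficients
        i -= K;
        r = _mm512_fmadd_ps(r, xhi, _mm512_loadu_ps(&A[i]));
    }
    r = _mm512_mul_ps(r, _mm512_loadu_ps(p));  // scale lane j by x^j
    return _mm512_reduce_add_ps(r);            // horizontal sum (AVX-512F)
}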