
I'm looking for the most performant way to compute all of the first integer powers of a floating-point X inside an SSE-128/AVX-256/AVX-512 register (128, 256, and 512 bits). E.g. for the float AVX1-256 case I want to get X^1, X^2, X^3, X^4, X^5, X^6, X^7, X^8 in a register, given X on input. I need code for both float-32 and double-64.

I implemented this code for the AVX-256/float-32 case:

Try it online!

#include <immintrin.h>

__m256 _mm256_powers_ps(float x) {
    float const x2 = x * x, x4 = x2 * x2;
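    // per lane: [x,x2,x,x2,...] * [1,1,x2,x2,...] * [1,1,1,1,x4,x4,x4,x4]
    //         = [x^1, x^2, x^3, x^4, x^5, x^6, x^7, x^8]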
    return _mm256_mul_ps(
        _mm256_mul_ps(
            _mm256_setr_ps( x, x2,  x, x2,  x, x2,  x, x2),
            _mm256_setr_ps( 1,  1, x2, x2,  1,  1, x2, x2)
        ),
            _mm256_setr_ps( 1,  1,  1,  1, x4, x4, x4, x4)
    );
}

I designed it mainly to look simple. But I think performance-wise it can be made faster, maybe with one or two fewer multiplications. Can anybody suggest a faster version? Maybe there is even a single AVX instruction for computing these powers?

I'm interested in both float-32 and double-64 versions for all of the 128-bit (4 floats or 2 doubles), 256-bit (8 floats or 4 doubles), and 512-bit (16 floats or 8 doubles) SIMD registers.
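For the double-64 AVX-256 case (4 powers), the same pattern would look roughly like this (an untested sketch; the name `_mm256_powers_pd` is just chosen by analogy):

__m256d _mm256_powers_pd(double x) {
    double const x2 = x * x;
    // lanes: [x*1, x2*1, x*x2, x2*x2] = [x^1, x^2, x^3, x^4]
    return _mm256_mul_pd(
        _mm256_setr_pd( x, x2,  x, x2),
        _mm256_setr_pd( 1,  1, x2, x2)
    );
}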

I also implemented a possibly faster version that looks more complex but does one less multiplication; I haven't compared their speed yet:

Try it online!

__m256 _mm256_powers_ps(float x) {
    float const x2 = x * x;
    __m256 const r = _mm256_mul_ps(
        _mm256_setr_ps( x, x2,  x, x2,  x, x2,  x, x2),
        _mm256_setr_ps( 1,  1, x2, x2,  1,  1, x2, x2)
    );
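    // r = [x^1, x^2, x^3, x^4, x^1, x^2, x^3, x^4]
    // the second factor broadcasts r[3] = x^4 into the upper half: [1,1,1,1, x^4,x^4,x^4,x^4]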
    return _mm256_mul_ps(r,
        _mm256_setr_m128(_mm_set1_ps(1),
            _mm_permute_ps(_mm256_castps256_ps128(r), 0b11'11'11'11))
    );
}

I was also thinking that a simple non-SIMD solution might be fast too, because many independent multiplications pipeline well:

Try it online!

__m256 _mm256_powers_ps(float x) {
    auto const x2 = x * x;
    auto const x3 = x2 * x;
    auto const x4 = x2 * x2;
    return _mm256_setr_ps(
        x, x2, x3, x4, x4 * x, x4 * x2, x4 * x3, x4 * x4);
}

Note: these powers of X are needed for the stage of computing a float or double precision polynomial, see my other question regarding computing polynomials on SIMD. The polynomial may have different degrees, sometimes 3, sometimes 6, sometimes 9, sometimes 15, even 25 or 32 or 43. These numbers are arbitrary; actually the whole range of degrees 1-48 could be used. Coefficients of the polynomials are given beforehand as constants, but the value X for which a polynomial should be evaluated is not known in advance. Of course I will use FMA to compute the polynomial value itself, but the precomputed powers of X are needed to evaluate the polynomial using SIMD.
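To make the intended use concrete, here is a rough sketch (a hypothetical helper, not my final code) of how the precomputed powers would feed the FMA stage. It assumes I have already built `pw = [1, x, ..., x^7]` and `x8 = set1(x^8)`, and that the coefficient array `c` is zero-padded to `n8 * 8` entries:

static float poly_eval_ps(const float *c, int n8, __m256 pw, __m256 x8) {
    __m256 acc  = _mm256_setzero_ps();
    __m256 xpow = pw;                                   // holds x^(8k), ..., x^(8k+7)
    for (int k = 0; k < n8; ++k) {
        acc  = _mm256_fmadd_ps(_mm256_loadu_ps(c + 8 * k), xpow, acc);
        xpow = _mm256_mul_ps(xpow, x8);                 // advance to the next 8 powers
    }
    // horizontal sum of the 8 partial sums
    __m128 lo = _mm_add_ps(_mm256_castps256_ps128(acc), _mm256_extractf128_ps(acc, 1));
    lo = _mm_add_ps(lo, _mm_movehl_ps(lo, lo));
    lo = _mm_add_ss(lo, _mm_movehdup_ps(lo));
    return _mm_cvtss_f32(lo);
}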

The most common target CPU will be the Intel Xeon Gold 6230, which has AVX-512, so I need to optimize the code for it.
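For the AVX-512/float-32 case (16 powers), one idea is to reuse the 8-power result and multiply it by a broadcast of X^8 (again an untested sketch; `_mm512_powers_ps` is just a name by analogy, and `_mm512_insertf32x8` needs AVX-512DQ, which the Xeon Gold 6230 has):

__m512 _mm512_powers_ps(float x) {
    __m256 p8 = _mm256_powers_ps(x);                                // x^1 .. x^8
    __m256 x8 = _mm256_permutevar8x32_ps(p8, _mm256_set1_epi32(7)); // broadcast x^8 (AVX2 vpermps)
    __m256 hi = _mm256_mul_ps(p8, x8);                              // x^9 .. x^16
    return _mm512_insertf32x8(_mm512_castps256_ps512(p8), hi, 1);
}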

Arty
  • 6
    This design is flawed. SSE/AVX works best when you have *multiple* data to work with, with a strong emphasis on *multiple*. But you have only *one* datum, `X`. Do not expect a large gain by using SSE/AVX for your specification. – j6t Jun 13 '21 at 07:16
  • 2
    Since you're trying to get a speedup for one single `x` input, what surrounding code does this have to overlap with? Is it more sensitive to latency of this operation, or to front-end throughput cost, or back-end port pressure on any given port? As @j6t said, if you have more than one `x` value to do this for, compute vectors with the same operation on the whole vector. If necessary, transpose a whole 8x8 set of vectors at the end if you need to store `x1, x2, x3, ...` to memory in that order, so you're not wasting your shuffles. – Peter Cordes Jun 13 '21 at 07:24
  • @j6t Why SIMD? Because these powers are needed as part of a SIMD flow: there are a lot of vectorized operations, but in between them there is this powers-computing stage. So I thought it could also be done in SSE/AVX registers somehow. Moreover, if you look at my code above, it does 4 multiplications (3 multiplications in the 2nd version) instead of the 7 multiplications needed for a non-SIMD version, so SIMD helps greatly here! – Arty Jun 13 '21 at 07:24
  • 2
    (Also, check the asm for `_mm256_setr_ps` - it might be doing a bad job and not realizing that it's just a 64-bit broadcast of `x, x2`, which it could do with `vmovlhdup xmm0, xmm1` / `vmulss xmm0, xmm0` / `vbroadcastpD ymm0, xmm0`) – Peter Cordes Jun 13 '21 at 07:25
  • @PeterCordes I need to optimize latency. I have only single `X`. This X powers computing stage appears in the middle of other SIMD operations. – Arty Jun 13 '21 at 07:26
  • 1
    It's FP math, so of course it's going to be done in SIMD registers one way or another on x86-64; scalar FP uses XMM regs. But if it's part of a sequence of SIMD operations, what form are your inputs actually available in, and what exact form do you need the result in? (And could that be changed?) – Peter Cordes Jun 13 '21 at 07:27
  • 1
    So it's part of one long dependency chain involving the just one vector of data? Not part of something you're doing to multiple vectors of data, which would let out-of-order exec overlap independent work? Most use-cases for SIMD have enough data parallelism that it's easy to create instruction-level parallelism, so it's rare for data latency to be the only thing that matters with SIMD. – Peter Cordes Jun 13 '21 at 07:28
  • @PeterCordes Actually I have X on input as a `double`. It can also be fed as the lowest element of a 128 or 256 register if needed. The result is needed as a 256 register. – Arty Jun 13 '21 at 07:29
  • 2
    @PeterCordes Yes, there is a lot of full-vector work done, then from the results of this work a single X is extracted, its powers are computed into a 256 register, and again a lot of SIMD work follows. It is not part of huge data; my full function gets only a single data point. – Arty Jun 13 '21 at 07:34
  • 2
    Ok, that finally makes some sense. I guess there's no later work that's independent of `x`? Anyway, you could just as easily have `x` as the low element of a `__m256` or `__m128`. Or if it wasn't already the low element, you could `vpermps` to broadcast it *instead* of just shuffling it to the bottom. That makes it easier to optimize the shuffling instead of having to use scalar `_mm_set`. – Peter Cordes Jun 13 '21 at 07:40
  • 2
    Note that there is a `_mm_mul_ss` intrinsics for `[v]mulss` to just multiply the low scalar element and leave the others unmodified; this looks like a use-case for it. – Peter Cordes Jun 13 '21 at 07:40
  • @PeterCordes Without doing an extra speed measurement, can you please look at my two code snippets above and tell whether the second is faster? The second does one less multiplication, but does other extra permutation work. The first one, while having one more multiplication, still has two multiplications happening in parallel, so it should pipeline well. So the first one might even be faster because of good pipelining and simplicity. – Arty Jun 13 '21 at 07:51
  • 2
    You say "extra permutation stuff", but the first one has 3 `_mm256_setr_ps`, which if done naively could take a lot of shuffles each! You aren't even trying to construct one from the other, and I'd guess only clang has much chance of noticing the similarities between them and creating one from the other by shuffling or blending it, instead of restarting from the scalar inputs. Are you talking about the asm? If so, which compilers do you care about it compiling efficiently with? (Actually in your Godbolt link, even gcc isn't as bad as I thought it might be, but certainly not great.) – Peter Cordes Jun 13 '21 at 08:00
  • 2
    Have you tried using IACA or LLVM-MCA to do simple pipeline analysis (including of latency bottlenecks)? They're usually only set up to handle loops, so you might have to craft a caller that recycles a vector back into a scalar float inside a loop, to make this code the loop-carried dependency. Then you can try stuff on your own and get some automated feedback. (And it would give you something to actually microbenchmark on real hardware.) Also note that you don't need to include a `main` on Godbolt; that's just clutter. – Peter Cordes Jun 13 '21 at 08:05
  • @PeterCordes Right now I target CLang compiler, it will be a part of C++ code. But in the future MSVC/GCC could be needed too. – Arty Jun 13 '21 at 08:06
  • @PeterCordes Never heard of `IACA or LLVM-MCA` and don't know how to use them. But would be great to learn! – Arty Jun 13 '21 at 08:07
  • 2
    If you care about it not sucking with MSVC, you'd better avoid that much `_mm_setr_ps` then. Look at all the naive `vinsertps` in https://godbolt.org/z/bj5rK9jfP. Re: IACA and LLVM-MCA: see [(How) can I predict the runtime of a code snippet using LLVM Machine Code Analyzer?](https://stackoverflow.com/q/54107384) and [What is IACA and how do I use it?](https://stackoverflow.com/q/26021337). and google them. Godbolt can add an llvm-mca "analysis" view to compiler output. – Peter Cordes Jun 13 '21 at 08:08
  • @PeterCordes Just updated my question's post, added a 3rd code snippet with a non-SIMD version of the multiplications, which could be even faster due to good pipelining of independent multiplications. – Arty Jun 13 '21 at 08:20
  • @MarekR Just added a 3rd snippet to my question, with plain generic C++ multiplication code. There is a 'Try it online' link there from which you can see that Clang created quite simple and straightforward scalar code using `vmulss` instructions. So it didn't emit any combined SSE/AVX instructions. – Arty Jun 13 '21 at 08:28
  • 2
    @Arty: yeah leaving it all to the compiler has good ILP, but Intel CPUs before Ice Lake only have 1/clock shuffle throughput, so all those `vinsertps` instructions to implement `_mm256_setr_ps` (https://godbolt.org/z/fxxvsnncx) can't run in parallel with each other because of resource conflicts (limited back-end throughput), so the minimum latency isn't as good as one might hope. Still, it does avoid chaining together any multiplies, and at 4c latency those are much slower than 1c shuffles. Worth considering. – Peter Cordes Jun 13 '21 at 08:29
  • 1
    @Arty: Err wait a minute, there's inherently a dependency between multiply instructions in stuff like `x4 = x2*x2` having to wait for x2 to be ready. So overlapping the multiply and shuffle latencies is probably better, if done carefully. – Peter Cordes Jun 13 '21 at 08:32
  • 1
    Can you arrange for your previous code to already have `x` broadcast to a `__m128` without using any extra shuffle instructions? e.g. if it's the sum of an array, do your hsum with shuffles that leave the result everywhere, i.e. high <-> low instead of high -> low shuffles. Or is it naturally at the bottom of a vector without any extra shuffles? (Or is it say the 2nd element, and the rest hold garbage?) – Peter Cordes Jun 13 '21 at 12:08
  • 1
    (Along with `mulss`, that would make `unpcklps` more powerful for creating a vector like x, x2, x,x2. `shufps` needs both elements of each 64-bit half to come from the same source, so it's hard to use with `x` and `x2` in the bottom of different vectors. `vpermps` only has 1 input. and `unpcklpd` / `movlhps` aren't much help.) – Peter Cordes Jun 13 '21 at 12:18
  • @PeterCordes You may assume that X comes in at the bottom of a YMM register and the rest is 0 (actually maybe garbage, I didn't check). – Arty Jun 13 '21 at 13:39
  • 1
    What do you want to do with this function. Regarding the usage, it may be possible to significantly speed up the code based on math tricks. – Jérôme Richard Jun 13 '21 at 15:58
  • 1
    A couple design ideas: https://godbolt.org/z/zqGv51rn3, both with the same critical path latency (3 multiplies, 4x 1c shuffles, 1x 3c lane-crossing shuffle). Two of the shuffles can run in parallel on Zen / Ice Lake. One version has an extra multiply (starting in the same cycle as another.) I was hoping to get more benefit from `mulss`. I might write that up as an answer sometime. – Peter Cordes Jun 13 '21 at 15:58
  • @PeterCordes Thanks for the approximate code at godbolt. Forgot to mention: for my task I actually need solutions for all of the 128, 256, 512 registers (with max degree of 4, 8, 16). Not because of different target CPUs but because of different cases of input - some inputs will need 4 powers, some 8 powers, some 16 powers, depending on some threshold. Also I need both float-32 and double-64 solutions. This is just a note for you in case it influences your solution somehow. – Arty Jun 13 '21 at 19:15
  • Wait, do you have AVX-512 available even for the narrow versions?? `vpermt2ps` could save a shuffle, and merge-masking for multiplies and shuffles could be very useful. Can vector and mask constants usefully be reused, or is there a lot of work to do between these steps? (I guess it could be worth reloading a constant if it saves critical path latency, as long as it doesn't cache miss more than it saves. Setting up constants is of course always off the critical path.) Or do you need this for actual legacy-SSE and AVX1 as well? For legacy-SSE, extra movaps register copies can matter. – Peter Cordes Jun 13 '21 at 19:29
  • @PeterCordes I can't guarantee any target CPU, so I have to write SSE2/AVX1/AVX-512 code separately. But even on an AVX-512 machine, at some thresholds it could happen that the 256 version is faster than the 512 version due to the mathematics of this powers precomputation. Of course, if it helps on an AVX-512 machine, one can do the 256 computation using half of a 512 register, if half an AVX-512 register is faster for some reason than AVX1-256. But I also need to target CPUs that have only SSE2, or only SSE2+AVX1, or all of SSE2+AVX1/2+AVX-512. – Arty Jun 13 '21 at 19:36
  • @PeterCordes For example, for an input with some imaginary value `K <= 6` it is faster to use 128-bit computation even on an AVX-512 machine, so I have to use 25% of the whole AVX-512 register to gain more speed. For another threshold `6 < K <= 13` I need to use 256 bit even on an AVX-512 machine. And for `13 < K` I can use full AVX-512. `K` here is just some imaginary parameter that controls the threshold of the decision about choosing 128/256/512; it is not the number of floats. So 512 is not always the fastest way to compute the final task - not the task of powers but the full task of my function. – Arty Jun 13 '21 at 19:43
  • Do you care about any of these versions more than others? That's too many different versions to hand-optimize them all separately, unless this is a really critical part of your program and saving a cycle of latency would matter significantly. (I mean, if you wanted to pay me to optimize every different version of this for every different combo of SIMD feature level and vector / element width, I'd certainly be willing to do it. And also tune different versions for Zen 1 vs. Ice Lake vs. Skylake if you want. But for an SO answer, that's obviously way too much repetition.) – Peter Cordes Jun 13 '21 at 20:01
  • AVX-512 provides some fundamentally new tools that are great for any vector width, so I'd expect that to be significantly different. I haven't seen a use yet for `insertps` and my ideas so far for AVX1 basically uses things that would work in SSE2 (no need to widen to 256 at an early stage when that would be redundant and need a bunch of `1.0`), and builds the __m256 result out of a __m128. I haven't looked at trying to minimize `movaps` instructions in it, though. But I expect `double` will be significantly different from `float`, probably much easier. – Peter Cordes Jun 13 '21 at 20:06
  • 1
    @PeterCordes I think most of my CPUs will have AVX-512, so we can stick to this version. But as I said, even within AVX-512 I 100% need all three of 128/256/512. Meaning that for some input numbers I need 4 float powers, for some 8 float powers, for some 16 float powers. So even if a 512 register is used, then for 4 float powers we need to use only its first 128 bits. The most common case of inputs by far will be the 256-bit one, meaning that, as you already started, you can use an AVX1-256 register with 8 floats and I will somehow generalize it to the other cases by myself. Actually 16 floats is rare. – Arty Jun 13 '21 at 20:09
  • Ok, so you're mostly going to run this on modern servers, I guess? So you probably care about tuning for Skylake-X, and possibly future Ice Lake. You should really put some of those details in the question, and tag it avx-512. – Peter Cordes Jun 13 '21 at 20:12
  • @PeterCordes Basically I need sometimes to compute just 4 powers (float or double), sometimes 8 powers (float or double), sometimes 16 powers (float or double). And this doesn't depend on CPU, but depends on input values to my function. – Arty Jun 13 '21 at 20:12
  • Yeah, you already clearly stated the correctness requirements. I was wondering about tuning, like I said. Ice Lake has another shuffle unit on port 1 that can run a few commonly-used shuffle uops; SKX doesn't. – Peter Cordes Jun 13 '21 at 20:14
  • @PeterCordes My code will be run mostly on [Intel Xeon Gold 6230](https://ark.intel.com/content/www/us/en/ark/products/192437/intel-xeon-gold-6230-processor-27-5m-cache-2-10-ghz.html) that has AVX-512. – Arty Jun 13 '21 at 20:16
  • @PeterCordes Actually, for you to know - these powers of `X` are needed for the stage of computing a float or double precision polynomial. This polynomial may have different degrees, sometimes 3, sometimes 6, sometimes 9, sometimes 15, even 25 or 32. These numbers are arbitrary; actually the whole range of 1-32 degree polynomials could be used. Coefficients of the polynomials are given beforehand as constants. But the value `X` for which it should be computed is not known in advance. Of course I will use FMA to compute the poly value itself, but the powers are needed too. – Arty Jun 13 '21 at 20:23
  • 1
    When you only need a 256-bit result, are the surrounding instructions still going to involve some 512-bit vector instructions? If yes, the vector ALUs on port 1 will be shut down, making it impossible to start two multiplies and a shuffle in the same cycle. That *probably* won't matter, and since you say there's no independent work, it's just this code limited by ILP. – Peter Cordes Jun 13 '21 at 20:25
  • @PeterCordes When I need 256 result then all surrounding instructions will also be just 256-bit. – Arty Jun 13 '21 at 20:26
  • You're using this to evaluate polynomials?? Why not FMA with the coefficients as you go, like normal? See https://github.com/vectorclass/version2/blob/ff7450acfad9d3a7c6825d92cfb782a42ccfa71f/vectormath_common.h#L161 (including the comment about shortening dep chains for higher-order polynomials.) Or do you need all the terms separately, not just a sum? – Peter Cordes Jun 13 '21 at 20:29
  • @PeterCordes Actually if I compute powers `X^1, ..., X^8` then afterwards I need somehow to split them into two registers, first register should contain `1, X^1, ..., X^7` (note 1 in first place), second register should contain `X^8, X^8, ... X^8` (value of X^8 broadcasted). So if you know how to convert to such two registers please tell. – Arty Jun 13 '21 at 20:33
  • Um, why not compute the `1.0, x, x2, x3, ...` you need in the first place, instead of `x1 .. x8`? Computing set1(x^8) alongside that sequence is probably just as easy. But if you did make things harder for yourself, use AVX512F `valignd` to shift in a 1.0, and use `vpermps` to broadcast the top element. – Peter Cordes Jun 13 '21 at 20:37
  • @PeterCordes In your linked source the poly is computed one value at a time, without using SIMD (actually using SIMD but with a single FMA value, i.e. `_mm256_fmadd_ss()`). But if my poly has a degree like 48, of course it is much faster to precompute the powers `X^1... X^8` and then do FMA with 8 or 16 floats at a time. Your linked source is optimal for degrees below about 6. Starting from degree around 8, SSE2 already helps. And from degree 16 AVX1 helps; from 32, AVX-512 is useful. And I have a lot of 32-48-degree polys. – Arty Jun 13 '21 at 20:37
  • 1
    It's optimized for computing the same polynomial on 4 or 8 different `x` values in parallel, using SIMD for what it's good at (vertical operations, doing the same thing to multiple values, instead of horizontal stuff). If you can't do that, it might still be even more efficient to incorporate the coefficients at some point as you go. Or some kind of hybrid strategy involving some different coefficients of the same polynomial in the same vector, but also maybe using FMA. e.g. maybe getting the even and odd powers running, or just the powers from 1..4, so it's less shuffling to hsum later. – Peter Cordes Jun 13 '21 at 20:44
  • 1
    Yeah, thinking about this some more, it probably makes sense to widen to at most 1/4 the degree of the polynomial, and loop with multiple accumulators over the array of coefficients. e.g. two vectors of x1 .. x4 (or the other being x5..x8 maybe?), looping FMAs over the array of coefficients (starting from the end, or storing it starting with the highest coefficient in which case you want x4..x1). You get from x1..x4 to x9..x12 by multiplying by set1(x8), so you do need to create that from your set1(x4) vector, but in parallel with the FMAs from the array. – Peter Cordes Jun 13 '21 at 21:55
  • 1
    I think at the end you can combine two adjacent accumulators with just an FMA with set1(x4), then (if you had 4 accumulators) those pairs with set1(x8). That should get all the summed coefficients multiplied by enough `x`s to be the right final value. Then you have an hsum of 4 elements, which takes 2 shuffles and 2 adds. – Peter Cordes Jun 13 '21 at 22:00
  • 1
    I think using SIMD to speed up evaluating a single polynomial is a more useful and widely applicable question than what you've asked; you should probably modify it to ask about the larger problem instead, since this is a case where optimizing work into surrounding code makes sense. Or maybe ask a separate question, and this question can just be a building block. (And usually focused on only needing to widen a limited amount. Widening at startup means another shuffle/add reduce step at the end so it comes at a high price.) – Peter Cordes Jun 13 '21 at 22:02
  • @PeterCordes By asking another question about computing a poly in SIMD I'm afraid of getting negative feedback, like that poly computation is not for SIMD at all, that SIMD is mostly useful for vertical vectorization, etc. But I did my own investigation; I already implemented a non-SIMD variant and SIMD ones (SSE2, AVX1) and compared the timings of all of these - and it appeared that the SSE2 variant gives about a `2x` speedup on a large poly like degree 30, and even on smaller polys like degree 8-10 it gives a `1.4-1.5x` speedup. AVX1 gives a `3.5x` speedup on large polys (degree 30) and about `1.5x` on smaller 8-10-degree ones. – Arty Jun 14 '21 at 04:06
  • @PeterCordes Just created [another question](https://stackoverflow.com/questions/67955549/) as you suggested, regarding computing polys on SIMD. – Arty Jun 14 '21 at 05:24
  • @PeterCordes Have you noticed my previous comment? Also can you tell - if I have AVX-512, does it mean that the SSE2 and AVX1 sets on this CPU are implemented using the AVX-512 execution units? Meaning that SSE2/AVX1 within the CPU don't have separate execution units; they use the AVX-512 ones? Does it mean that the speed of the same math operations for SSE2, AVX1, AVX-512 will be the same on a given fixed CPU? For example, if I call `_mm_mul_ps()` or `_mm256_mul_ps()` or `_mm512_mul_ps()`, will all of them take the same number of cycles on the same CPU? – Arty Jun 14 '21 at 07:42
  • 1
    Look at https://uops.info/ and https://agner.org/optimize/. `vmulps xmm` (and FMA) competes for the same execution ports as `vmulps ymm` and `zmm`. (Although with 512-bit ZMM there's some funny business, disabling port 1, but there's a separate 512-bit FMA unit on port 5.) – Peter Cordes Jun 14 '21 at 07:54
  • @PeterCordes Does it mean if I use only half of 512 register (upper half is zero) then it is better to use `_mm256_mul_ps()` for casted lower part instead of using `_mm512_mul_ps()`? Same for FMA? – Arty Jun 14 '21 at 08:01
  • 1
    Yes, of course. It wastes less power (allowing higher turbo), and also avoids clock speed throttling effects of using wide SIMD instructions. Only use 512-bit vectors when you can really make full use of them for extended periods. [SIMD instructions lowering CPU frequency](https://stackoverflow.com/q/56852812) – Peter Cordes Jun 14 '21 at 08:04
  • @PeterCordes Also, I always wanted to ask somebody - do you know if Clang can detect independent generic float math patterns and somehow automatically arrange these operations into SSE2/AVX1/AVX-512 vectors? Do you need to specify `-ffast-math` for this? I noticed that Clang sometimes combines two double operations into SSE2-128 registers. But I never noticed it combining into AVX1-256 (4 doubles) operations, even if the `-mavx` flag is set. Is there any way to suggest to Clang to combine into AVX1-256 and AVX-512 with some special command line flag? – Arty Jun 14 '21 at 08:09
  • It's a question of whether that's worth doing. Use `-march=native` not just `-mavx` to tell clang what machine it's tuning for, not just what ISA extensions are available. But note that lane-crossing shuffles have higher latency (3c vs. 1c) so it's less often worth it to do too much shuffling together. Obviously in a loop or something, it will auto-vectorize over an array with wide vectors (up to the default prefer-vector-width, which is often 256 even on AVX-512-capable systems). -ffast-math may sometimes help even in cases where an optimization seems legal without it, esp. for GCC. – Peter Cordes Jun 14 '21 at 08:13
  • @PeterCordes Thanks! Have you noticed [my other question](https://stackoverflow.com/questions/67965164/efficiently-evaluate-large-polynomials-using-simd) regarding SIMD poly eval, you wanted/suggested it to be asked? – Arty Jun 14 '21 at 08:22
  • Yes, I have noticed it. Haven't gotten around to doing more than reading it yet. – Peter Cordes Jun 14 '21 at 08:39
  • @PeterCordes Can you also suggest: if some C++ inline function has more than one resulting SIMD register, what is the most efficient way for the compiler to return those registers? For example, for 3 return registers shall I have a return value like `std::tuple<__m256, __m256, __m256>`, or is it maybe better to have a single `__m256` return plus two pointers `__m256 * ret` as function params? Or even references like `__m256 & ret` as params? That is for the case of an inline function. Or are the tuple, pointer and reference solutions the same for the compiler to optimize? Also, if it is not an inline function, what is better? – Arty Jun 14 '21 at 08:52
  • 1
    By reference or pointer output args normally optimize away fully after a function inlines. Probably so would a std::tuple. Check the asm on Godbolt, of course. – Peter Cordes Jun 14 '21 at 09:52
  • @PeterCordes One more question related to poly eval - what is the most efficient way to compute the sum of all 4 floats inside a 128 register? The sum should be returned to C++ as a regular float. Is it by doing `_mm_hadd_ps()` two times and then `_mm_cvtss_f32()`? Or by doing `_mm_store_ps()` to some tmp aligned array `a` of 4 floats and then computing `a[0] + a[1] + a[2] + a[3]`? Which way is faster? What could be the cycle-timing difference between these two ways? What about the same question for a 256 register (sum of 8 floats)? – Arty Jun 14 '21 at 11:07
  • 1
    If you'd googled, you'd probably have found [Fastest way to do horizontal SSE vector sum (or other reduction)](https://stackoverflow.com/a/35270026), which covers that in detail. – Peter Cordes Jun 14 '21 at 11:19
  • @PeterCordes Another question, if you don't mind - if I use AVX-512's `_mm512_mul_ps()`, how many independent such operations can be done in parallel on a modern CPU, for example Skylake? In Intel's intrinsics guide I see latency 4 and throughput 0.5 for Skylake. What does that actually mean? Does it mean that if I have 4 adjacent independent `_mm512_mul_ps()` instructions then all 4 will run in parallel? So that the 1st instruction finishes at cycle 4, the 2nd at cycle 4.5, the 3rd at cycle 5, the 4th at cycle 5.5? What about 8 instructions? What is the optimal number - 2 or 4 or 8 mul instructions? – Arty Jun 14 '21 at 19:27
  • 1
    It means 8 (independent) mul/add/FMAs in flight to saturate the FMA units, starting two every cycle. [latency vs throughput in intel intrinsics](https://stackoverflow.com/q/40878534) / [Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators)](https://stackoverflow.com/q/45113527). Of course, with 512-bit uops there are only 2 execution ports for *any* vector ALU instruction, so shuffles and blends (but not loads/stores) compete for that 2/clock back-end throughput limit. – Peter Cordes Jun 14 '21 at 19:39
  • 1
    See also [How many CPU cycles are needed for each assembly instruction?](https://stackoverflow.com/a/44980899) for more basics on instruction-level parallelism and throughput vs. latency. – Peter Cordes Jun 14 '21 at 19:41
  • @PeterCordes Thanks! Before I read all those links, can you tell if there is any online calculator for this? For example, I tell the calculator that I use the `mul` or maybe `fmadd` instruction and what CPU I have, for example Skylake. Then the calculator tells me how many independent instructions of this kind I should place adjacent to each other to saturate the whole multiply unit. I need to know this also for older CPUs, for AVX1-256 and SSE2-128, and for both float-32 and double-64. – Arty Jun 14 '21 at 19:45
  • 1
    LLVM-MCA analysis on Godbolt, like we already discussed, can analyze a loop and find latency and/or throughput bottlenecks, and show you uops -> ports distribution, so you can see if there are lots of spare cycles on ports 0 and 1 (or ports 0 and 5 for 512-bit FMAs). Look at https://uops.info/ for instructions -> ports without having to write code using them. – Peter Cordes Jun 14 '21 at 19:48
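Following up on the valignd / vpermps suggestion in the comments above, here is a rough untested sketch of producing `[1, X^1, ..., X^7]` plus a broadcast of `X^8` from an already computed `p = [X^1 ... X^8]` (AVX-512VL assumed; the helper name is mine):

void split_powers_ps(__m256 p, __m256 *lo, __m256 *hi8) {
    __m256i ones = _mm256_castps_si256(_mm256_set1_ps(1.0f));
    // valignd: concat(p:ones) shifted right by 7 dwords -> [1, x^1, ..., x^7]
    *lo  = _mm256_castsi256_ps(
               _mm256_alignr_epi32(_mm256_castps_si256(p), ones, 7));
    // vpermps with index 7 in every lane -> broadcast the top element x^8
    *hi8 = _mm256_permutexvar_ps(_mm256_set1_epi32(7), p);
}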

0 Answers