I'm working on a matrix-vector multiplication + accumulation function for a neural net, and I've finally decided to vectorize the whole thing manually instead of relying on autovectorization.

I came up with this function:

#include <immintrin.h>
#define re *restrict //just simplification
// computes a2[n2]=w[n2][n1]*a1[n1]+b[n2]
void l_frw(const int n2,const int n1,float re a2,const float re a1,const float w[restrict][n1],const float re b)
{
    __m256 x,y,z;
    __m256 one=_mm256_set1_ps(1.0f);
    for(int i=0; i<n2; i++)
    {
        a2[i]=b[i];
        z=_mm256_setzero_ps();
        for(int j=0; j<n1; j+=8)
        {
            x=_mm256_loadu_ps(&a1[j]);
            y=_mm256_loadu_ps(&w[i][j]);
            z=_mm256_fmadd_ps(x,y,z); //accumulates the partial products of row i into z
        }
        z=_mm256_dp_ps(z,one,0b11111111); //sums the 4 floats of each 128-bit lane, result broadcast within the lane
        a2[i]+=z[0]+z[4]; //add the two lane sums (GNU vector element indexing)
    }
}

(Yes, it only works when the vector lengths are multiples of 8.)
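
For reference, the plain scalar form of what this computes would look roughly like this (just a sketch for clarity; the name l_frw_ref is made up and this isn't my exact benchmark code):

// scalar reference: a2[i] = sum over j of w[i][j]*a1[j], plus b[i]
void l_frw_ref(const int n2,const int n1,float *restrict a2,const float *restrict a1,const float w[restrict][n1],const float *restrict b)
{
    for(int i=0; i<n2; i++)
    {
        float s=b[i];
        for(int j=0; j<n1; j++)
            s+=w[i][j]*a1[j];
        a2[i]=s;
    }
}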

It is about 20% faster than the naive autovectorized version, which is pretty neat, but I'm still looking for improvements. Any suggestions on how to speed this up?

PanJa
  • Your memcpy call is buggy - it needs the size in bytes. It also seems to be redundant? – Paul R Oct 31 '21 at 13:34
  • @PaulR It was not supposed to be there at all. Forgot to delete it from previous tests while copying the code in here. :) Thanks for noticing! – PanJa Oct 31 '21 at 14:29
  • You can hide the latency of `fmaddps` (and save some load overhead of `x`), by calculating rows `i` to `i+3` (or to `i+7`) in parallel. That would also offer a rare occasion to use `haddps` for the final reduction (`dpps` is inefficient in any case, but likely not the main bottleneck of your implementation). – chtz Oct 31 '21 at 16:14
  • @chtz By "parallel" do you mean across multiple threads, or in SIMD? I can't really imagine it, could you elaborate a little bit more? – PanJa Oct 31 '21 at 17:16
  • I think @chtz just means parallelizing via pipelining and out-of-order execution. If you code several instructions in close succession that don't depend on each other's results, the machine can execute them simultaneously. So unroll the loop on `i` and interleave the `fmadd` operations from the various rows. – Nate Eldredge Oct 31 '21 at 17:33
  • @NateEldredge Gotcha! Unroll the `i` loop a bit. I did a similar thing in a Mandelbrot renderer. – PanJa Oct 31 '21 at 17:46
  • Yes, as @Nate said. The trick is to increment `i` in the outer loop by `4` or `8` and make another inner loop from `i` to `i+7` inside the `j` loop (which you can unroll manually or hope that the compiler is able to do). Of course, you have to somehow handle cases where `n2` is not a multiple of `8`. – chtz Oct 31 '21 at 18:23
  • @chtz I've tried to unroll manually. Unrolling by a factor of 2 or 8 slows down the computation by about 2%. 4 seems to be the sweet spot, with a 2-4% speed increase. I can see how it basically reduces the number of loads of `a1` in the innermost loop. – PanJa Oct 31 '21 at 18:31
  • Got an additional 2% speedup by using https://stackoverflow.com/a/13222410/15671081 instead of `dp_ps`. – PanJa Oct 31 '21 at 18:36
  • @chtz Also, yes, the corner cases. Since it's used for a NN it doesn't really matter; hidden layers can be multiples of 8 without any problems, and I'm willing to use padding for inputs/outputs. – PanJa Oct 31 '21 at 18:46
  • See [Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators)](https://stackoverflow.com/q/45113527) re: what @NateEldredge was suggesting, and some of the links to practical examples at the top of my answer there, as well as the latency vs. throughput bottleneck discussion of vector dot products in that answer. Using multiple accumulators is essential: it's not loop overhead that you're trying to minimize, it's latency of the chain of FMA operations through one accumulator. – Peter Cordes Oct 31 '21 at 20:08
  • I would not expect unrolling to slow down the iteration at all, unless it's just done wrongly... The primary point is to load `x` only once; the secondary point would be to shuffle the matrix `w` so that you can just read it linearly. A third point is to check if `n1` is small enough that you can load the whole vector `x` in advance. – Aki Suihkonen Nov 01 '21 at 10:04
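
Putting the comment suggestions together (unroll `i` by 4 with one accumulator per row, load `a1` only once per `j`, and use `hadd` for the final reduction instead of `dp_ps`), a sketch of what the unrolled kernel could look like - this assumes `n2` is a multiple of 4 and `n1` a multiple of 8, and the name `l_frw4` is made up here:

#include <immintrin.h>

// processes 4 rows of w per outer iteration: each row gets its own accumulator,
// so the four fmadd chains are independent and a1[j..j+7] is loaded only once per j
void l_frw4(const int n2,const int n1,float *restrict a2,const float *restrict a1,const float w[restrict][n1],const float *restrict b)
{
    for(int i=0; i<n2; i+=4)
    {
        __m256 z0=_mm256_setzero_ps();
        __m256 z1=_mm256_setzero_ps();
        __m256 z2=_mm256_setzero_ps();
        __m256 z3=_mm256_setzero_ps();
        for(int j=0; j<n1; j+=8)
        {
            __m256 x=_mm256_loadu_ps(&a1[j]); //shared across the 4 rows
            z0=_mm256_fmadd_ps(x,_mm256_loadu_ps(&w[i][j]),z0);
            z1=_mm256_fmadd_ps(x,_mm256_loadu_ps(&w[i+1][j]),z1);
            z2=_mm256_fmadd_ps(x,_mm256_loadu_ps(&w[i+2][j]),z2);
            z3=_mm256_fmadd_ps(x,_mm256_loadu_ps(&w[i+3][j]),z3);
        }
        //reduce all 4 accumulators together with hadd instead of dp_ps
        __m256 t01=_mm256_hadd_ps(z0,z1);   //pairwise sums of rows i and i+1
        __m256 t23=_mm256_hadd_ps(z2,z3);   //pairwise sums of rows i+2 and i+3
        __m256 t=_mm256_hadd_ps(t01,t23);   //each 128-bit lane now holds {s0,s1,s2,s3} of its half
        __m128 s=_mm_add_ps(_mm256_castps256_ps128(t),_mm256_extractf128_ps(t,1)); //low lane + high lane
        _mm_storeu_ps(&a2[i],_mm_add_ps(s,_mm_loadu_ps(&b[i]))); //a2[i..i+3] = dot products + b[i..i+3]
    }
}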

1 Answer


I recently discovered that _mm256_set1_ps/pd/si... is a pretty inefficient intrinsic/instruction (it doesn't always compile to a single instruction, depending on the argument type).

See: https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm256_set1_ps&ig_expand=6147

So you can use:

 const __m256 _one = {1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f}; //all 8 lanes of a __m256 (I think you need C11 or newer)

or if c++:

 constexpr inline __m256 _one = {1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f};

This will work with __m256 and __m256d, however I believe that this is undefined behavior when used with __m256i as integer vector types can have a variable size depending on the width of your int.

i.e., a __m256i can hold 32 chars, but only 4 unsigned long longs.

This also has the benefit of not actually calling any AVX instructions: it just initializes an array of 8 floats to 1.0f, to be referenced and used later, so this style of initialization doesn't explicitly require hardware support for AVX. You could have some constant vectors stored in a cross-platform .dll file and not need to worry about checking for AVX support until you actually call an AVX instruction.
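
For example, a minimal test using the brace-initialized constant (this assumes GCC or Clang, where __m256 supports brace initialization; build with -mavx):

 #include <immintrin.h>
 #include <stdio.h>

 //all 8 lanes listed, as above
 static const __m256 _one = {1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f};

 int main(void)
 {
     float in[8] = {2.0f, 2.0f, 2.0f, 2.0f, 2.0f, 2.0f, 2.0f, 2.0f};
     __m256 v = _mm256_loadu_ps(in);
     __m256 r = _mm256_add_ps(v, _one); //2.0f + 1.0f in every lane
     float out[8];
     _mm256_storeu_ps(out, r);
     printf("%f\n", out[0]); //prints 3.000000
     return 0;
 }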

dave_thenerd
  • Citation needed. What compiler emits multiple instructions in an optimized build for `_mm256_set1_ps( 1.0f )`? Some waste even more instructions in a debug build, but so what? Performance in anti-optimized builds is pretty much irrelevant, and already a disaster with intrinsics. You're right that it would be even worse (even with optimization) to use `const __m256 one = _mm256_set1_ps(1.0f)` because most compilers suck at that and will not constant-propagate into a static initializer. – Peter Cordes Nov 15 '21 at 04:27
  • @PeterCordes See Intel's own website: https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm256_set1_ps&ig_expand=6147 Mouse over "Sequence" Says "This intrinsic generates a sequence of instructions, which may perform worse than a native instruction. Consider the performance impact of this intrinsic." I've been working like mad replacing this intrinsic with the above in all my code, test it for yourself there is a performance difference. – dave_thenerd Nov 16 '21 at 04:59
  • That's talking about non-constant args. I did re-test it for myself, and [commented](//stackoverflow.com/questions/69011215/is-generally-faster-do-mm-set1-ps-once-and-reuse-it-or-set-it-on-the-fly/69968931#comment123685241_69968931) under another one of your answers that's repeating this claim: both ways compile to literally identical asm if you use them in a sane way, and your compiler isn't total garbage. Are you perhaps using an old version of MSVC, which couldn't hoist vector constant setup out of loops after inlining a function? If it's something else, please show a [mcve] of asm diff. – Peter Cordes Nov 16 '21 at 05:04
  • When I've used intrinsics professionally, I look at the compiler's asm output to make sure it compiled efficiently. Over years of doing stuff with intrinsics, `_mm_set1_ps( constant )` "on the fly" inside a function that uses it has always compiled efficiently with GCC and clang. I usually don't care much about MSVC, especially older MSVC that's well known to have serious missed optimizations. – Peter Cordes Nov 16 '21 at 05:07
  • When I've looked at recent MSVC on Godbolt, it's also been fine, like I showed in the links in my comments on your other question, so like I said, on non-terrible compilers with optimization enabled, there's no benefit to changing your code like this, the asm comes out the same. (But also little downside other than style, except sometimes having constants duplicated). – Peter Cordes Nov 16 '21 at 05:13