
I'm trying to work out the best approach to optimizing some basic routines. In this case I tried a very simple example: multiplying 2 float vectors together:

void Mul(float *src1, float *src2, float *dst)
{
    for (int i = 0; i < cnt; i++) dst[i] = src1[i] * src2[i];
}

The plain C implementation is very slow. I did an external ASM version using AVX and also tried using intrinsics. These are the test results (time; smaller is better):

ASM: 0.110
IPP: 0.125
Intrinsics: 0.18
Plain C++: 4.0

(compiled using MSVC 2013, SSE2, tried Intel Compiler, results were pretty much the same)

As you can see, my ASM code beat even Intel Performance Primitives (probably because I did lots of branches to ensure I can use the aligned AVX instructions). But I'd personally like to use the intrinsics approach: it's simply easier to manage, and I was thinking the compiler should do the best job of optimizing all the branches and stuff (my ASM code sucks in that matter imho, yet it is faster). So here's the code using intrinsics:

    int i;
    for (i = 0; (MINTEGER)(dst + i) % 32 != 0 && i < cnt; i++) dst[i] = src1[i] * src2[i];

    if ((MINTEGER)(src1 + i) % 32 == 0)
    {
        if ((MINTEGER)(src2 + i) % 32 == 0)
        {
            for (; i + 8 <= cnt; i += 8)
            {
                __m256 x = _mm256_load_ps(src1 + i);
                __m256 y = _mm256_load_ps(src2 + i);
                __m256 z = _mm256_mul_ps(x, y);
                _mm256_store_ps(dst + i, z);
            }
        }
        else
        {
            for (; i + 8 <= cnt; i += 8)
            {
                __m256 x = _mm256_load_ps(src1 + i);
                __m256 y = _mm256_loadu_ps(src2 + i);
                __m256 z = _mm256_mul_ps(x, y);
                _mm256_store_ps(dst + i, z);
            }
        }
    }
    else
    {
        for (; i + 8 <= cnt; i += 8)
        {
            __m256 x = _mm256_loadu_ps(src1 + i);
            __m256 y = _mm256_loadu_ps(src2 + i);
            __m256 z = _mm256_mul_ps(x, y);
            _mm256_store_ps(dst + i, z);
        }
    }

    for (; i < cnt; i++) dst[i] = src1[i] * src2[i];

Simple: First get to an address where dst is aligned to 32 bytes, then branch to check which sources are aligned.

One problem is that the scalar C++ loops at the beginning and at the end don't use AVX unless I enable AVX in the compiler, which I do NOT want to do: this is supposed to be just an AVX specialization, and the software should still work on platforms where AVX is not available. And sadly there seem to be no intrinsics for the VEX-encoded scalar instructions such as vmovss, so there's probably a penalty for mixing AVX code with the SSE code the compiler uses. However, even when I enabled AVX in the compiler, it still didn't get below 0.14.
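For what it's worth, the transition penalty can at least be addressed explicitly: `_mm256_zeroupper()` emits `vzeroupper`, and on GCC/Clang a target attribute lets a single function use AVX while the rest of the file is compiled without it (MSVC instead relies on per-file `/arch` settings). A hedged sketch assuming an x86 GCC-style toolchain; `MulAvxBody` is a made-up name, not from my code:

```cpp
#include <immintrin.h>

#if defined(__GNUC__)
__attribute__((target("avx")))   // allow AVX here without enabling it globally
#endif
void MulAvxBody(const float *src1, const float *src2, float *dst, int cnt)
{
    int i = 0;
    for (; i + 8 <= cnt; i += 8)
    {
        __m256 x = _mm256_loadu_ps(src1 + i);
        __m256 y = _mm256_loadu_ps(src2 + i);
        _mm256_storeu_ps(dst + i, _mm256_mul_ps(x, y));
    }
    for (; i < cnt; i++) dst[i] = src1[i] * src2[i];  // scalar tail
    _mm256_zeroupper();  // clear upper YMM state before returning to SSE-only code
}
```

The caller would still need a runtime CPU check (e.g. `cpuid`) before dispatching to this function on machines without AVX.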

Any ideas how to optimize this so the intrinsics reach the speed of the ASM code?

mrzacek mrzacek
  • You could compile just the files that have the specialization with avx enabled. – Jester Mar 17 '15 at 00:28
  • Why don't we just memcpy a short chunk to a known-aligned location and then memcpy to the real dst? – user3528438 Mar 17 '15 at 00:49
  • What were your compilation options? –  Mar 17 '15 at 07:28
  • You're likely not comparing apples to apples (see Hurkyl's answer). Look at the assembly. – Z boson Mar 17 '15 at 08:26
  • See this question [sum-of-overlapping-arrays-auto-vectorization-and-restrict](https://stackoverflow.com/questions/23651055/sum-of-overlapping-arrays-auto-vectorization-and-restrict). I looked at exactly the same thing. – Z boson Mar 17 '15 at 08:27
  • Out of interest, what size are your arrays, what is the typical alignment use case (if there is one), and what do your timing figures (e.g. 0.18 for intrinsics) represent ? – Paul R Mar 17 '15 at 14:26
  • This was just a test example; 0.18 was the number of microseconds for an array of 1024 samples I think, something like that. In practice it can be anything from 1 sample to say 64k. – mrzacek mrzacek Mar 17 '15 at 22:49

2 Answers


Your implementation with intrinsics is not the same function as your implementation in straight C: e.g. what if your function was called with arguments Mul(p, p, p+1)? You'll get different results. The pure C version is slow because the compiler is ensuring that the code does exactly what you said.
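To make the aliasing point concrete, here's a small sketch (not from the question) showing how the scalar loop's strict element-by-element order becomes observable when the arrays overlap:

```cpp
#include <cassert>

// The plain scalar loop from the question: when dst overlaps the sources,
// each store feeds a later iteration's load, so the compiler must preserve
// this exact order unless told the arrays don't alias.
void MulScalar(const float *src1, const float *src2, float *dst, int cnt)
{
    for (int i = 0; i < cnt; i++) dst[i] = src1[i] * src2[i];
}
```

With `float p[4] = {2, 1, 1, 1};` the call `MulScalar(p, p, p + 1, 3)` cascades: each product overwrites the element the next iteration reads, leaving `{2, 4, 16, 256}`. An 8-wide vector version that loads its inputs up front would produce a different answer, which is exactly why the compiler can't vectorize without an aliasing guarantee.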

If you want the compiler to make optimizations based on the assumption that the three arrays do not overlap, you need to make that explicit:

void Mul(float *src1, float *src2, float *__restrict__ dst)

or even better

void Mul(const float *src1, const float *src2, float *__restrict__ dst)

(I think it's enough to have __restrict__ just on the output pointer, although it wouldn't hurt to add it to the input pointers too)
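As an aside, the spelling differs between compilers: MSVC accepts `__restrict`, while GCC/Clang accept `__restrict__` (plain `restrict` is C99, not C++). A portable sketch, with the element count passed explicitly here; the `RESTRICT` macro name is just an example:

```cpp
#include <cassert>

// Hypothetical portable spelling of the restrict qualifier.
#if defined(_MSC_VER)
  #define RESTRICT __restrict
#else
  #define RESTRICT __restrict__
#endif

// Promise the compiler that dst doesn't alias the inputs, so it is free
// to vectorize without runtime overlap checks.
void Mul(const float *src1, const float *src2, float *RESTRICT dst, int cnt)
{
    for (int i = 0; i < cnt; i++) dst[i] = src1[i] * src2[i];
}
```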

  • Good point. But [my experience with `__restrict__` in MSVC](https://stackoverflow.com/questions/23651055/sum-of-overlapping-arrays-auto-vectorization-and-restrict) is that it was ignored. – Z boson Mar 17 '15 at 08:30
  • @Z: I have little experience with MSVC for high-performance stuff, but Intel's compiler should be fine. –  Mar 17 '15 at 08:35
  • Yeah, I would expect ICC would be fine but [I had other problems with ICC in the past](https://stackoverflow.com/questions/17031192/intel-c-compiler-icc-seems-to-ingnore-sse-avx-seetings). But admittedly I have little experience with ICC and that was a long time ago. – Z boson Mar 17 '15 at 08:39
  • @Zboson I don't know about vectorization, but `restrict` on MSVC does work for other things. I've gotten significant speedups from it. (yes on MSVC) – Mysticial Mar 17 '15 at 08:40
  • @Mysticial, okay, that's good to know. I only tried it in that question. I quit using MSVC awhile back. BTW, why do you still use MSVC? – Z boson Mar 17 '15 at 08:42
  • @Zboson 1) It generates better code than ICC for "normal code" on non-AVX processors. 2) It compiles a gazzilion times faster than ICC. 3) It's well integrated into Visual Studio. – Mysticial Mar 17 '15 at 09:26
  • Thanks folks, it helped A LOT indeed! However the compiler generated code is still much slower (like 2x). And yes, from my experience MSVC is much better than ICC - it's faster, the generated code is pretty much the same and mainly the ICC executables are like 2x bigger... – mrzacek mrzacek Mar 17 '15 at 13:50

On CPUs with AVX there is very little penalty for using misaligned loads - I would suggest trading this small penalty off against all the extra logic you're using to check for alignment etc and just have a single loop + scalar code to handle any residual elements:

   for (i = 0; i <= cnt - 8; i += 8)
   {
        __m256 x = _mm256_loadu_ps(src1 + i); 
        __m256 y = _mm256_loadu_ps(src2 + i); 
        __m256 z = _mm256_mul_ps(x, y); 
        _mm256_storeu_ps(dst + i, z);
   }
   for ( ; i < cnt; i++)
   {
       dst[i] = src1[i] * src2[i];
   }

Better still, make sure that your buffers are all 32 byte aligned in the first place and then just use aligned loads/stores.
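For example (a sketch, not specific to MSVC 2013): C++17's `std::aligned_alloc` can hand back 32-byte-aligned buffers, while MSVC spells this `_aligned_malloc`/`_aligned_free`. `AllocAligned32` is a hypothetical helper name:

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Hypothetical helper: allocate n floats aligned to 32 bytes, so the
// aligned AVX load/store forms can be used unconditionally.
float* AllocAligned32(std::size_t n)
{
    // aligned_alloc requires the size to be a multiple of the alignment,
    // so round the byte count up to 32.
    std::size_t bytes = ((n * sizeof(float) + 31) / 32) * 32;
    return static_cast<float*>(std::aligned_alloc(32, bytes));
}
```

Buffers from `std::aligned_alloc` are released with `std::free`.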

Note that performing a single arithmetic operation in a loop like this is generally a bad approach with SIMD - execution time will be largely dominated by loads and stores - you should try to combine this multiplication with other SIMD operations to mitigate the load/store cost.
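To illustrate with a plain-C++ sketch (`MulAdd` is a made-up combined routine): two separate passes, one multiplying and one adding, make two full trips through memory, whereas fusing them pays the load/store cost only once:

```cpp
#include <cassert>

// One fused pass: each element is loaded and stored once, instead of once
// per operation as with separate multiply and add passes.
void MulAdd(const float *src1, const float *src2, const float *src3,
            float *dst, int cnt)
{
    for (int i = 0; i < cnt; i++)
        dst[i] = src1[i] * src2[i] + src3[i];
}
```

The same fusing applies directly to the AVX version (`_mm256_mul_ps` followed by `_mm256_add_ps` on the registers already loaded), roughly doubling the arithmetic done per load/store.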

Paul R
  • Well, as you can see from the numbers, my ASM implementation was actually quite a bit faster than IPP's, say 10%, and the difference was that I did all the work with alignments, and my branches are most likely very bad :). So probably the difference could be even bigger. However, true, I just checked on a newer CPU with Haswell and the difference was much smaller there, but still there was some. Since this routine (and similar ones) is used A LOT, I'm really keen to get the best performance possible. – mrzacek mrzacek Mar 17 '15 at 12:08
  • I just tried your code on the Haswell i7 and it is actually 2x slower than my ASM implementation. I strongly believe the alignment really is relevant. – mrzacek mrzacek Mar 17 '15 at 12:14
  • Thanks - useful to know that my hunch was wrong. Please note my final paragraph though - I/O is your main bottleneck so you should try to combine more operations with the single multiply that you have currently. – Paul R Mar 17 '15 at 12:39
  • I just tried with the Intel Compiler and same thing. From my experience Intel's code is pretty much the same when it comes to performance, just MUCH bigger. About the load/store mitigation - I understand, but in many cases this is all I need to do: multiply 2 arrays. – mrzacek mrzacek Mar 17 '15 at 13:44