
Let's start by including the following:

#include <vector>
#include <random>
#include <ctime>        // time()
#include <emmintrin.h>  // SSE2 intrinsics
using namespace std;

Now, suppose that one has the following three std::vector<float>:

const int N = 1048576;
vector<float> a(N);
vector<float> b(N);
vector<float> c(N);

default_random_engine randomGenerator(time(0));
uniform_real_distribution<float> diceroll(0.0f, 1.0f);
for(int i=0; i<N; i++)
{
    a[i] = diceroll(randomGenerator);
    b[i] = diceroll(randomGenerator);
}

Now, assume that one needs to sum a and b element-wise and store the result in c, which in scalar form looks like the following:

for(int i=0; i<N; i++)
{
    c[i] = a[i] + b[i];
}

What would be the SSE2 vectorized version of the above code, keeping in mind that the inputs are a and b as defined above (i.e. as a collection of float) and the output is c (also a collection of float)?

After quite a bit of research, I was able to come up with the following:

for(int i=0; i<N; i+=4)
{
    float a_toload[4] = { a[i], a[i + 1], a[i + 2], a[i + 3] };
    float b_toload[4] = { b[i], b[i + 1], b[i + 2], b[i + 3] };
    __m128 loaded_a = _mm_loadu_ps(a_toload);
    __m128 loaded_b = _mm_loadu_ps(b_toload);

    float result[4] = { 0, 0, 0, 0 };
    _mm_storeu_ps(result, _mm_add_ps(loaded_a , loaded_b));
    c[i] = result[0];
    c[i + 1] = result[1];
    c[i + 2] = result[2];
    c[i + 3] = result[3];
}

However, this seems really cumbersome and is certainly quite inefficient: the SIMD version above is actually three times slower than the initial scalar version (measured, of course, with optimizations on, in release mode of Microsoft VS15, and after 1 million iterations, not just 12).

Kim Shutter
    You should examine the assembly of your original attempt without intrinsics. If you have auto-vectorization turned on, the compiler may vectorize for you. Also, if your arrays are not aligned, the loading into vector registers may not be as efficient or can even crash. – scottiedoo Sep 29 '16 at 03:24

2 Answers


Your for-loop could be simplified to

const int alignedN = N - N % 4;
for (int i = 0; i < alignedN; i += 4) {
    _mm_storeu_ps(&c[i],
                  _mm_add_ps(_mm_loadu_ps(&a[i]),
                             _mm_loadu_ps(&b[i])));
}
}
for (int i = alignedN; i < N; ++i) {
    c[i] = a[i] + b[i];
}

Some additional explanation:

  1. A small scalar loop handling the last few floats is quite common; it is mandatory when N % 4 != 0 or when N is unknown at compile time.
  2. I notice that you chose the unaligned load/store versions; there is a small penalty compared to the aligned versions. I found this link on Stack Overflow: Is the SSE unaligned load intrinsic any slower than the aligned load intrinsic on x64_64 Intel CPUs?
Jason L.
  • If `N` is a signed type like `int`, `N % 4` is less efficient than you'd like, because the compiler has to emit asm that could produce `-123 % 4 = -3` if run with that input. So hopefully you used `unsigned N`, but then signed `int i` and `alignedN` would be a bad idea. (Use `size_t` for N, i, and alignedN; this is exactly what it's for and it's an unsigned type.) Otherwise yes, good; generally I don't like the style of nesting too many intrinsics inside each other, but this seems just barely ok without any named `__m128` variables. – Peter Cordes May 03 '21 at 15:43

You don't need the intermediate arrays to load to the SSE registers. Just load directly from your arrays.

auto loaded_a = _mm_loadu_ps(&a[i]);
auto loaded_b = _mm_loadu_ps(&b[i]);
_mm_storeu_ps(&c[i], _mm_add_ps(loaded_a, loaded_b));

You could also omit the two loaded variables and fold those loads directly into the add, although the compiler should do that for you.

You need to be careful with this, as it won't work right if the vector sizes are not a multiple of 4 (you'll access past the end of the array, resulting in Undefined Behavior, and the write past the end of c could be damaging).

1201ProgramAlarm
  • These days it is pretty commonly advertised that "the best way" is to use an OpenMP pragma, or a simd pragma, and get the compiler to vectorize it for you. This is because next year, or the year after, you will need to use AVX, AVX2, or AVX-512 and then do it all again. If your compiler is doing it for you then you just need to recompile on the next machine, not recode. Along with @Jason L's and 1201ProgramAlarm's comments, there are also pragmas for alignment of vectors. I put mine on 512-bit bounds so that the code is "AVX-512 ready". – Holmz Oct 01 '16 at 04:12