
I have just figured out how to compute the dot product of two arrays (as in the following code):

int A[8] = {1,2,3,4,5,1,2,3};
int B[8] = {2,3,4,5,6,2,3,4};

float result = 0;

for (int i = 0; i < 8; i ++) {
    result += A[i] * B[i];
}

is equivalent to (in SIMD):

int A[8] = {1,2,3,4,5,1,2,3};
int B[8] = {2,3,4,5,6,2,3,4};

float result = 0;

__m128 r1 = {0,0,0,0};
__m128 r2 = {0,0,0,0};
__m128 r3 = {0,0,0,0};

for (int i = 0; i < 8; i += 4) {
  float C[4] = {A[i], A[i+1], A[i+2], A[i+3]};
  float D[4] = {B[i], B[i+1], B[i+2], B[i+3]};
  __m128 a = _mm_loadu_ps(C);
  __m128 b = _mm_loadu_ps(D);

  r1 = _mm_mul_ps(a,b);                     // elementwise products
  r2 = _mm_hadd_ps(r1, r1);                 // pairwise sums
  r3 = _mm_add_ss(_mm_hadd_ps(r2, r2), r3); // full horizontal sum, accumulated
  _mm_store_ss(&result, r3);
}

I am now curious how to write the equivalent SIMD code when the elements I want to multiply aren't consecutive in the array. For example, if I wanted to perform the following, what would the SIMD equivalent be?

int A[8] = {1,2,3,4,5,1,2,3};
int B[8] = {2,3,4,5,6,2,3,4};

float result = 0;
for (int i = 0; i < 8; i++) {
    for (int j = 0; j < 8; j++) {
        result += A[foo(i)] * B[foo(j)];
    }
}

foo is just some function that returns an int computed from its input argument.

Deep Mitra
  • Have you had a crack at it yourself? Trying and failing will teach you faster than being given the answer. – IKavanagh Aug 18 '15 at 22:04
  • Is there a confusion in your data types between `int` and `float` and `__m128`? – Weather Vane Aug 18 '15 at 22:06
  • Also, while you are at it, try using OpenCL. On an Intel system, this lets you bypass the need to compile the code for each processor while getting the same speed as the intrinsics. – Mikhail Aug 18 '15 at 22:06
  • You ought to focus on getting the first example right before moving on to anything more complex. There are several mistakes, and even when these are fixed the code is far from optimal. Since SIMD is all about getting close to maximum theoretical best case performance you really have to pay a lot more attention to detail and be prepared to keep iterating through the optimisation process until the code is nearly perfect. – Paul R Aug 18 '15 at 22:29
  • Well that probably deserves to be a Stack Overflow question in its own right, but you should probably get rid of the int to float conversions, do everything in the integer domain, and move the horizontal operations outside the loop, just for starters. You need to tighten up the requirements too, e.g. how many elements max, what is the range of the input data, etc. – Paul R Aug 18 '15 at 22:39
  • Search on the `[sse]` or `[simd]` tags here on StackOverflow - there are many good questions and answers which are worth studying. – Paul R Aug 18 '15 at 22:54
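
(As a minimal sketch of the direction the comments above point in, not the questioner's code: keep the multiply-accumulate in the integer domain, here with the SSE4.1 _mm_mullo_epi32 intrinsic, and defer the horizontal reduction until after the loop. Compile with -msse4.1.)

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    int A[8] = {1,2,3,4,5,1,2,3};
    int B[8] = {2,3,4,5,6,2,3,4};

    __m128i acc = _mm_setzero_si128();
    for (int i = 0; i < 8; i += 4)
    {
        __m128i a = _mm_loadu_si128((const __m128i *)(A + i));
        __m128i b = _mm_loadu_si128((const __m128i *)(B + i));
        acc = _mm_add_epi32(acc, _mm_mullo_epi32(a, b)); // 4 products per step
    }

    // One horizontal sum, done once after the loop.
    int partial[4];
    _mm_storeu_si128((__m128i *)partial, acc);
    int result = partial[0] + partial[1] + partial[2] + partial[3];
    printf("%d\n", result);                              // prints 90
    return 0;
}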

2 Answers


If I had to do this task, I would do it as follows:

int A[8] = {1,2,3,4,5,1,2,3};
int B[8] = {2,3,4,5,6,2,3,4};

// Gather the permuted elements once, converting them to float as we go.
float PA[8], PB[8];
for (int i = 0; i < 8; i++)
{
    PA[i] = A[foo(i)];
    PB[i] = B[foo(i)];
}

// Accumulate four partial sums in parallel.
__m128 sums = _mm_setzero_ps();
for (int i = 0; i < 8; i++)
{
    __m128 a = _mm_set1_ps(PA[i]);       // broadcast PA[i] to all four lanes
    for (int j = 0; j < 8; j += 4)
    {
        __m128 b = _mm_loadu_ps(PB + j);
        sums = _mm_add_ps(sums, _mm_mul_ps(a, b));
    }
}

// Reduce the four partial sums to a single scalar.
float results[4];
_mm_storeu_ps(results, sums);
float result = results[0] + results[1] + results[2] + results[3];
ErmIg

Generally speaking, SIMD does not like such things as random access to individual elements. However, there are still several tricks that can be used.

If the indices produced by foo are known at compile time, you can probably shuffle both vectors to align their elements properly. Look at the intrinsics in the swizzle category of the Intel Intrinsics Guide. Most likely you'll need something like _mm_shuffle_ps and _mm_unpackXX_ps. Various shift/align instructions may also be useful.
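
For instance, here is a minimal sketch assuming foo maps the indices 0..3 to the made-up permutation {2,0,3,1}; a single _mm_shuffle_ps then reorders the lanes in-register:

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    float src[4] = {10.0f, 20.0f, 30.0f, 40.0f};
    __m128 v = _mm_loadu_ps(src);
    // _MM_SHUFFLE lists the lane selectors as (elem3, elem2, elem1, elem0).
    __m128 p = _mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 3, 0, 2));
    float out[4];
    _mm_storeu_ps(out, p);
    printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]); // 30 10 40 20
    return 0;
}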

With AVX2, you can try to use gather instructions. For float data with 32-bit indices you can use the _mm_i32gather_ps or _mm256_i32gather_ps intrinsics. However, @PaulR writes here that they are no faster than trivial scalar loads.
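
A minimal sketch of that gather approach, assuming foo's results have been precomputed into an index array (the index values below are made up); compile with AVX2 enabled:

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    float B[8]   = {2, 3, 4, 5, 6, 2, 3, 4};
    int   idx[8] = {7, 0, 5, 2, 1, 6, 3, 4};   // hypothetical foo(j) values
    __m256i vidx = _mm256_loadu_si256((const __m256i *)idx);
    // Gather B[idx[0]] ... B[idx[7]]; the scale is 4 bytes per float.
    __m256 g = _mm256_i32gather_ps(B, vidx, 4);
    float out[8];
    _mm256_storeu_ps(out, g);
    for (int k = 0; k < 8; k++)
        printf("%g ", out[k]);
    printf("\n");
    return 0;
}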

Another solution may be possible with the _mm_shuffle_epi8 intrinsic from SSSE3. It is a great instruction that allows you to perform an in-register gather operation with the granularity of individual bytes. However, creating the shuffle mask is not a simple task. This paper (read sections 3.1 and 4) shows how to extend this approach to input arrays larger than one XMM register, but it seems that for 64 or more elements it is no longer better than scalar code.
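
A minimal sketch of that in-register byte gather, using a hypothetical permutation {3,1,2,0} of the four 32-bit elements; each output lane of the mask is built from four consecutive byte indices:

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    int src[4] = {100, 200, 300, 400};
    __m128i v = _mm_loadu_si128((const __m128i *)src);
    // To pick 32-bit element k, the mask needs bytes 4k, 4k+1, 4k+2, 4k+3.
    __m128i mask = _mm_setr_epi8(12, 13, 14, 15,   // element 3
                                  4,  5,  6,  7,   // element 1
                                  8,  9, 10, 11,   // element 2
                                  0,  1,  2,  3);  // element 0
    __m128i g = _mm_shuffle_epi8(v, mask);
    int out[4];
    _mm_storeu_si128((__m128i *)out, g);
    printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]); // 400 200 300 100
    return 0;
}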

stgatilov