4

Using intrinsics is a common way to SIMDize code. For example, I can perform a single add instruction on eight integers with _mm256_add_epi32. It needs two _mm256_load_si256 before the addition and one _mm256_store_si256 after it, as follows:

__m256i vec1 = _mm256_load_si256((__m256i *)&A[0]); // almost 5 cycles
__m256i vec2 = _mm256_load_si256((__m256i *)&B[0]); // almost 5 cycles
__m256i vec3 = _mm256_add_epi32( vec1 , vec2); // almost 1 cycle
_mm256_store_si256((__m256i *)&C[0], vec3); // almost 5 cycles

This performs the instructions on a single core of the CPU. My Core i7 has 8 logical cores (4 physical); I want to send the operations to all cores, like this:

int i_0, i_1, i_2, i_3, i_4, i_5, i_6, i_7 ; // These specify the starting indices in memory for each core
//core 0
__m256i vec1_0 = _mm256_load_si256((__m256i *)&A[i_0]);  
__m256i vec2_0 = _mm256_load_si256((__m256i *)&B[i_0]); 
__m256i vec3_0 = _mm256_add_epi32( vec1_0 , vec2_0); 
_mm256_store_si256((__m256i *)&C[i_0], vec3_0);

//core 1
__m256i vec1_1 = _mm256_load_si256((__m256i *)&A[i_1]);
__m256i vec2_1 = _mm256_load_si256((__m256i *)&B[i_1]);
__m256i vec3_1 = _mm256_add_epi32( vec1_1 , vec2_1);
_mm256_store_si256((__m256i *)&C[i_1], vec3_1);

//core 2
__m256i vec1_2 = _mm256_load_si256((__m256i *)&A[i_2]);
__m256i vec2_2 = _mm256_load_si256((__m256i *)&B[i_2]);
__m256i vec3_2 = _mm256_add_epi32( vec1_2 , vec2_2);
_mm256_store_si256((__m256i *)&C[i_2], vec3_2);

//core 3
__m256i vec1_3 = _mm256_load_si256((__m256i *)&A[i_3]);
__m256i vec2_3 = _mm256_load_si256((__m256i *)&B[i_3]);
__m256i vec3_3 = _mm256_add_epi32( vec1_3 , vec2_3);
_mm256_store_si256((__m256i *)&C[i_3], vec3_3);

//core 4
__m256i vec1_4 = _mm256_load_si256((__m256i *)&A[i_4]);
__m256i vec2_4 = _mm256_load_si256((__m256i *)&B[i_4]);
__m256i vec3_4 = _mm256_add_epi32( vec1_4 , vec2_4);
_mm256_store_si256((__m256i *)&C[i_4], vec3_4);

//core 5
__m256i vec1_5 = _mm256_load_si256((__m256i *)&A[i_5]);
__m256i vec2_5 = _mm256_load_si256((__m256i *)&B[i_5]);
__m256i vec3_5 = _mm256_add_epi32( vec1_5 , vec2_5);
_mm256_store_si256((__m256i *)&C[i_5], vec3_5);

//core 6
__m256i vec1_6 = _mm256_load_si256((__m256i *)&A[i_6]);
__m256i vec2_6 = _mm256_load_si256((__m256i *)&B[i_6]);
__m256i vec3_6 = _mm256_add_epi32( vec1_6 , vec2_6);
_mm256_store_si256((__m256i *)&C[i_6], vec3_6);

//core 7
__m256i vec1_7 = _mm256_load_si256((__m256i *)&A[i_7]);
__m256i vec2_7 = _mm256_load_si256((__m256i *)&B[i_7]);
__m256i vec3_7 = _mm256_add_epi32( vec1_7 , vec2_7);
_mm256_store_si256((__m256i *)&C[i_7], vec3_7);

POSIX threads are available, and OpenMP could be useful in this case too. But creating and maintaining threads takes far too much time compared to the roughly 5+5+1 cycles of this operation. The chunks of data are independent, so I don't need to synchronize access to shared memory. What is the fastest explicit method for implementing this operation?

I work on GPPs, so GPUs might not be the answer. I also want to implement a library, so a compiler-based solution might be a challenger. The problem is big enough for multi-threading. It's for my research, so I can change the problem to fit the concept. I want to implement a library and compare it with other solutions such as OpenMP, and hopefully my library will be faster than the current solutions. GCC 6.3/clang 3.8, Linux Mint, Skylake.

Thanks in advance.

Amiri
  • Unless this is in the bottom of a deep loop, within another loop called many many times, and you have a profile of this showing it to be a bottleneck, then any method you could choose will be the fastest. Please be careful when writing the fastest code without knowing if it is indeed an issue. If this is code that needs to go faster, then go ahead and experiment and see what the profiler says. I don't even know that the above code would use multiple CPUs as is - instead I believe it would queue them all up on the same core. – Michael Dorgan May 25 '17 at 22:07
  • @MichaelDorgan, Thanks, I changed the question to be more general. It's not in a loop, however, it can be. I'm implementing a multithread SIMD library for my applications and it's a simplified version of my problem. – Amiri May 25 '17 at 22:16
  • One issue is that all cores compete for the same memory and L3 and/or L4 cache. If the process is memory bandwidth limited with just 1 or 2 cores, using additional cores won't help. – rcgldr May 25 '17 at 22:39
  • Break down your problem into chunks. If you have a scattered memory access pattern when you split it onto many cores, you'll get little benefit from multi-threading. If instead you have a tight operation localized in memory, then sectioning it off onto many cores and operating on smaller blocks might help. You may need to add some prefetch instructions to help prepare for loads ahead of time. – paddy May 25 '17 at 22:41
  • Thanks all, I will manage memory issues. The add instruction is a simple operation. It might be a multiply, permute, etc. I'm looking for a way to send a single instruction to all cores, quickly and with less overhead. Or maybe `all cores - 1`. – Amiri May 25 '17 at 22:46
  • There is sadly no fast way to communicate between cores – harold May 25 '17 at 23:38
  • So sad, even assembly can not help? – Amiri May 25 '17 at 23:49
  • Basically you are asking for a wider SIMD, so that instead of 8 ADDs in parallel, you can do 32 or 64 etc. And you don't want the cost of creating and joining of threads too. Sounds like you could use a GPU here, most of new GPUs have wider ALU lanes, hence could easily achieve your objective – Isuru H May 26 '17 at 07:25
  • Splitting computation on multiple cores makes sense only when the problem is big enough. In your case it absolutely doesn't. So you need to use the thread functionality provided by the OS. If you want to reduce thread creation overhead, consider using thread pools. – Anty May 26 '17 at 12:30
  • *"But, creating and maintaining the threads take too much time compared to almost 5+5+1 cyles for this operation."* Right, exactly. You seem to have already answered your own question! Each core/processor operates using a different memory area, so moving the data at the beginning and folding the results together at the end will be much slower than just executing the instructions on the same core. There is no such thing as "multi-threaded assembly language". You do not directly control from assembly language which core the instructions are executed on; that's the job of the operating system. – Cody Gray - on strike May 26 '17 at 14:06
  • Thanks all, I work on GPPs, so GPUs might not be the answer. I also want to implement a library, so a compiler-based solution might be a challenger. The problem is big enough for multi-threading. It's for my research, so I can change the problem to fit the concept. I want to implement a library and compare it with other solutions such as OpenMP, and hopefully my library will be faster than the current solutions. So what do you think about using Pthreads? Could it be faster than OpenMP? Or is there any low-level tool for multithreading? – Amiri May 27 '17 at 00:01

2 Answers

6

If your problem is large, you must multithread.

You can choose either OpenMP or pthreads; they will give you similar performance levels (probably a bit better with pthreads, but that will be non-portable and way more complex to maintain).

Your code will be limited by bandwidth, absolutely not by compute.

In order to achieve maximum throughput, you need to interleave independent memory operations through multi-threading.

a very simple solution like

extern "C" void add(int* a, int* b, int* c, int N) {
    #pragma omp parallel for
    for(int i = 0; i < N; ++i) {
        a[i] = b[i] + c[i];
    }
}

will likely give you acceptable performance on all systems with every compiler.

As a matter of fact, letting the compiler optimize will likely give you good performance, and it certainly helps you write readable code.

But sometimes, even the best compiler doesn't give satisfactory results (always inspect your assembly on performance critical sections).

Compilers sometimes need help, and sometimes you need to write assembly on your own.

Here is the path I would follow to optimize this loop until I get the results I want.

First, there are classic optimization tricks you can implement:

  1. Constness and aliasing

Provide constness and prevent aliasing through the __restrict keyword:

extern "C" void add(int* __restrict a, const int* __restrict b, const int* __restrict c, int N) {
    #pragma omp parallel for
    for(int i = 0; i < N; ++i) {
        a[i] = b[i] + c[i];
    }
}

This will help the compiler, since it will know that a, b, and c can't alias each other.

  2. Alignment information:

Tell the compiler your pointers are properly aligned:

#define RESTRICT __restrict

typedef __attribute__((aligned(32))) int* intptr;

extern "C" void add(intptr RESTRICT a, const intptr RESTRICT b, const intptr RESTRICT c, int N) {
    #pragma omp parallel for
    for(int i = 0; i < N; ++i) {
        a[i] = b[i] + c[i];
    }
}

This will also help the compiler generate aligned load instructions (vmovdqa) instead of unaligned ones (vmovdqu).
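
One caveat worth adding (my note, not part of the original answer): the aligned typedef is only a promise to the compiler, so the buffers you actually pass in must really be 32-byte aligned. A minimal sketch of such an allocation, using posix_memalign (the helper name is mine):

#include <cstdlib>

// Hypothetical helper: allocate an n-element int buffer aligned to 32 bytes,
// so the aligned(32) promise made to the compiler actually holds.
int* alloc_aligned_ints(int n) {
    void* p = nullptr;
    if (posix_memalign(&p, 32, n * sizeof(int)) != 0) return nullptr;
    return static_cast<int*>(p);
}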

  3. Unroll inner loops (if you can):

If you know your problem size is a multiple of 256 bits, you can even unroll an inner loop:

#define RESTRICT __restrict

typedef __attribute__((aligned(32))) int* intptr;

extern "C" void add(intptr RESTRICT a, const intptr RESTRICT b, const intptr RESTRICT c, int N) {
    #pragma omp parallel for
    for(int i = 0; i < N; i += 8) {
        #pragma unroll
        for(int k = 0; k < 8; ++k)
        a[i+k] = b[i+k] + c[i+k];
    }
}

With that code, clang 4.0 gives quite neat assembly:

...
 vmovdqu ymm0, ymmword ptr [rdx + 4*rcx]
 vpaddd  ymm0, ymm0, ymmword ptr [rsi + 4*rcx]
 vmovdqu ymmword ptr [rdi + 4*rcx], ymm0
...

For some compilers, you may need to tweak your attributes and pragmas to get the same result.
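
For instance (a sketch of mine, not from the original answer), with GCC and clang you can convey the same alignment guarantee inside the function with the __builtin_assume_aligned builtin instead of the aligned typedef:

extern "C" void add(int* __restrict a, const int* __restrict b, const int* __restrict c, int N) {
    // tell the compiler the pointers are 32-byte aligned
    int* __restrict pa = (int*) __builtin_assume_aligned(a, 32);
    const int* __restrict pb = (const int*) __builtin_assume_aligned(b, 32);
    const int* __restrict pc = (const int*) __builtin_assume_aligned(c, 32);
    #pragma omp parallel for
    for(int i = 0; i < N; ++i) {
        pa[i] = pb[i] + pc[i];
    }
}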

  4. Intrinsics

If you want to be sure you get the right assembly, then you must go down to intrinsics / assembly.

Something simple like:

#define RESTRICT __restrict

typedef __attribute__((aligned(32))) int* intptr;

extern "C" void add(intptr RESTRICT a, const intptr RESTRICT b, const intptr RESTRICT c, int N) {
    #pragma omp parallel for
    for(int i = 0; i < N; i += 8) {
        __m256i vb = _mm256_load_si256((__m256i*) (b + i));
        __m256i vc = _mm256_load_si256((__m256i*) (c + i));
        _mm256_store_si256((__m256i*) (a + i), _mm256_add_epi32(vb, vc));
    }
}
  5. Non-temporal store:

As a final optimization, you can use a non-temporal hint on the store instruction, since no other iteration of the loop will read the value you just wrote:

typedef __attribute__((aligned(32))) int* intptr;
extern "C" void add(intptr RESTRICT a, const intptr RESTRICT b, const intptr RESTRICT c, int N) {
    #pragma omp parallel for
    for(int i = 0; i < N; i += 8) {
        __m256i vb = _mm256_load_si256((__m256i*) (b + i));
        __m256i vc = _mm256_load_si256((__m256i*) (c + i));
        _mm256_stream_si256((__m256i*) (a + i), _mm256_add_epi32(vb, vc));
    }
}

which gives you that assembly:

.L3:
        vmovdqa ymm0, YMMWORD PTR [rdx+rax]
        vpaddd  ymm0, ymm0, YMMWORD PTR [rsi+rax]
        vmovntdq        YMMWORD PTR [rdi+rax], ymm0
        add     rax, 32
        cmp     rcx, rax
    jne     .L3
    vzeroupper

If you are worried about the cmp instruction at every step, you can unroll more steps in your loop, but branch prediction does a pretty good job on modern processors.
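
For illustration, here is a sketch of mine (not from the original answer) that unrolls the intrinsic loop by two, amortizing the add/cmp/jne overhead over 16 integers per iteration. It assumes N is a multiple of 16 and that the pointers are 32-byte aligned:

#include <immintrin.h>

#define RESTRICT __restrict

typedef __attribute__((aligned(32))) int* intptr;

extern "C" void add_unrolled(intptr RESTRICT a, const intptr RESTRICT b, const intptr RESTRICT c, int N) {
    #pragma omp parallel for
    for(int i = 0; i < N; i += 16) {
        // two independent 8-integer adds per iteration
        __m256i vb0 = _mm256_load_si256((__m256i*) (b + i));
        __m256i vc0 = _mm256_load_si256((__m256i*) (c + i));
        __m256i vb1 = _mm256_load_si256((__m256i*) (b + i + 8));
        __m256i vc1 = _mm256_load_si256((__m256i*) (c + i + 8));
        _mm256_stream_si256((__m256i*) (a + i),     _mm256_add_epi32(vb0, vc0));
        _mm256_stream_si256((__m256i*) (a + i + 8), _mm256_add_epi32(vb1, vc1));
    }
}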

[EDIT: add pthread] As stated above, pthreads are a bit painful to manage... Here is a fully functional example with pthreads:

#include <pthread.h>
#include <cstdlib>
#include <cstdio>
#include <immintrin.h>

typedef struct AddStruct {
    int *a, *b, *c;
    int N;
} AddStruct_t;

void* add(void* s);

int main() {
    const int N = 1024*1024*32; // out of cache
    int *a, *b, *c;
    int err;
    err = posix_memalign((void**) &a, 32, N*sizeof(int));
    err = posix_memalign((void**) &b, 32, N*sizeof(int));
    err = posix_memalign((void**) &c, 32, N*sizeof(int));
    for(int i = 0; i < N; ++i) {
        a[i] = 0;
        b[i] = 1;
        c[i] = i;
    }
    int slice = N / 8;
    pthread_t threads[8];
    AddStruct_t arguments[8];
    for(int i = 0; i < 8; ++i) {
        arguments[i].a = a + slice * i;
        arguments[i].b = b + slice * i;
        arguments[i].c = c + slice * i;
        arguments[i].N = slice;
    }

    for(int i = 0; i < 8; ++i) {
        if(pthread_create(&threads[i], NULL, add, &arguments[i])) {
            fprintf(stderr, "ERROR CREATING THREAD %d\n", i);
            abort();
        }
    }

    for(int i = 0; i < 8; ++i) {
        pthread_join(threads[i], NULL);
    }

    for(int i = 0; i < N; ++i) {
        if(a[i] != i + 1) {
            fprintf(stderr, "ERROR AT %d: expected %d, actual %d\n", i, i+1, a[i]);
            abort();
        }
    }

    fprintf(stdout, "OK\n");
}

void* add(void* v) {
    AddStruct_t* s = (AddStruct_t*) v;
    // each thread streams through its own slice, so no synchronization is needed
    for(int i = 0; i < s->N; i += 8) {
        __m256i vb = _mm256_load_si256((__m256i*) (s->b + i));
        __m256i vc = _mm256_load_si256((__m256i*) (s->c + i));
        _mm256_stream_si256((__m256i*) (s->a + i), _mm256_add_epi32(vb, vc));
    }
    _mm_sfence(); // flush the non-temporal stores before the thread exits
    return NULL;
}

This code achieves 34 GB/s on my Xeon E5-1620 v3 with DDR4 memory @ 2133 MHz, while the simple solution at the beginning runs at 33 GB/s.

All that effort to save 3% :). But sometimes those 3% can be critical.

Please note that the memory initialization should be performed by the same core that will later perform the compute (particularly true for NUMA systems), to avoid page migrations.
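
A minimal sketch of what that looks like in practice (my addition, not from the original answer): initialize the arrays in a parallel loop with the same static schedule as the compute loop, so the first touch places each page on the NUMA node of the core that will later work on it:

extern "C" void first_touch_init(int* __restrict a, int* __restrict b, int* __restrict c, int N) {
    #pragma omp parallel for schedule(static)
    for(int i = 0; i < N; ++i) {
        a[i] = 0;   // pages of a, b and c are faulted in by the thread
        b[i] = 1;   // that will process this index range in add()
        c[i] = i;
    }
}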

Regis Portalez
  • Thanks, a very good answer. Could you add some explicit multithreading too? Portability is not my concern since I perform the codes on my Skylake. I used `pthreads` but my solution wasn't good. Indeed, I might change the compiler and want the library works independently. – Amiri May 30 '17 at 19:08
  • I added an example with pthread; however, I would really consider OpenMP, for code readability. Anyway, do your performance measurements and pick the multithreading library you prefer. – Regis Portalez May 31 '17 at 08:03
  • Your simple solution at the start of your answer is all that is needed. For small N you don't want to use multi-threading, and for large N it's memory bandwidth bound so unrolling, alignment, restrict are effectively useless. – Z boson May 31 '17 at 08:48
  • Incidentally you can use `for simd aligned(a,b,c:32)` (I might not have the syntax correct) for alignment. – Z boson May 31 '17 at 08:49
  • GCC's `-funroll-loops` optimizes worse with OpenMP than without OpenMP (I discovered this by moving a function to a separate object file and it got faster). Unrolling is something you still have to do by hand with GCC to get good results. Again, though, these optimizations are only useful in cases where the operation is not memory bandwidth bound. The OP's question is too broad. – Z boson May 31 '17 at 08:51
  • @Zboson: agreed, I added performance measurements. The "optimized" solution saves 3% over simple openmp. But that can be important to the OP – Regis Portalez May 31 '17 at 08:53
  • This is probably pthread_create/join, which is a bit faster than _kmpc_fork_call/_kmpc_static_fini. But the optimization tricks I show are valid for other types of loops. – Regis Portalez May 31 '17 at 09:00
  • Actually, I read your answer (already upvoted). Your non-temporal store solution is the only one which is better than your simple solution. I'm actually surprised you don't see more of an improvement than 3% using non-temporal stores. – Z boson May 31 '17 at 09:00
  • But I don't think there is any point in using pthreads. OpenMP with non-temporal stores should be sufficient. – Z boson May 31 '17 at 09:04
  • Also, you don't need AVX. SSE is sufficient (and maybe better). Here is the solution I think would be best (when it's memory bandwidth bound) https://godbolt.org/g/AEYR5z – Z boson May 31 '17 at 09:11
  • I just think that `#pragma omp parallel for` is only for multi-threading and `#pragma omp simd` is for SIMDization! Am I wrong? If I am right, how about this answer? – Amiri Jul 27 '17 at 07:01
2

The fastest method for implementation is:

void add_ints(int *vec1, int *vec2, int *vec3, int n){
    int i;
    #pragma simd
    for (i=0; i<n; i++){
        vec3[i] = vec1[i] + vec2[i];
    }
}

Whether "rolling your own" is faster deserves some investigation. But "rolling your own" can be more prone to error... which makes it slower to implement.

For these simple problems one would expect that the compiler writers are sophisticated enough to understand the fastest solutions for simple problems, and often they even do well finding the fastest solution for complicated problems... And the use of the #pragma helps them.

Secondly, I rarely find cases where 'SIMD parallel' works faster on IO-driven problems such as this one, when compared to straight 'SIMD' on a single core.
I routinely achieve just under 1600 MB/second of throughput, which on 1600 memory seems pretty good.
Unless a GPU has higher IO bandwidth than 1600 MB/sec, you may be better off on a single host core, and use the GPU when more math per unit of IO is needed.

However, you can and should try it and see for yourself. (Yes... the following example is from the ICC website.)

#pragma omp parallel for simd schedule(static,10)
for (i=0; i<N; i++) { vec3[i] = vec1[i] + vec2[i]; }

After you have the easy way working, you can get some measurements of how much better the "roll your own" version performs over the compiler with -O3, using both single and multiple cores.
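
A rough sketch of such a measurement (the function and variable names here are mine, not from the answer): time the best of several repetitions with omp_get_wtime and convert it into an effective bandwidth figure.

#include <stdio.h>
#include <omp.h>

/* Time one variant of the add (e.g. add_ints above) and report effective bandwidth,
   counting two input streams and one output stream of n ints. */
double time_add(void (*fn)(int*, int*, int*, int), int *v1, int *v2, int *v3, int n, int reps) {
    double best = 1e30;
    for (int r = 0; r < reps; ++r) {
        double t0 = omp_get_wtime();
        fn(v1, v2, v3, n);
        double t1 = omp_get_wtime();
        if (t1 - t0 < best) best = t1 - t0;
    }
    printf("best: %.3f ms, ~%.2f GB/s\n", best * 1e3, 3.0 * n * sizeof(int) / best / 1e9);
    return best;
}

Called, for example, as time_add(add_ints, vec1, vec2, vec3, n, 10) for both the single-core and the multi-core builds.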

Another choice to consider for vectors is Cilk Plus. This is especially true when one is coming from a MATLAB or Fortran background, as the vector and matrix/array constructs are very similar.
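
For illustration (a sketch of mine, assuming a compiler with Cilk Plus support, e.g. ICC or GCC built with -fcilkplus), the whole addition collapses into a single array-notation statement:

/* Cilk Plus array notation: the section syntax is array[start:length] */
void add_ints_cilk(int *vec1, int *vec2, int *vec3, int n) {
    vec3[0:n] = vec1[0:n] + vec2[0:n];
}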

Basically, SIMD intrinsics were 'in vogue' early on; once compilers and OpenMP got vectorization into their internals, the use of intrinsics is better reserved for cases where the compiler is not able to produce vectorized machine code for you.

Holmz
  • The only compiler I know of that supports `#pragma simd` is ICC, which is not what the person asking this question is using or hoping to target. It should also be completely unnecessary. If you use the appropriate compiler switches, the optimizer will make the correct decision without needing non-portable `#pragma`s scattered throughout the code base. – Cody Gray - on strike May 26 '17 at 14:09
  • #pragma omp simd is supported by GCC starting from versions ~4.9-5.1. So it is supported by icc, ifort and gcc as well as some llvm-based ones. The only x86 compiler I know, which completely does not support it, is MS compiler. – zam May 26 '17 at 20:08
  • #pragmas are easy for the compiler to ignore, and there may be GCC versions of the switches? Whether the pragmas are even needed or not requires one to look at a vectorization report. The OP ended the question with "easiest to implement", and the pragmas are so easy even I can do it. But if the compiler optimises the code then no pragmas are needed. Doing assembler-type loads seems harder to implement. The OP has a solution looking for a problem, and a custom library should be put together on solid facts, hence trying many ways is justified. – Holmz May 27 '17 at 02:08
  • Thanks, your solution is awesome, but I want to implement it explicitly. – Amiri May 30 '17 at 19:14
  • That makes sense @FackedDeveloper Ideally you can post back as to how much faster it works. – Holmz May 31 '17 at 08:48