
I got lost reading about SIMD and how OpenMP relates to vectorization, so I would like to ask someone to clarify it for me. Specifically, I have a part of a C++ program I want to parallelize, but I am pretty stuck at the moment and can't figure it out on my own. Any help explaining what exactly vectorization is and how I can use it on the following piece of code would be greatly appreciated!

for(unsigned short i=1; i<=N_a; i++) {
    for(unsigned short j=1; j<=N_b; j++) {
        temp[0] = H[i-1][j-1]+similarity_score(seq_a[i-1],seq_b[j-1]);
        temp[1] = H[i-1][j]-delta;
        temp[2] = H[i][j-1]-delta;
        temp[3] = 0.;
        H[i][j] = find_array_max(temp, 4);   // presumably also sets ind to the index of the max
        switch(ind) {
        case 0:                                  // score in (i,j) stems from a match/mismatch
            I_i[i][j] = i-1;
            I_j[i][j] = j-1;
            break;
        case 1:                                  // score in (i,j) stems from a deletion in sequence A
            I_i[i][j] = i-1;
            I_j[i][j] = j;
            break;
        case 2:                                  // score in (i,j) stems from a deletion in sequence B
            I_i[i][j] = i;
            I_j[i][j] = j-1;
            break;
        case 3:                                  // (i,j) is the beginning of a subsequence
            I_i[i][j] = i;
            I_j[i][j] = j;
            break;
        }
    }
}

Regards!

Mark Rotteveel
Sarriman
  • SIMD on x86 is all about loading 16B (or 32B) of contiguous data, and doing e.g. four `float` `add`s in parallel, or two `double`, or integers of various widths. Or shuffling / blending / packed-compare to get a blend-mask / ... – Peter Cordes Jun 04 '16 at 14:25
  • `H[i][j]` depending on `H[i-1][j-1]`, `H[i-1][j]` and `H[i][j-1]`, there's no direct way of either vectorising or parallelising the loops in `i` or `j`. Well, you can probably make the compiler vectorise and/or parallelise them using `#pragma omp simd` and `#pragma omp parallel for`, but the computed result will be wrong. – Gilles Jun 04 '16 at 16:35

1 Answer


So ind is constant for both nested loops?

You might get a compiler to auto-vectorize this for you with OpenMP. (Put the line `#pragma omp simd` right before either of your for loops, and see if that affects the asm when you compile with `-O3`. With gcc you also need `-fopenmp` or `-fopenmp-simd` for the pragma to take effect; I don't know OpenMP well enough to say whether you need other options.)
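To show where the pragma goes, here is a minimal, self-contained sketch on a loop whose iterations are fully independent (not your exact loop, which has the dependencies discussed in the comments; the function and its names are mine, not from the question):

```cpp
#include <cstddef>

// Minimal sketch: every iteration is independent, so the simd
// pragma is safe here. Compile with e.g. g++ -O3 -fopenmp-simd.
void scale(float *out, const float *in, float k, std::size_t n) {
    #pragma omp simd
    for (std::size_t i = 0; i < n; ++i)
        out[i] = in[i] * k;
}
```

Without `-fopenmp`/`-fopenmp-simd` the pragma is simply ignored and the loop stays scalar (gcc may still auto-vectorize it at `-O3` anyway).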

Wrap it in a function that actually compiles, so I can see what happens. (e.g. by putting the code on http://gcc.godbolt.org/ to get nicely formatted asm output).

If it doesn't auto-vectorize, it's probably not too hard to manually vectorize with Intel intrinsics for x86, since you're just initializing some arrays with the array index. Keep a vector of loop counters, starting with `__m128i jvec = _mm_set_epi32(3, 2, 1, 0);`, and increment it with `_mm_add_epi32()` (the integer add, not the float `_mm_add_ps()`) by a vector of `[ 4 4 4 4 ]` (`_mm_set1_epi32(4)`) to increment every element by 4.

Keep a separate vector of i values, which you only modify in the outer loop, but still store in the inner loop.
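A rough sketch of those two paragraphs, assuming 32-bit `int` indices and a row length that's a multiple of 4; the `fill_row` helper and its names are hypothetical, not from the question:

```cpp
#include <immintrin.h>

// Hypothetical sketch: fill one row with 0, 1, 2, ..., n-1
// (e.g. the j-1 values for the match/mismatch case), four at a time.
// Assumes n is a multiple of 4 and SSE2 is available.
void fill_row(int *row, int n) {
    __m128i jvec = _mm_set_epi32(3, 2, 1, 0);   // lowest element first
    const __m128i four = _mm_set1_epi32(4);     // [ 4 4 4 4 ]
    for (int j = 0; j < n; j += 4) {
        _mm_storeu_si128((__m128i*)(row + j), jvec);
        jvec = _mm_add_epi32(jvec, four);       // integer add per element
    }
}
```

For the outer-loop `i` values you would broadcast once with `_mm_set1_epi32(i)` and reuse that vector for every store in the inner loop.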


See the tag wiki for instruction-set stuff.

See the tag wiki for some SIMD guides, including this nice intro to SIMD and what it's all about.

Peter Cordes
  • But how do i know that the output is vectorized correctly and without any dependencies? – Sarriman Jun 04 '16 at 15:12
  • @MrDiaman: By looking at the asm output. Or depending on the compiler, there are options to report on success/failure of autovectorization. What kind of dependencies are you worried about? – Peter Cordes Jun 04 '16 at 15:19
  • I need the data to be accurate, so we are talking about data dependencies. I don't know how much this can help, but here: https://github.com/MrDiaman/Proj/pull/1/files – Sarriman Jun 04 '16 at 15:33
  • @MrDiaman: Compilers will only auto-vectorize when they can make code that always gives the same results that the C source should give. (Unless you use `-ffast-math`, in which case it's allowed to assume FP math is associative, even though that can affect the rounding.) Any other behaviour would be a compiler bug. – Peter Cordes Jun 04 '16 at 15:38
  • Your code uses `#pragma omp parallel` on simple initialization loops. If your arrays are gigantic, that could make sense if you use it consistently, so the same thread touches the same part of the array every time (so it might be hot in cache). I think you can use `simd` and `parallel` at the same time with OpenMP, but [this question](http://stackoverflow.com/questions/14674049/parallel-for-vs-omp-simd-when-to-use-each) doesn't say anything about that :/ – Peter Cordes Jun 04 '16 at 15:43
  • The arrays are indeed gigantic. And my focus is only on the part of the code I mentioned above! That is why I said nothing about the rest of it. By the way, as far as I can tell, using `parallel` and `simd` together is not possible with OpenMP – Sarriman Jun 04 '16 at 15:47
  • @MrDiaman: Have you tried using `simd` on inner loops, and `parallel` on outer loops? That might work well. – Peter Cordes Jun 04 '16 at 15:50
  • @PeterCordes what about the dependencies of `H`in `i` and `j`? Neither parallelisation nor vectorisation of either of the two loops seems safe for me... – Gilles Jun 04 '16 at 16:38
  • There are no dependencies there...just the fact that there is a switch statement, which is not vectorizable! – Sarriman Jun 04 '16 at 18:28
  • @MrDiaman: The switch is trivially vectorizable by cloning the loop, since `ind` doesn't change inside the loop. So in the asm, you switch once to select which loop to run. – Peter Cordes Jun 04 '16 at 19:05
  • You keep referring to asm. Since recent GNU compilers, there is no need to write SIMD code in asm, since the intrinsics are translated one by one into asm (at least that's what I have been taught!)... so why do you keep referring to asm? – Sarriman Jun 04 '16 at 19:10
  • @MrDiaman: asm is the compiler output. If you want to see how well the compiler optimized, you read the assembly or run microbenchmarks. When you're trying to tune C to run fast, what you're really trying to do is give the compiler input that it will turn into efficient asm. If you know asm, it's natural to think about what the compiler is going to do. Reading the asm output is a good way to see what the compiler is failing to optimize, so you're not just blindly making changes in the C source. – Peter Cordes Jun 04 '16 at 19:14
  • @Gilles: oh crap, I didn't even notice there was another array being initialized at the same time. You're right, there's a data dependency on the previous `j` for `H[i][j]`, so that doesn't look vectorizable without some horizontal-add and horizontal max trickery (maybe like what you can do for [SIMD prefix sums](http://stackoverflow.com/questions/19494114/parallel-prefix-cumulative-sum-with-sse)) – Peter Cordes Jun 04 '16 at 19:14
  • So I just have to add a barrier or something? – Sarriman Jun 04 '16 at 19:18
  • @MrDiaman: What, for `H[][]`? No, it's probably easiest to keep that part scalar. How often do you have to initialize this giant array, anyway? If it's often, maybe store it in a way that pre-computes something? Or else just have that part scalar, so you have a mix of work to keep memory and CPU busy doing vector stores to initialize `I_i` and `I_j` overlapping with the scalar `H[]` stuff. – Peter Cordes Jun 04 '16 at 19:24
  • Optimizing your whole program is a way bigger task than can fit in one SO question. Even optimizing this one loop depends on the rest of your program, since there's not much you can do by just changing the loop on its own. I guess maybe some manual vectorization with intrinsics like I suggested in my last comment. If you want to pay me for it, I'd love to take a look and see what I can do. (I have experience optimizing phylogenetics programs, from working with the group at Dalhousie University.) – Peter Cordes Jun 04 '16 at 19:27
  • I am just into this project to make some knowledge out of it! Thank you for the help so far though its been a great deal of help for me! :) – Sarriman Jun 04 '16 at 19:32
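To make the loop-cloning idea from the comments concrete, here is a hedged sketch: assuming `ind` really is fixed before the loops run, you branch on it once, and each specialized inner loop becomes a plain index fill with no switch inside, which a compiler can auto-vectorize. The function name and row-pointer parameters are hypothetical, not from the question:

```cpp
// Hypothetical sketch of "cloning the loop": branch on ind once,
// outside the inner loop, so each clone is a simple fill with no
// per-element switch. Assumes ind is constant for the whole nest.
void fill_index_row(int *I_i_row, int *I_j_row, int i, int n, int ind) {
    switch (ind) {
    case 0:  // match/mismatch: (i-1, j-1)
        for (int j = 1; j <= n; ++j) { I_i_row[j] = i - 1; I_j_row[j] = j - 1; }
        break;
    case 1:  // deletion in sequence A: (i-1, j)
        for (int j = 1; j <= n; ++j) { I_i_row[j] = i - 1; I_j_row[j] = j; }
        break;
    case 2:  // deletion in sequence B: (i, j-1)
        for (int j = 1; j <= n; ++j) { I_i_row[j] = i; I_j_row[j] = j - 1; }
        break;
    default: // beginning of a subsequence: (i, j)
        for (int j = 1; j <= n; ++j) { I_i_row[j] = i; I_j_row[j] = j; }
        break;
    }
}
```

Note this only covers the `I_i`/`I_j` stores; per the comment thread, the `H[i][j]` recurrence has a loop-carried dependency and would stay scalar.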