Increasing array index in OpenMP

Question

I am new to using OpenMP. I am trying to parallelize a nested loop, and so far I have something of this form...

#pragma omp parallel for
for (j = 0; j < m; j++) {
    some work;
    for (i = 0; i < n; i++) {
        p = b[i];
        if (p < 0 && k < m) {
            a[k] = c[i];  /* k is shared: every thread reads and writes it */
            k++;
        } else {
            x = c[i];
        }
    }
    some work;
}

The outer loop runs in parallel, and the inner loop updates k. Each thread needs the current value of k in order to update a[k] correctly. The problem is that all of the threads are updating a[k], but the proper order of k is not kept.

Some threads will update k and a[k], and some will not. How do I communicate the latest k between threads to update a[k] properly, since c[i] will have different values for each thread?

For example, when it runs serially, the program might set the first seven values of a to {1,3,5,7,3,9,13} and terminate with k equal to 7, but when run in parallel, it produces different values, or the same values in a different (and therefore wrong) order.

How do I keep the same order and ensure parallelism at the same time?

There are data dependencies among your iterations. The computation is not parallelizable in its current form. It's also not clear why you need to compute `k` inside the loop. There is a simple closed-form solution for the final value. — John Bollinger, Mar 08 '22 at 21:56

score 4 · Answer 1 · edited Mar 12 '22 at 13:27

Note: this answer was completely rewritten in light of OP clarifications. The original answer text is at the end.

> How do I keep the same order and ensure parallelism at the same time?

Order dependency is antithetical to parallelism, as running operations in parallel inherently entails relaxing the relative order in which they are performed. Not all computations can be effectively parallelized.

Your case is not an exception. The second and each subsequent iteration of your outer loop needs to use the final value of k (among other things) computed by the previous iteration. How can it get that? Only by performing the previous iteration first. What room does that leave for concurrent operation? None. Concurrency is not the same thing as parallelism, but it is one of the main motivations for parallelism, because that's how parallelism yields improvements in elapsed time.

With no scope for concurrency, parallelism is actively counterproductive for you. Suppose you made the whole body of the outer loop a critical section, so that there was no concurrency in fact (as your present code requires) and no data races involving k. Then you would still pay the overhead for parallelism, get no speedup in return, and probably still get the wrong results because of evaluations of the outer-loop body being performed in the wrong order.

It may be that the whole thing can be rewritten to reduce or remove the data dependencies that prevent effective parallelization of the computation, or it may not. We haven't enough information to determine which, as it depends in part on the details of "some work" and on the significance of the data. Probably you would need an altogether different algorithm for producing the desired results.

~~> Instead of giving a[n]={0,1,2,3,.......n} , it gives me garbage values for a when I use the reduction clause. I need the total sum of K, hence the reduction clause.~~

There is a closed-form equation for the sum of consecutive integers, and it has especially simple form when the first integer in the list is 0 or 1. In particular, the sum of the integers from 0 to n, inclusive, is n * (n + 1) / 2. You do not need a reduction for this.
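As a quick sanity check of that closed form, here is a minimal C sketch that compares it against a plain accumulation loop. The function names sum_0_to_n and sum_0_to_n_loop are hypothetical and exist only for this illustration.

```c
#include <assert.h>

/* Closed-form sum of the integers 0..n inclusive (hypothetical helper). */
long sum_0_to_n(long n)
{
    assert(n >= 0);
    return n * (n + 1) / 2;
}

/* Reference version: the same sum accumulated one term at a time. */
long sum_0_to_n_loop(long n)
{
    long k = 0;
    for (long i = 0; i <= n; i++)
        k += i;
    return k;
}
```

Both functions return 45 for n = 9, which is the sum of 0 through 9.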

If you wanted to use a reduction anyway, then you need to understand that it doesn't work the way you seem to think it does. What you get is a separate, private copy of the reduction variable for each thread executing the parallel construct, with the per-thread (not per-iteration) final values of those independent variables combined according to the reduction operator. Thus, if you really want to do the computation via an OpenMP reduction, then you would need to restructure the loop something like this:

#pragma omp parallel for reduction (+:k)
for (i = 0; i < 10; i++) {
    a[i] = i;
    k += i;
}

That assumes that the value of k is 0 immediately prior to the loop, as indeed seems to be the case in your code. If that were not a safe assumption then you would need something like

type_of_k k0 = k;
k = 0;
#pragma omp parallel for reduction (+:k)
for (i = 0; i < 10; i++) {
    a[k0 + i] = i;
    k += k0 + i;
}

Note that in either case, not only does that set up the reduction correctly, but it also breaks the data dependency between loop iterations that was previously carried by the expression k++.

I'm not sure your code is correct regarding `k`. To me, you should keep `k++;`, as `k+=i;` will give a very different result. The former will give `k0+n` at the end, while the latter will give `k0+n*(n+1)/2` or even `n*k0+n*(n+1)/2`, which I doubt is the expected result. Actually, just setting `k+=n;` outside of the loop would be a much better solution. — Gilles, Mar 09 '22 at 08:32
@Gilles, as this answer already expresses, I have interpreted the OP to want the final value of `k` to be the sum of the values that `k` takes in each iteration of their original loop, when that loop runs serially. So yes, that's a different result than the serial version of the OP's loop produces. I read the question to be saying that this difference is exactly the reason for introducing the reduction -- notwithstanding the fact that the reduction doesn't actually do what the OP wants. — John Bollinger, Mar 09 '22 at 16:38
Actually, the computation of a[k] and k++ depends on a condition. So the original code is k=0l for (i=0; i — S2022, Mar 09 '22 at 18:52
@S2022, that is essential information that belonged in the question to begin with. As a rule, we disapprove of changing a question that already has answers in a way that invalidates any of the answers, but in this case I will not object to you clarifying the question. Important things to add: (1) what you said in your comment above; (2) some comments about the nature of the condition involved -- especially about which data it depends upon; (3) a clearer explanation of what you expect the final value of `k` to be -- ideally including an example. — John Bollinger, Mar 09 '22 at 19:14
Sorry for the miscommunication. The code looks like this #pragma omp parallel for for (j=0;j — S2022, Mar 10 '22 at 01:11
Some threads will update k and a[k], and some will not. How do I communicate the latest k between threads to update a[k] properly, since c[i] will have different values for each thread? When computed serially, let's say a[k]={1,3,5,7,3,9,13}, k=7, but when done in parallel, it shows a[k]={9,13,1,3,5,7,3} due to the lack of coordination of k between threads. How do I keep the same order and ensure parallelism at the same time? — S2022, Mar 10 '22 at 01:17
@S2022, by "clarify[] the question", I meant to *[edit it](https://stackoverflow.com/posts/71401866/edit)* to clarify. I will take care of that for you this time. — John Bollinger, Mar 10 '22 at 14:10

Yann Vernier · Answer 2 · 2022-03-12T11:11:00.753

It sounds like you're essentially filling in a with a filter of entries from c, and want to preserve their order. If this is the only use k has, some other methods spring to mind:

1. Always write a[i], but store a marker value wherever the predicate (p < 0) wasn't satisfied. This preserves order, but requires a larger a, which you can compact in a second pass.
2. Write an a_i array storing which index each entry belonged to. This still requires an atomic capture of k (#pragma omp atomic capture around k_local = k++), and a second pass to sort the entries back into order. You would also need both a and a_i to be full size again, or you might miss entries, so all in all it is a terrible workaround.
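The first method above might be sketched roughly as follows. This is a simplified illustration, not the asker's actual loop: it assumes a single filter pass over i, that the predicate is just b[i] < 0, that a has room for up to n entries, and that the sentinel UNUSED_MARK never occurs in c. The names mark_pass, compact_pass, tmp and UNUSED_MARK are all hypothetical.

```c
#include <assert.h>

/* Hypothetical sentinel: assumes this value never appears in c[]. */
#define UNUSED_MARK (-1)

/* Pass 1 (parallel): each iteration writes only its own slot tmp[i],
 * so there is no shared counter and no ordering problem. */
void mark_pass(const int *b, const int *c, int *tmp, int n)
{
    assert(n >= 0);
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        tmp[i] = (b[i] < 0) ? c[i] : UNUSED_MARK;
}

/* Pass 2 (serial): copy the kept values to the front of a[] in their
 * original order and return how many were kept (the final k). */
int compact_pass(const int *tmp, int *a, int n)
{
    assert(n >= 0);
    int k = 0;
    for (int i = 0; i < n; i++)
        if (tmp[i] != UNUSED_MARK)
            a[k++] = tmp[i];
    return k;
}
```

The compaction pass is still sequential, so this only pays off if evaluating the predicate (or the surrounding work) dominates the run time.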

Even with some sequential dependencies you can do optimizations; e.g. a scan to calculate what k would be for each i can be done in O(log n) parallel steps rather than O(n) sequential ones, given enough threads. See, e.g., parallel prefix sum, and the OpenMP discussion of it on Stack Overflow. This sort of thing is what OpenMP's ordered depend is for, I believe. Anyhow, this leads to the third solution:

3. Generate a k array holding the value k will have at each iteration, so that each thread that does write writes to the correct place. This requires scanning the predicate.
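A minimal sketch of this third approach, under the same simplifying assumptions as before (a single pass over i, predicate b[i] < 0, a large enough for every kept entry). The function name filter_with_scan and the scratch array offset are hypothetical. For brevity the prefix sum in step 2 is written serially; it is the part that a parallel scan would replace.

```c
#include <assert.h>

/* Copies c[i] into a[] for every i with b[i] < 0, preserving order.
 * offset[] is caller-provided scratch space of n ints.
 * Returns the number of entries written (the final k). */
int filter_with_scan(const int *b, const int *c, int *a, int *offset, int n)
{
    assert(n >= 0);

    /* Step 1 (parallel): evaluate the predicate; 1 = keep, 0 = skip. */
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        offset[i] = (b[i] < 0) ? 1 : 0;

    /* Step 2 (serial in this sketch): exclusive prefix sum, so that
     * offset[i] becomes the value k would have had at iteration i. */
    int k = 0;
    for (int i = 0; i < n; i++) {
        int keep = offset[i];
        offset[i] = k;
        k += keep;
    }

    /* Step 3 (parallel): each kept element already knows its slot,
     * so the writes are independent and land in serial order. */
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        if (b[i] < 0)
            a[offset[i]] = c[i];

    return k;
}
```

With b = {-1, 2, -3, 4, -5} and c = {10, 20, 30, 40, 50}, this writes {10, 30, 50} to a and returns 3, regardless of how the threads are scheduled.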

It is useful to have higher-level constructs like map, scan and reduce in mind when planning out algorithms.
