Parallel for loop for addition of local matrices in OpenMP

Question

I have n local copies of a matrix, say 'local', in n threads. I want to update a global shared matrix 's' so that each of its elements is the sum of the corresponding elements of all the local matrices. For example, s[0][0] = local_1[0][0] + local_2[0][0] + ... + local_n[0][0].

I wrote the following loop to achieve it -

#pragma omp parallel for
for(int i=0;i<rows;i++)
{   
    for(int j=0;j<cols;j++)
        s[i][j]=s[i][j]+local[i][j];
}

This doesn't seem to work. Could someone kindly point out where I am going wrong?

Updated with example -

Suppose there are 3 threads, with following local matrices -


thread 1
local =  1  2
         3  4

thread 2 
local =  5  6 
         7  8

thread 3 
local =  1  0
         0  1

shared matrix would then be 

s     =  7  8 
        10 13

What does it do? How are the variables defined? How should the result look like? Prepare a full example. — Vladimir F Героям слава, Jan 28 '15 at 21:21
I have updated the question with an example. Like I have mentioned, variable 'local' is local to threads and variable 's' is shared. — CodeEnthusiast, Jan 29 '15 at 16:54
What does it do, show your result! Add the declarations to the code. Read http://stackoverflow.com/help/mcve and http://stackoverflow.com/help/how-to-ask otherwise you risk that your question will be closed and deleted. Never use "it doesn't work" in a good question, always explain what it does instead. — Vladimir F Героям слава, Jan 29 '15 at 17:14
I don't see any `private` in your code, how do you assure it is local? — Vladimir F Героям слава, Jan 29 '15 at 17:16
I am actually computing covariance matrix and each thread holds a separate block of data. The process of adding up local copies to obtain the final shared matrix is the final step. I abstracted out the details to make the question simpler. I have declared 'local' as private and 's' as shared in #pragma omp parallel directive (not shown in the question). I did not paste entire code as it is very bulky. Thanks — CodeEnthusiast, Jan 29 '15 at 19:28

score 0 · Accepted Answer · edited May 23 '17 at 11:57

Throughout this answer I'm assuming that you have correctly created a private version of local on each thread, as your question and example (but not your code snippet) indicate.

As you've written the code, the variable i is private, that is, each thread has its own copy. Since it's the iteration variable for the outermost loop, each thread will get its own set of values to work on. Supposing that you have 3 threads and 3 rows, thread 0 will get i value 0, thread 1 will get 1, and so on. Obviously (or not), with more rows to iterate over, each thread would get more i values to work on. In all cases each thread will get a disjoint subset of the set of all values that i takes.

However, if thread 0 gets only i==0 to work on, the computation

s[i][j]=s[i][j]+local[i][j];

will only ever work on the 0-th row of local on thread 0. In the example I'm using, i never equals 1 on thread 0, so the values in row 1 of local on thread 0 never get added to row 1 of s.

Between them the 3 threads will update the 3 rows of s but each will only add its own row of its own version of local.

As for how to do what you want to do, have a look at this question and the accepted answer. You are attempting an array reduction which, for reasons explained here, was not directly supported in C or C++ at the time of writing (OpenMP 4.5 later added reductions over array sections).
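One common way to write such an array reduction by hand is sketched below. This is only an illustration under stated assumptions, not the asker's actual code: the function name sum_blocks, the fixed 2x2 size (ROWS, COLS) and the input array blocks are all hypothetical. Each thread accumulates its share of the input into a private local, and then every thread walks all rows and columns of its own local while adding it to the shared s, one thread at a time.

```c
#include <assert.h>
#include <string.h>

#define ROWS 2   /* assumed fixed size, for illustration only */
#define COLS 2

/* Hypothetical helper: adds nblocks input matrices into the shared matrix s.
   Each thread first sums its share of the blocks into a private `local`,
   then merges that private copy into `s`. */
void sum_blocks(int nblocks, double blocks[][ROWS][COLS], double s[ROWS][COLS])
{
    assert(nblocks >= 0);

    #pragma omp parallel
    {
        /* Declared inside the parallel region, so each thread has its own. */
        double local[ROWS][COLS];
        memset(local, 0, sizeof local);

        /* The blocks are divided among the threads. */
        #pragma omp for
        for (int b = 0; b < nblocks; b++)
            for (int i = 0; i < ROWS; i++)
                for (int j = 0; j < COLS; j++)
                    local[i][j] += blocks[b][i][j];

        /* NOT a worksharing loop: every thread visits ALL rows of its own
           local. The critical construct prevents concurrent updates of s. */
        #pragma omp critical
        for (int i = 0; i < ROWS; i++)
            for (int j = 0; j < COLS; j++)
                s[i][j] += local[i][j];
    }
}
```

Called with the three matrices from the question as blocks and s initialised to zero, this leaves s = 7 8 / 10 13, whatever the number of threads. With GCC, OpenMP is enabled with -fopenmp; without it the pragmas are ignored and the code runs serially with the same result.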

score 0 · Answer 2 · answered Jan 29 '15 at 18:37

This would be a comment on the last paragraph of the accepted answer, if I were permitted to post one.
The first method in the referenced question parallelizes the array filling but not the array reduction. According to the specs (v4, p. 122):
"The critical construct restricts execution of the associated structured block to a single thread at a time."
Each thread reduces its own part of the array, but only one thread at a time, so in essence that part of the code runs serially. The only reason for the summing loop to be inside the parallel region is that the arrays are local to each thread, which makes sense only when filling them benefits from the parallelism.
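If the summation itself should run in parallel, one possible alternative is sketched below. It assumes, hypothetically, that each thread has already stored its local matrix in a slot of a shared array (here called copies, indexed for instance by omp_get_thread_num()). The names reduce_copies, copies, ROWS and COLS are illustrative and do not come from the question. The threads then split the rows of s between them, and each element of s is written by exactly one thread, so no critical construct is needed.

```c
#include <assert.h>

#define ROWS 2   /* assumed fixed size, for illustration only */
#define COLS 2

/* Hypothetical helper: copies[t] holds the matrix produced by thread t.
   The rows of s are divided among the threads; for each element a thread
   sums the corresponding entries of all copies. */
void reduce_copies(int ncopies, double copies[][ROWS][COLS], double s[ROWS][COLS])
{
    assert(ncopies >= 0);

    #pragma omp parallel for
    for (int i = 0; i < ROWS; i++) {
        for (int j = 0; j < COLS; j++) {
            double sum = 0.0;
            for (int t = 0; t < ncopies; t++)
                sum += copies[t][i][j];
            s[i][j] += sum;   /* only the thread that owns row i writes here */
        }
    }
}
```

The trade-off is memory: the shared array holds one full matrix per thread, whereas the critical-section approach only needs the private copies.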
