I have n local copies of a matrix, say 'local', one in each of n threads. I want to update a global shared matrix 's' so that each of its elements is the sum of the corresponding elements of all the local matrices. For example, s[0][0] = local_1[0][0] + local_2[0][0] + ... + local_n[0][0].

I wrote the following loop to achieve it -

#pragma omp parallel for
for(int i=0;i<rows;i++)
{   
    for(int j=0;j<cols;j++)
        s[i][j]=s[i][j]+local[i][j];
}  

This doesn't seem to work. Could someone kindly point out where I am going wrong?

Updated with example -

Suppose there are 3 threads, with the following local matrices -


thread 1
local =  1  2
         3  4

thread 2 
local =  5  6 
         7  8

thread 3 
local =  1  0
         0  1

The shared matrix would then be

s     =  7  8 
        10 13

2 Answers

Throughout this answer I'm assuming that you have correctly created a private version of local on each thread, as your question and example (but not your code snippet) indicate.

As you've written the code, the variable i is private, that is, each thread has its own copy. Since it's the iteration variable of the outermost loop, each thread gets its own set of values to work on. Supposing that you have 3 threads and 3 rows, then, with a typical static schedule, thread 0 will get i value 0, thread 1 will get 1, and so on. With more rows to iterate over, each thread would get more i values to work on. In all cases each thread gets a disjoint subset of the set of all values that i takes.

However, if thread 0 gets only i==0 to work on, the computation

s[i][j]=s[i][j]+local[i][j];

will only ever work on the 0-th row of local on thread 0. With the example I'm using, i on thread 0 never equals 1, so the values in row 1 of local on thread 0 never get added to row 1 of s.

Between them the 3 threads will update the 3 rows of s but each will only add its own row of its own version of local.

As for how to do what you want to do, have a look at this question and the accepted answer. You are attempting an array reduction which, for reasons explained here, is not directly supported in C or C++.
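
One common way to do such a reduction by hand is sketched below. It is only an illustration under assumptions, not the code from the referenced answer: the function name sum_into_shared, the input array contrib, and the fixed sizes ROWS and COLS are all made up for the example. Each thread accumulates into its own private local, and then every thread adds its whole local (all rows, not just the rows it happened to be assigned) into s, one thread at a time.

```c
#define ROWS 2   /* assumed fixed size, for illustration only */
#define COLS 2

/* Hypothetical helper: adds `count` input matrices into the shared matrix s.
   Each thread first sums its share of the inputs into a private matrix. */
void sum_into_shared(int s[ROWS][COLS], int contrib[][ROWS][COLS], int count)
{
    #pragma omp parallel
    {
        /* Declared inside the parallel region, so every thread has its own. */
        int local[ROWS][COLS] = {{0}};

        /* The input matrices are divided among the threads. */
        #pragma omp for
        for (int k = 0; k < count; k++)
            for (int i = 0; i < ROWS; i++)
                for (int j = 0; j < COLS; j++)
                    local[i][j] += contrib[k][i][j];

        /* Every thread adds ALL rows of its own local into s.
           critical lets only one thread update s at a time. */
        #pragma omp critical
        {
            for (int i = 0; i < ROWS; i++)
                for (int j = 0; j < COLS; j++)
                    s[i][j] += local[i][j];
        }
    }
}
```

Called with the three matrices from the question as contrib and s initialised to zero, this leaves s as 7 8 / 10 13. The result does not depend on how many threads actually run, because each thread's local only holds the inputs that thread processed. It needs the compiler's OpenMP switch (for example -fopenmp with GCC) to run in parallel.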

High Performance Mark

This should be a comment on the last paragraph of the answer above, if I were permitted to post one.
The first method in the referenced question parallelizes the array filling but not the array reduction. According to the OpenMP specification (v4, p. 122):
"The critical construct restricts execution of the associated structured block to a single thread at a time."
Each thread adds its own array into the result, but only one thread after the other, so in essence the summing code runs serially. The only reason for the summing loop to be inside the parallel region is that the arrays are local to each thread, which makes sense only when filling them benefits from the parallelism.
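
For comparison, here is a sketch of a variant without a hand-written critical block. It assumes a compiler that supports OpenMP 4.5 or later, where a reduction clause may name an array section in C and C++ (this is newer than the v4 specification quoted above). The names sum_with_reduction, contrib, flat, ROWS and COLS are hypothetical. The runtime gives each thread a private copy of the section and combines the copies at the end; how it combines them, and whether that is any faster, is up to the implementation.

```c
#define ROWS 2   /* assumed fixed size, for illustration only */
#define COLS 2

/* Hypothetical helper: adds `count` input matrices into s using an
   array-section reduction (requires OpenMP 4.5 or later). */
void sum_with_reduction(int s[ROWS][COLS], int contrib[][ROWS][COLS], int count)
{
    /* View the matrix as one contiguous run of ROWS * COLS ints. */
    int *flat = &s[0][0];

    /* Each thread works on a private, zero-initialised copy of the section;
       the copies are added to the original values of s after the loop. */
    #pragma omp parallel for reduction(+:flat[:ROWS * COLS])
    for (int k = 0; k < count; k++)
        for (int i = 0; i < ROWS; i++)
            for (int j = 0; j < COLS; j++)
                flat[i * COLS + j] += contrib[k][i][j];
}
```

The loop is distributed over the input matrices rather than over the rows of s, so no thread ever skips a row of the data it owns.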

NameOfTheRose