I want to apply a specific filter to a matrix, traversing it sequentially from [0][0] to the end:
A[i][j] = 0.2 * (A[i][j] + A[i+1][j] + A[i-1][j] + A[i][j+1] + A[i][j-1])
If [i][j] is, for example, [0][0] (the first value in the matrix), I use zeros for the missing values above and to the left, so A[0][0] = 0.2 * (A[0][0] + A[1][0] + 0 + A[0][1] + 0).
I am trying to understand why the parallel version of my code is slower than the sequential one.
When computing with multiple threads, I use the fact that the work along each anti-diagonal is independent. I deliberately extend the matrix by two rows and two columns (filled with zeros) to simplify the filter computation.
I have also tried various matrix dimensions (up to 7000x7000).
My problem: http://15418.courses.cs.cmu.edu/fall2017/lecture/progbasics/slide_032
Sequential version:
for (int i = 1; i < r - 1; i++) {
    for (int j = 1; j < c - 1; j++) {
        /* in-place update: the neighbours above and to the left were already filtered */
        arr[i][j] = 0.2f * (arr[i][j] + arr[i][j - 1] + arr[i - 1][j] + arr[i][j + 1] + arr[i + 1][j]);
    }
}
Parallel version:
int n = r - 2;
for (int slice = 0; slice < 2 * n - 1; ++slice) { // along the diagonals
    int z = (slice < n) ? 0 : slice - n + 1;
    #pragma omp parallel for schedule(static) // spawns threads
    for (int j = z; j <= slice - z; ++j) {
        pl_arr[j + 1][slice - j + 1] = 0.2f * (pl_arr[j + 1][slice - j + 1] + pl_arr[j + 1][slice - j] + pl_arr[j][slice - j + 1] + pl_arr[j + 1][slice - j + 1 + 1] + pl_arr[j + 1 + 1][slice - j + 1]);
    }
}
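To make the diagonal indexing concrete, here is a minimal standalone sketch (not part of the benchmark; n = 4 is just an illustrative size) that prints which cells each slice touches, using the same slice/z arithmetic as the parallel loop above:

#include <stdio.h>

int main(void) {
    int n = 4;                                        /* size of the inner (non-padded) matrix */
    for (int slice = 0; slice < 2 * n - 1; ++slice) { /* one anti-diagonal per slice */
        int z = (slice < n) ? 0 : slice - n + 1;
        printf("slice %d:", slice);
        for (int j = z; j <= slice - z; ++j)
            printf(" (%d,%d)", j + 1, slice - j + 1); /* cell updated by this iteration */
        printf("\n");
    }
    return 0;
}

Every cell of the inner n x n matrix is listed exactly once, and the cells within one slice can be updated independently of each other.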
Rest of the code:
#include <stdlib.h>   /* malloc, srand, rand */
#include <time.h>     /* time */

int r = 7000, c = 7000;
/* pad the matrix with one extra row/column of zeros on each side */
r = r + 2;
c = c + 2;
/* initialize random seed: */
srand(time(NULL));
float **arr = (float **)malloc(r * sizeof(float *));
for (int i = 0; i < r; i++)
    arr[i] = (float *)malloc(c * sizeof(float));
float **pl_arr = (float **)malloc(r * sizeof(float *));
for (int i = 0; i < r; i++)
    pl_arr[i] = (float *)malloc(c * sizeof(float));
for (int i = 0; i < r; i++) {
    for (int j = 0; j < c; j++) {
        if ((i == 0) || (i == (r - 1)) || (j == 0) || (j == (c - 1))) {
            arr[i][j] = 0;              /* zero border */
            pl_arr[i][j] = 0;
        }
        else {
            arr[i][j] = rand() % 99 + 1;
            pl_arr[i][j] = arr[i][j];   /* both versions start from identical data */
        }
    }
}
#pragma omp parallel for schedule(static) - The for construct splits the for-loop so that each thread in the current team handles a different portion of the loop.
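As a minimal illustration of that (assuming OpenMP is available and the example is compiled with -fopenmp), each thread reports which iterations of a small loop it was assigned under schedule(static):

#include <stdio.h>
#include <omp.h>

int main(void) {
    /* schedule(static) hands each thread a contiguous chunk of the 16 iterations */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < 16; ++i)
        printf("iteration %2d handled by thread %d\n", i, omp_get_thread_num());
    return 0;
}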
Result: the parallel version is always slower than the sequential one.