I want to apply a specific filter to a matrix, sweeping sequentially from [0][0] to the end:

A[i][j] = 0.2 * (A[i][j] + A[i+1][j] + A[i-1][j] + A[i][j+1] + A[i][j-1])

If [i][j] is, for example, [0][0] (the first value in the matrix), I use zeros for the missing values to the left and above.

I am trying to understand why the parallel version of my code is slower than the sequential one.

When computing with multiple threads, I use the fact that the work along each anti-diagonal is independent. I deliberately extend the matrix by two rows and two columns (filled with zeros) to simplify the filter computation.

I have also tried various matrix dimensions (up to 7000x7000).

My problem: http://15418.courses.cs.cmu.edu/fall2017/lecture/progbasics/slide_032

Sequential version:

for (int i = 1; i < r - 1; i++) {
    for (int j = 1; j < c - 1; j++) {
        arr[i][j] = 0.2f * (arr[i][j] + arr[i][j - 1] + arr[i - 1][j] + arr[i][j + 1] + arr[i + 1][j]);
    }
}

Parallel version:

int n = r - 2;                                    // size of the inner (unpadded) matrix
for (int slice = 0; slice < 2 * n - 1; ++slice) { // walk the anti-diagonals
    int z = (slice < n) ? 0 : slice - n + 1;      // first valid index on this diagonal
    #pragma omp parallel for schedule(static)     // cells on one diagonal are independent
    for (int j = z; j <= slice - z; ++j) {
        // element (j, slice - j) of the inner matrix, shifted by the zero padding
        pl_arr[j + 1][slice - j + 1] = 0.2f * (pl_arr[j + 1][slice - j + 1]
                                             + pl_arr[j + 1][slice - j]
                                             + pl_arr[j][slice - j + 1]
                                             + pl_arr[j + 1][slice - j + 2]
                                             + pl_arr[j + 2][slice - j + 1]);
    }
}

Rest of the code:

int r = 7000, c = 7000;

r = r + 2;    // two extra rows of zero padding
c = c + 2;    // two extra columns of zero padding

/* initialize random seed: */
srand(time(NULL));

float **arr = (float **)malloc(r * sizeof(float *));
for (int i = 0; i < r; i++)
    arr[i] = (float *)malloc(c * sizeof(float));

float **pl_arr = (float **)malloc(r * sizeof(float *));
for (int i = 0; i < r; i++)
    pl_arr[i] = (float *)malloc(c * sizeof(float));


for (int i = 0; i < r; i++) {
    for (int j = 0; j < c; j++) {
        if ((i == 0) || (i == (r - 1)) || (j == 0) || (j == (c - 1)) ){
            arr[i][j] = 0;
            pl_arr[i][j] = 0;
        }
        else {
            arr[i][j] = rand() % 99  + 1;
            pl_arr[i][j] = arr[i][j];
        }
    }
}

#pragma omp parallel for schedule(static): the for construct splits the loop so that each thread in the current team handles a different portion of its iterations.
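For illustration, a minimal standalone sketch of that behavior (omp_get_thread_num() is the standard OpenMP runtime call; compile with -fopenmp):

#include <omp.h>
#include <stdio.h>

int main(void) {
    // schedule(static) divides the iteration space into contiguous
    // chunks and hands one chunk to each thread in the team.
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < 8; ++i)
        printf("iteration %d ran on thread %d\n", i, omp_get_thread_num());
    return 0;
}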

Result: the parallel version is always slower than the sequential one.

chomaco
  • You are creating your 2d arrays in a very naive way. The row data is strewn all over the heap. If you allocated the entire 2D array data in contiguous memory, maybe the difference would be lessened. [See this](https://stackoverflow.com/questions/21943621/how-to-create-a-contiguous-2d-array-in-c/21944048#21944048) – PaulMcKenzie Aug 25 '19 at 22:12
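A minimal sketch of the contiguous layout the comment suggests (the names data and rows are illustrative): one allocation holds the whole payload, and a small second allocation rebuilds the arr[i][j] syntax on top of it.

// One block holds all r*c floats; row pointers are derived from it,
// so rows[i][j] keeps working but all rows sit back-to-back in memory.
float *data = (float *)malloc((size_t)r * c * sizeof(float));
float **rows = (float **)malloc(r * sizeof(float *));
for (int i = 0; i < r; i++)
    rows[i] = data + (size_t)i * c;

// ... use rows[i][j] as before ...

free(rows);   // freeing takes two calls instead of r + 1
free(data);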

1 Answer


If you work out what happens in the sequential version of the loop, you'll see that the inner loop accesses sequential memory addresses (or, more precisely, three memory ranges, with each range's addresses accessed sequentially).
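Concretely, for a fixed i, each step of j pulls from three rows, and within each row the addresses only move forward (annotations added for illustration):

// Loads performed by one inner-loop iteration of the sequential version:
//   arr[i - 1][j]                     row above:   one forward-moving range
//   arr[i][j - 1] ... arr[i][j + 1]   current row: one forward-moving range
//   arr[i + 1][j]                     row below:   one forward-moving range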

Modern CPUs are very good at plowing through consecutive memory addresses. This is why std::vector can be counter-intuitively faster than std::list in many use cases.

Now do the same for the parallel version of the loop. Work out, on paper and in pencil, what each thread ends up hitting. It looks like each one iterates vertically through the matrix, across multiple, individually allocated rows. Those won't be consecutive memory addresses; they'll be all over the place, which is far less optimal.

You can verify this trivially by having each thread log the raw memory addresses it is busting through, then looking at the combined log from all execution threads; now compare it with the same log for the sequential version.
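For example, a throwaway probe like this (a sketch; it reuses the question's pl_arr, slice, and z, prints instead of computing, and assumes <stdio.h>) makes the pattern visible:

// Replace the parallel inner loop's body with a print of the target address.
// Consecutive j values land in different, separately malloc'd rows, so the
// printed addresses jump around the heap instead of advancing linearly.
for (int j = z; j <= slice - z; ++j)
    printf("j=%d writes &pl_arr[%d][%d] = %p\n",
           j, j + 1, slice - j + 1, (void *)&pl_arr[j + 1][slice - j + 1]);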

To add insult to injury: on typical modern architectures, memory is divided into larger blocks called cache lines. It looks like the parallel version will have multiple execution threads accessing neighboring memory addresses, and many of them will fall into the same cache line; when multiple CPU execution units have to write to the same cache line, even to different addresses within it, they have to perform a complicated song-and-dance routine to avoid stepping on each other's toes. This is commonly known as false sharing.
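For intuition, a small helper (a sketch; the 64-byte line size is an assumption, typical of x86, rather than something queried from the hardware) that tests whether two addresses land in the same line:

#include <stdint.h>

#define CACHE_LINE 64   /* assumed line size; 64 bytes is typical on x86 */

/* With 4-byte floats, CACHE_LINE / sizeof(float) = 16 neighboring elements
   share one line, so threads writing adjacent elements contend for it. */
static int same_cache_line(const void *a, const void *b)
{
    return ((uintptr_t)a / CACHE_LINE) == ((uintptr_t)b / CACHE_LINE);
}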

Sam Varshavchik