1

I'm trying to write matrix-by-vector multiplication in C (OpenMP), but my program slows down when I add processors...

1 proc - 1.3 s
2 proc - 2.6 s
4 proc - 5.47 s

I tested this on my PC (Core i5) and on our school's cluster, and the result is the same (the program slows down).

Here is my code (the matrix is 10000 x 10000 and the vector has 10000 elements):

double start_time = clock();
#pragma omp parallel private(i) num_threads(4)
{
    tid = omp_get_thread_num();
    world_size = omp_get_num_threads();
    printf("Threads: %d\n",world_size);

    for(y = 0; y < matrix_size ; y++){
        #pragma omp parallel for private(i) shared(results, vector, matrix)
        for(i = 0; i < matrix_size; i++){
                results[y] = results[y] + vector[i]*matrix[i][y];   
        }
    }
}
double end_time = clock();
double result_time = (end_time - start_time) / CLOCKS_PER_SEC;
printf("Time: %f\n", result_time);

My question is: is there any mistake? To me it seems pretty straightforward and it should speed up.

Tom Fenech
damian

3 Answers

5

I essentially already answered this question in parallelizing-matrix-times-a-vector-by-columns-and-by-rows-with-openmp.

You have a race condition when you write to results[y]. To fix this, and still parallelize the inner loop, you have to make private versions of results[y], fill them in parallel, and then merge them in a critical section.

In the code below I assume you're using double; replace it with float or int or whatever data type you're actually using. (Note that your inner loop runs over the first index of matrix[i][y], which is cache-unfriendly.)

#pragma omp parallel num_threads(4)
{
    int y,i;
    double* results_private = (double*)calloc(matrix_size, sizeof(double));
    for(y = 0; y < matrix_size ; y++) {
        #pragma omp for
        for(i = 0; i < matrix_size; i++) {
            results_private[y] += vector[i]*matrix[i][y];   
        }
    }
    #pragma omp critical
    {
        for(y=0; y<matrix_size; y++) results[y] += results_private[y];
    }
    free(results_private);
}

If this is a homework assignment and you want to really impress your instructor, then it's possible to do the merging without a critical section. See fill-histograms-array-reduction-in-parallel-with-openmp-without-using-a-critic to get an idea of what to do, though I can't promise it will be faster.
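
For reference, here is a minimal sketch of that critical-free merge, using the same matrix, vector, results, and matrix_size as above. The names results_all and my_results are made up for this example, and it assumes an nthreads x matrix_size scratch buffer is acceptable: each thread fills its own row of the buffer, and the merge over threads is itself a worksharing loop, so no critical section is needed.

double *results_all;
int nthreads;
#pragma omp parallel num_threads(4)
{
    const int tid = omp_get_thread_num();
    #pragma omp single
    {
        nthreads = omp_get_num_threads();
        /* one row of partial results per thread, zero-initialized */
        results_all = calloc((size_t)nthreads * matrix_size, sizeof(double));
    }   /* implicit barrier: all threads wait for the allocation */
    double *my_results = &results_all[tid * matrix_size];
    int y, i;
    for(y = 0; y < matrix_size; y++) {
        #pragma omp for
        for(i = 0; i < matrix_size; i++) {
            my_results[y] += vector[i]*matrix[i][y];
        }
    }
    /* merge: each thread sums a different slice of y, so no locking is needed */
    #pragma omp for
    for(y = 0; y < matrix_size; y++) {
        int t;
        for(t = 0; t < nthreads; t++)
            results[y] += results_all[t * matrix_size + y];
    }
}
free(results_all);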

Z boson
  • I don't know if I'm correctly interpreting this code. – damian May 02 '14 at 13:28
  • @damian, what do you mean? You mean you're not getting the right answer with it? – Z boson May 02 '14 at 13:31
  • Sorry, I accidentally pressed enter :) I don't know if I'm correctly interpreting this code. How does it work? Is results_private shared among threads, or is it private? If it's private: every thread has its own y variable and its own results_private, and the only thing parallelized here is the "i" loop, which is divided into pieces, for example thread 0 iterating i 0-5000 and thread 1 iterating 5000-10000? But how can you be sure that the critical section will be entered by the threads in the correct order? – damian May 02 '14 at 13:35
  • @damian, here is a working example which gets the correct result http://coliru.stacked-crooked.com/a/b172dce3c707a55b – Z boson May 02 '14 at 13:35
  • @damian, `results_private` is private (hence the suffix). All variables defined inside a parallel section are private (so `y` and `i` are private as well). It does not matter in which order the threads enter the critical section; it only matters that while one thread is inside it, no other thread can enter until that thread finishes. – Z boson May 02 '14 at 13:38
  • OK, thank you very much for the help and the great answer. One more thing: in relation to this example, how can you be sure that the output won't be, for example, 26, 20, 32? Oh OK, that was a stupid question! :) – damian May 02 '14 at 13:40
  • @damian, the code is essentially calculating partial sums which don't depend on the thread number (i.e. the order). So the partial sums are done in parallel and the full sum is done serially. The point is that the partial sums need O(n^2) calculations and merging only O(n) so the O(n^2) operation dominates (for large enough n). – Z boson May 02 '14 at 13:44
  • OK, I tried this one, but still: 1 thread - 3.5 s, 2 threads - 3.6 s, 4 threads - 6 s. I also tried increasing the matrix size and it didn't help (still faster on 1 thread). I'm compiling with: gcc -Wall wektor2.c -o wektor2.o -fopenmp – damian May 02 '14 at 13:53
  • @damian, add `-O3` to your compiler options. – Z boson May 02 '14 at 13:59
  • That did not help, the time is still the same. – damian May 02 '14 at 14:03
  • @damian, what is your objective? To parallelize the inner loop or to make your method faster? If you want to make it faster then parallelize the outer loop as Sergey L. suggested. My answer is for parallelizing the inner loop (correctly). For a 10000x10000 matrix, merging the partial sums could be slow, I guess. In any case, for matrices that size this will be memory bound, so if you optimize correctly your code won't be any faster with multiple threads. – Z boson May 02 '14 at 14:18
  • My objective was to make my method faster :) – damian May 02 '14 at 17:57
  • @damian, in that case you need to fix your cache usage and not worry about threads. You want to do `vector[i]*matrix[y][i]`. Matrix*vector does O(n^2) loads and O(n^2) calculations, so it's memory bound and not computation bound. – Z boson May 02 '14 at 19:50
  • 1
    It was all about using clock() method for time measuring. While using omp_get_wtime() program speeds up so this is solution to my problem. – damian May 06 '14 at 14:31
  • @damian, I'm glad you figured it out. I should have pointed that out (I always use `omp_get_wtime()`). Your code had other problems as well (a race condition), which I fixed. It's also still cache-unfriendly. – Z boson May 06 '14 at 14:34
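
To make that timing point concrete, here is a small self-contained sketch (the dummy loop is just a stand-in for the real matrix-vector product): clock() adds up the CPU time of all threads, so it grows as threads are added, while omp_get_wtime() measures elapsed wall-clock time, which is what you want when measuring speedup.

#include <omp.h>
#include <stdio.h>
#include <time.h>

int main(void)
{
    clock_t c0 = clock();          /* CPU time, summed over all threads */
    double  w0 = omp_get_wtime();  /* elapsed wall-clock time           */

    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < 100000000L; i++)
        sum += 1e-8;               /* stand-in for the real computation */

    printf("CPU time:  %f s\n", (double)(clock() - c0) / CLOCKS_PER_SEC);
    printf("Wall time: %f s\n", omp_get_wtime() - w0);
    printf("sum = %f\n", sum);     /* keep the loop from being optimized out */
    return 0;
}
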
1

I've not done any parallel programming for a while now, nor any Maths for that matter, but don't you want to split the rows of the matrix in parallel, rather than the columns?

What happens if you try this:

double start_time = clock();
#pragma omp parallel private(i) num_threads(4)
{
    tid = omp_get_thread_num();
    world_size = omp_get_num_threads();
    printf("Threads: %d\n", world_size);

    #pragma omp parallel for private(y) shared(results, vector, matrix)
    for(y = 0; y < matrix_size ; y++){
        for(i = 0; i < matrix_size; i++){
            results[y] = results[y] + vector[i]*matrix[i][y];
        }
    }
}
double end_time = clock();
double result_time = (end_time - start_time) / CLOCKS_PER_SEC;
printf("Time: %f\n", result_time);

Also, are you sure everything's OK compiling and linking with OpenMP?
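
One simple way to check (the file name check.c and the exact flags are just examples, based on the gcc command mentioned in the comments above, e.g. gcc -O3 -fopenmp check.c): if OpenMP is not actually enabled, the pragmas are silently ignored and only one thread runs.

#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

int main(void)
{
#ifdef _OPENMP
    printf("_OPENMP is defined: %d\n", _OPENMP);
    #pragma omp parallel
    {
        #pragma omp single
        printf("Threads in parallel region: %d\n", omp_get_num_threads());
    }
#else
    printf("Compiled without OpenMP support\n");
#endif
    return 0;
}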

James
0

You have a typical case of cache-line conflicts (false sharing).

Consider that a cache line on your CPU is probably 64 bytes long. Having one processor/core write to the first element causes that cache line to be invalidated in every other core's L1/L2 and maybe L3 cache. This is a lot of overhead.

Partition your data better!

 #pragma omp parallel for private(i) shared(results, vector, matrix) schedule(static,16)

should do the trick. Increase the chunk size if this does not help.

Another optimisation is to store the result locally before you flush it down to memory.

Also, this is an OpenMP thing, but you don't need to start a new parallel region for the loop (each mention of parallel starts a new team):

#pragma omp parallel default(none) \
        shared(vector, matrix, results) \
        firstprivate(matrix_size) \
        num_threads(4)
{
    int i, y;
    #pragma omp for schedule(static,16)
    for(y = 0; y < matrix_size ; y++){
        double result = 0;
        for(i = 0; i < matrix_size; i++){
            result += vector[i]*matrix[i][y];
        }
        results[y] = result;
    }
}
Sergey L.