
I have an O(n^3) matrix multiplication function in C.

#include <stdio.h>
#include <omp.h>

void matrixMultiplication(int N, double **A, double **B, double **C, int threadCount) {
  int i, j, k, tid;

  #pragma omp parallel num_threads(threadCount) shared(N, A, B, C) private(i, j, k, tid)
  {
    tid = omp_get_thread_num();
    #pragma omp for
    for (i = 0; i < N; i++) {
      printf("Thread %d starting row %d\n", tid, i);
      for (j = 0; j < N; j++) {
        for (k = 0; k < N; k++) {
          C[i][j] += A[i][k] * B[k][j];
        }
      }
    }
  }
}

I am using OpenMP to parallelize this function by splitting the iterations of the outer loop across threads. I am performing this computation on square matrices of size N = 3000 with a 1.8 GHz Intel Core i5 processor.

This processor has two physical cores and two virtual cores (four logical cores in total). I measured the following wall-clock times for the computation:

  • 1 thread: 526.06 s
  • 2 threads: 264.531 s
  • 3 threads: 285.195 s
  • 4 threads: 279.914 s

I had expected the gains to continue as I increased the number of threads up to four. However, this obviously did not occur.

Why did this happen? Is it because a physical core and its virtual core together perform no better than the physical core alone?

Brian
  • The code you have written does not work well with caches. 1. use a local variable to accumulate the partial result, 2. transpose the B matrix before the computation (see the sketch after these comments). See https://www.youtube.com/watch?v=WDIkqP4JbkE for more details :) – luantkow Feb 16 '15 at 18:58
  • what compiler (and version) and what optimization flags are you using? – Basile Starynkevitch Feb 17 '15 at 06:21
  • I'm not sure what you mean by virtual cores. Does your processor have hyper-threading (many i5 processors do not)? What exactly is your processor? Your code scales linearly with your two physical cores; you can't expect much more after that. Hyper-threading could help a little (though if you optimized enough it will be worse), but not much. – Z boson Feb 17 '15 at 07:43
  • @user3188346, you're correct that his code does not make good use of the cache but that does not really answer his question. – Z boson Feb 17 '15 at 07:45
  • @Zboson this is probably the reason why I did not answer this question. – luantkow Feb 17 '15 at 08:52
  • Call DGEMM instead. This sort of test is pretty meaningless when the baseline performance is ~1000x off what it should be. – Jeff Hammond Mar 29 '15 at 22:42
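
A minimal sketch of the first comment's two suggestions (transpose B once and accumulate into a local variable, so the innermost loop reads both operands with unit stride); matmul_transposed is an illustrative name, and malloc error handling is omitted:

#include <stdlib.h>

void matmul_transposed(int N, double **A, double **B, double **C) {
  /* BT[j][k] = B[k][j], so scanning BT[j][k] over k is unit-stride */
  double **BT = malloc(N * sizeof *BT);
  for (int j = 0; j < N; j++) {
    BT[j] = malloc(N * sizeof *BT[j]);
    for (int k = 0; k < N; k++)
      BT[j][k] = B[k][j];
  }
  for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++) {
      double sum = 0.0;               /* local accumulator */
      for (int k = 0; k < N; k++)
        sum += A[i][k] * BT[j][k];
      C[i][j] = sum;
    }
  for (int j = 0; j < N; j++)
    free(BT[j]);
  free(BT);
}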

2 Answers

Using more than one hardware thread per core can help or hurt, depending on circumstances.

It can help if one hardware thread stalls because of a cache miss, and the other hardware thread can keep going and keep the ALU busy.

It can hurt if each hardware thread forces evictions of data needed by the other thread. That is, the threads destructively interfere with each other.

One way to address the problem is to write the kernel so that each thread needs only half the cache. For example, blocked (tiled) matrix multiplication can be used to minimize the cache footprint of the computation, as in the sketch below.
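
A minimal sketch of blocked matrix multiplication, assuming the double ** layout from the question and that C is zero-initialized; BLOCK and matmul_blocked are illustrative names, and the block size would need tuning so that a few tiles fit comfortably in cache:

#define BLOCK 64

void matmul_blocked(int N, double **A, double **B, double **C) {
  for (int ii = 0; ii < N; ii += BLOCK)
    for (int kk = 0; kk < N; kk += BLOCK)
      for (int jj = 0; jj < N; jj += BLOCK)
        /* accumulate A's (ii,kk) tile times B's (kk,jj) tile
           into C's (ii,jj) tile */
        for (int i = ii; i < ii + BLOCK && i < N; i++)
          for (int k = kk; k < kk + BLOCK && k < N; k++) {
            double a = A[i][k];        /* reused across the whole j loop */
            for (int j = jj; j < jj + BLOCK && j < N; j++)
              C[i][j] += a * B[k][j];
          }
}

A thread working on one output tile then touches only a few BLOCK x BLOCK tiles at a time instead of entire rows and columns.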

Another way is to write the algorithm in a way such that both threads operate on the same data at the same time, so they help each other bring data into cache (constructive interference). This approach is admittedly hard to do with OpenMP unless the implementation has good support for nested parallelism.

Arch D. Robison
  • I have not heard of this term of constructive interference with parallelism. That's interesting. I have observed that multiple threads get more memory bandwidth, which surprised me at first. I guess that's the same idea. Is that what you are referring to? – Z boson Feb 18 '15 at 07:25
  • It's not multiple threads getting more memory bandwidth, but using given bandwidth more efficiently by incurring fewer cache misses by having each thread reuse data brought in by another thread. E.g., for a blocked matrix multiply A*B, bring in 2x2 block portions of A and B and use 4 threads to compute each block of the 2x2 result. Each block of A and B will be used twice, so only half the bandwidth was necessary for reading the input blocks compared to letting 4 threads work on arbitrary blocks. – Arch D. Robison Feb 18 '15 at 16:30

I guess that the bottleneck is the memory (or L3 CPU cache) bandwidth. Arithmetic is quite cheap these days.

If you can afford it, try to benchmark the same code with the same data on some more powerful processor (e.g. some socket LGA 2011 i7).

Remember that on today's processors, a cache miss lasts as long as several hundred instructions (or cycles): RAM is very slow w.r.t. cache or CPU.

BTW, if you have a GPGPU you could play with OpenCL.

Also, it is probable that tuned linear algebra packages like BLAS or LAPACK (or some other numerical libraries) are far more efficient than your naive matrix multiplication; as a comment above suggests, DGEMM is the relevant routine. See the sketch below.
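
A minimal sketch of delegating the work to a BLAS implementation, assuming a CBLAS library such as OpenBLAS is installed (link with something like -lopenblas) and assuming the matrices are stored as contiguous row-major arrays rather than the double ** layout from the question:

#include <cblas.h>

/* C = A * B for N x N row-major matrices stored contiguously */
void matmul_blas(int N, const double *A, const double *B, double *C) {
  cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
              N, N, N,      /* M, N, K */
              1.0, A, N,    /* alpha, A, leading dimension of A */
              B, N,         /* B, leading dimension of B */
              0.0, C, N);   /* beta, C, leading dimension of C */
}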

You could also consider using GCC's __builtin_prefetch (see the GCC documentation), as in the sketch below.
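
A sketch of how __builtin_prefetch might be dropped into the inner loop of the question's kernel (GCC and Clang only); the prefetch distance of 8 rows is an illustrative guess that would need tuning, and on this access pattern prefetching may well not help at all:

for (k = 0; k < N; k++) {
  if (k + 8 < N)
    /* request B[k+8][j] for reading (0) with high temporal locality (3) */
    __builtin_prefetch(&B[k + 8][j], 0, 3);
  C[i][j] += A[i][k] * B[k][j];
}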

BTW, numerical computation is hard. I am not an expert at all, but I have met people who have worked on it for decades (often after a PhD in the field).

Basile Starynkevitch
  • Thanks for this. Can you elaborate a little more? Should I be expecting better performance with each additional thread for num_threads = [1,4]? – Brian Feb 16 '15 at 22:45
  • No, I cannot elaborate; it is an educated guess. It is well known that cache issues are today more important than ALU performance. – Basile Starynkevitch Feb 17 '15 at 06:06