
I'm running the following matrix multiplication code, whose performance I'm supposed to measure:

for (int j = 0; j < COLUMNS; j++)
#pragma omp for schedule(dynamic, 10)
    for (int k = 0; k < COLUMNS; k++)
        for (int i = 0; i < ROWS; i++)
            matrix_r[i][j] += matrix_a[i][k] * matrix_b[k][j];

Yes, I know it's really slow, but that's not the point - it's purely for performance measuring purposes. I'm running 3 versions of the code depending on where I put the #pragma omp directive, and therefore depending on where the parallelization happens. The code is run in Microsoft Visual Studio 2012 in release mode and profiled in CodeXL.

One thing I've noticed from the measurements is that the option in the code snippet (with parallelization before the k loop) is the slowest, followed by the version with the directive before the j loop, and then the one with it before the i loop. The presented version also calculates a wrong result because of race conditions: multiple threads access the same cell of the result matrix at the same time. I understand why the i loop version is the fastest: each thread processes only part of the range of i, which increases temporal locality. But I don't understand what makes the k loop version the slowest. Does it have something to do with the fact that it produces the wrong result?
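For reference, the three placements presumably look like this (a reconstruction: a bare `#pragma omp for` must sit inside a `#pragma omp parallel` region, so the whole nest is assumed to be wrapped in one; only the i loop variant is written out):

    #pragma omp parallel
    {
        // "omp for" before the k loop (the snippet above): threads split the
        // k range but share the full i and j ranges, so several threads
        // accumulate into the same matrix_r[i][j] - a data race.
        //
        // "omp for" before the j loop: each thread owns a set of output
        // columns, so no two threads ever write the same cell.
        //
        // "omp for" before the i loop: each thread owns a slice of the i
        // range for every (j, k) pair - also race-free:
        for (int j = 0; j < COLUMNS; j++)
            for (int k = 0; k < COLUMNS; k++)
                #pragma omp for schedule(dynamic, 10)
                for (int i = 0; i < ROWS; i++)
                    matrix_r[i][j] += matrix_a[i][k] * matrix_b[k][j];
    }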

M. Zdunek
  • Why don't you swap the innermost and outermost loops, as [I explained yesterday](http://stackoverflow.com/a/34873966/2542702)? It's trivial to implement. – Z boson Jan 20 '16 at 08:09

2 Answers


Of course race conditions can slow the code down. When two or more threads access the same part of memory (the same cache line), that line must be loaded into the cache of each core over and over again, because every write by one thread invalidates the copies held by the others. The threads compete for a shared resource.

When two variables located too close together in memory are written and read by multiple threads, the result is the same slowdown. This is known as false sharing. In your case it is even worse: the accesses are not just too close together, they coincide exactly.
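To illustrate with a hypothetical microbenchmark (not from the question; the array names, iteration count, and 64-byte cache line size are all assumptions): two threads that repeatedly write neighbouring array elements fight over a single cache line, while padding each counter onto its own line removes the contention.

    #include <stdio.h>
    #include <omp.h>

    // volatile keeps the compiler from collapsing the increment loops.
    volatile long counters[2];            // adjacent: share one cache line

    struct padded {
        volatile long value;
        char pad[64 - sizeof(long)];      // assumed 64-byte cache line
    };
    struct padded padded_counters[2];     // one cache line per counter

    void contended()
    {
        #pragma omp parallel num_threads(2)
        {
            int t = omp_get_thread_num();
            for (long n = 0; n < 100000000; n++)
                counters[t]++;            // invalidates the other core's copy
        }
    }

    void uncontended()
    {
        #pragma omp parallel num_threads(2)
        {
            int t = omp_get_thread_num();
            for (long n = 0; n < 100000000; n++)
                padded_counters[t].value++;  // private cache line
        }
    }

    int main()
    {
        double t0 = omp_get_wtime();
        contended();
        double t1 = omp_get_wtime();
        uncontended();
        double t2 = omp_get_wtime();
        printf("shared line: %.2fs, padded: %.2fs\n", t1 - t0, t2 - t1);
        return 0;
    }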

NathanOliver

Your assumption is correct. But if we are talking about performance, and not just validating your assumption, there is more to the story.

  • The order of your indexes is a big issue, multi-threaded or not. Given that the distance in memory between mat[x][y] and mat[x][y+1] is one element, while the distance between mat[x][y] and mat[x+1][y] is dim(mat[x]), you want x to be the outer index and y the inner one, so that consecutive iterations touch adjacent memory. Given __[i][j] += __[i][k] * __[k][j]; you can see that the proper order for spatial locality is i -> k -> j (a sketch of that ordering follows below).
  • Whatever the order, there is one value that stays constant in the innermost loop and can be saved in a temporary. Given your snippet

    for (int j = 0; j < COLUMNS; j++)
        for (int k = 0; k < COLUMNS; k++)
            for (int i = 0; i < ROWS; i++)
                matrix_r[i][j] += matrix_a[i][k] * matrix_b[k][j];
    

the value of matrix_b[k][j] will be fetched from memory ROWS times (once per iteration of the i loop). You could instead have started from

    for (int j = 0; j < COLUMNS; j++)
        for (int k = 0; k < COLUMNS; k++) {
            int temp = matrix_b[k][j];
            for (int i = 0; i < ROWS; i++)
                matrix_r[i][j] += matrix_a[i][k] * temp;
        }

But given that you are writing to matrix_r[i][j], and writing is slower than reading, the write access to matrix_r[i][j] is the best one to optimize.

Unnecessary write accesses to memory

    for (int i = 0; i < ROWS; i++)
        matrix_r[i][j] += matrix_a[i][k] * matrix_b[k][j];

will write to the memory of matrix_r ROWS times each time it runs; over the whole computation, every cell of matrix_r is written COLUMNS times. Reordering the loops and accumulating into a temporary variable reduces the writes per cell to one:

    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLUMNS; j++) {
            int temp = 0;
            for (int k = 0; k < COLUMNS; k++)
                temp += matrix_a[i][k] * matrix_b[k][j];
            matrix_r[i][j] = temp;
        }

This decreases write accesses from n^3 to n^2.
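As an aside, this is where the loop swap suggested in the comments comes in: with the i -> k -> j order from the first bullet, the hoistable value is matrix_a[i][k] and the inner loop walks both matrix_r and matrix_b contiguously, although matrix_r[i][j] is then written in the inner loop again. A minimal sketch:

    for (int i = 0; i < ROWS; i++)
        for (int k = 0; k < COLUMNS; k++) {
            int temp = matrix_a[i][k];  // constant across the whole j loop
            for (int j = 0; j < COLUMNS; j++)
                matrix_r[i][j] += temp * matrix_b[k][j];
        }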

  • Now you are using threads. To maximize the efficiency of multithreading, you should isolate each thread's memory accesses from the others' as much as possible. One way to do this is to give each thread a column and prefetch that column once. A simple way to achieve that is to work with the transpose of matrix_b, so that

    matrix_r[i][j] += matrix_a[i][k] * matrix_b[k][j];       // becomes
    matrix_r[i][j] += matrix_a[i][k] * matrix_b_trans[j][k];
    

so that the innermost loop over k always deals with memory that is contiguous in both matrix_a and matrix_b_trans:

    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++) {
            int temp = 0;
            for (int k = 0; k < SAMEDIM; k++)
                temp += matrix_a[i][k] * matrix_b_trans[j][k];
            matrix_r[i][j] = temp;
        }
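For completeness, a sketch of the full idea under the same assumptions (dimension names as above, element type as in the earlier snippets): build the transpose once, O(n^2) work amortized over the O(n^3) multiply, then parallelize over rows so each thread owns whole rows of matrix_r.

    // Build the transpose once up front.
    for (int k = 0; k < SAMEDIM; k++)
        for (int j = 0; j < COLS; j++)
            matrix_b_trans[j][k] = matrix_b[k][j];

    // Each thread owns whole rows of matrix_r and streams through
    // contiguous rows of matrix_a and matrix_b_trans.
    #pragma omp parallel for schedule(dynamic, 10)
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++) {
            int temp = 0;
            for (int k = 0; k < SAMEDIM; k++)
                temp += matrix_a[i][k] * matrix_b_trans[j][k];
            matrix_r[i][j] = temp;
        }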
UmNyobe
  • I suspect that many of the optimizations you describe above are applied automatically by the compiler anyway (only when compiler optimizations are enabled, of course) – Jeremy Friesner Jan 20 '16 at 03:07
  • @JeremyFriesner, no, the compiler won't rearrange the loops and create a temporary variable. The right solution here is to swap the innermost and outermost loops. – Z boson Jan 20 '16 at 11:43
  • I can't speak to your particular compiler, of course, but in general C++ optimizers are *allowed to* do those optimizations (as long as they don't change any user-visible behavior of the program), and modern C++ compilers (g++, MSVC, clang, etc.) typically do implement them. But don't take my word for it; here are some links describing it: https://en.wikipedia.org/wiki/Loop_optimization http://www.compileroptimizations.com/category/hoisting.htm – Jeremy Friesner Jan 20 '16 at 15:32
  • @JeremyFriesner While compilers sometimes can, it's often not possible, because the compiler sees that reordering the loops would yield different results (floating-point arithmetic is not associative). It's definitely a good idea to reorder loops yourself whenever possible. – NoseKnowsAll Jan 20 '16 at 15:55
  • @JeremyFriesner, I'm referring to reordering the loops, not the compiler determining that a value is constant within the loop. Did you notice that in the OP's code the writes are *not* constant within the loop? In fact, if the compiler reordered the loops and did the reduction over `k` so that the reduction was constant within the loop, it would not be efficient (though it would be better than the OP's current code). The best solution swaps the innermost and outermost loops, but then the writes are still not constant within the loop. – Z boson Jan 21 '16 at 08:29
  • @Zboson: I've read that one of the SPEC benchmarks was "broken" by compilers doing a loop-interchange optimization. I forget the details, but IIRC Sun was the first vendor to manage it. Their [compile options for SPECcpu2000](https://www.spec.org/cpu2000/flags/SUN-20030413.html) document some stuff about loop interchange. I do remember reading that this optimization was quite brittle and mostly only worked on things that were very similar to the source of that one loop in that one SPEC component. I don't think this is the same as the `462.libquantum` parallelization thing. – Peter Cordes Feb 15 '16 at 17:24
  • So you're correct, it's still not a good idea to count on the compiler doing loop-interchange for you. But it is allowed, and done in practice in limited circumstances. I did find good discussion of 462.libquantum here: http://mrob.com/pub/comp/benchmarks/spec.html – Peter Cordes Feb 15 '16 at 17:25
  • @PeterCordes, yeah, I was thinking of asking a question about this after I wrote that comment, because it seems that loop interchange is something a compiler could do. I may still ask a question about it, but feel free to ask one yourself. I don't have much time these days, and you probably have more info now anyway. – Z boson Feb 16 '16 at 09:13
  • @Zboson: I probably won't get around to doing that. I'm not that curious. My understanding is that it's basically still an unsolved problem, because it's too hard to prove that it's safe. As I understand it, the only reason some compilers can do it for SPEC is because they chose to recognize that special case, not because it was an easy "general" case. I didn't look at the code at the time when I saw this discussion. I'm pretty sure it was on http://realworldtech.com/ forums, which unfortunately aren't searchable (or even indexed by Google :/) – Peter Cordes Feb 16 '16 at 09:22
  • @PeterCordes, I am starting to wonder when compilers will start using machine learning for optimization (we could call it deep optimization). I guess GCC can already do this to some degree with profiling, but I have never tried it. Maybe the JIT compilers (e.g. for C#/Java) do this already. – Z boson Feb 16 '16 at 09:34