
I'm trying matrix multiplication using MPI and would like to ask for some help understanding one issue. The machine has 6 cores, 32 KB of L1 cache, 256 KB of L2 cache and 15 MB of L3 cache. The multiplication goes like this:

vector<vector<double>> mult_mpi(vector<vector<double>> m,
                                vector<vector<double>> n) {
    int rows = m.size();   // rows of m
    int size = n.size();   // n is size x size
    vector<vector<double>> r(rows, vector<double>(size));

    // i-k-j loop order: the inner loop walks n and r with unit stride
    for (int i = 0; i < rows; ++i)
        for (int k = 0; k < size; ++k)
            for (int j = 0; j < size; ++j)
                r[i][j] += m[i][k] * n[k][j];
    return r;
}

I have the same function for int:

vector<vector<int>> mult_mpi(vector<vector<int>> m, vector<vector<int>> n);

Then I made some plots; the different line colors indicate the number of nodes.

The following plot shows time spent to multiply two int matrices:

[plot: multiplication time for int matrices, one line per node count]

And the following plot shows time spent to multiply two double matrices:

[plot: multiplication time for double matrices, one line per node count]

Why do I get the same times for 4 and 6 nodes in the double case? Am I running into a memory bandwidth limit?

I tried multiple times in the last hour, with the same result. I also checked the machine load with top, and as far as I can tell I'm the only one on it.

KcFnMi
  • I don't know the answer to your question, but just so you're aware, this is a slow implementation of matrix multiplication. – harold Jul 25 '18 at 08:48
  • Would you suggest an alternative? Since I'm just starting, the simpler the better, I guess. Would you say the time I'm getting is compatible with the hardware? – KcFnMi Jul 25 '18 at 08:55
  • Are you getting a similar plot consistently, or did it happen just once? It might be related to the load on the CPU at the particular moment you made your measurements. – Ivan Rubinson Jul 25 '18 at 09:06
  • Unfortunately, everything that makes matrix multiplication fast also makes it more complicated. For reference, on a Sandy Bridge E (I guess you use an i7-3960X?) your DP FLOP/cycle ceiling is 4 multiplies and 4 adds; you can work out a minimum time from that based on clock speed (which depends on how many cores are active), core count, and matrix size, and then see how close you are to it. – harold Jul 25 '18 at 09:07
  • I also suggest [Eigen](http://eigen.tuxfamily.org/index.php?title=Main_Page), a really easy-to-use library for matrix algebra. – bracco23 Jul 25 '18 at 09:12
  • @harold it's an Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz – KcFnMi Jul 25 '18 at 09:29
  • @IvanRubinson Yes, I tried multiple times in the last hour, same result. – KcFnMi Jul 25 '18 at 09:32
  • @KcFnMi: maybe if you transpose (in memory) the matrix that is read in column order before the multiplication, you'll get better results (you'll need a cache-friendly transpose algorithm). But maybe there are better algorithms. – geza Jul 25 '18 at 09:35
  • @KcFnMi ok, different numbers then: 8 DP FMAs per cycle (8 adds + 8 multiplies, but only if they appear together). I've posted [some tips](https://codereview.stackexchange.com/a/193734/36018) for fast MMM before that you could use if you're interested. (A rough worked version of this lower-bound estimate follows these comments.) – harold Jul 25 '18 at 09:38
  • For a specific discussion of your observed performance, you should prepare a [mcve] and provide a better system specification including compiler etc. – Zulan Jul 25 '18 at 09:47
  • At a guess, you're probably running into the limit on the bandwidth to memory. More cores won't change that. – Jerry Coffin Jul 25 '18 at 09:52
  • As a comparison, on my Haswell (4770K, 3.9 GHz), multiplying two 4096x4096 matrices together on a single core takes about 1.5 seconds, and it's not optimal yet. Note that MMM is rather unique in that it does not need to be limited by memory bandwidth, since there is a cubic amount of arithmetic and only a square amount of data - but it *will* be limited by memory bandwidth if you don't implement tiling. – harold Jul 25 '18 at 09:55
  • Am I implementing tiling? Here I just divide the rows across the nodes (i.e. for 6000x6000, each node receives 1000 rows) and then each node runs that multiplication code. – KcFnMi Jul 25 '18 at 10:38
  • @KcFnMi yes a little, but the tiles are still much larger than the caches – harold Jul 25 '18 at 10:49
  • OT, but is it by intention that you did not use k as the innermost iteration? That could decrease the frequency of cache misses. – Red.Wave Jul 25 '18 at 12:01
  • Curious if your compiler is using the optimal vector math for this as well. – Michael Dorgan Jul 25 '18 at 23:54
  • @Red.Wave Yes, I tried using k as the innermost iteration before, but having it there seems to be slower. To my understanding, that causes longer jumps (to the next row at each iteration*) through the matrix, while in the current order the jump is just to the next element. *I'm not sure whether having the whole matrix in a contiguous piece of memory would help. – KcFnMi Jul 27 '18 at 00:43
  • Currently your code doesn't contain any MPI instructions! Therefore we can't really tell what is going wrong or right. Could you please provide a [Minimal, Complete, and Verifiable example](https://stackoverflow.com/help/mcve). – jan.sende Jul 28 '18 at 10:26
  • There is a related discussion at https://stackoverflow.com/questions/1437501/what-is-a-good-free-open-source-blas-lapack-library-for-net-c/52138388#52138388 – Robert van de Geijn Sep 02 '18 at 15:30
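For what it's worth, here is a rough, hypothetical version of the lower-bound estimate harold describes above, assuming 6000x6000 matrices (the size mentioned in the comments), the E5-2620 v3's nominal 2.4 GHz clock, 6 active cores, and 16 double-precision FLOPs per core per cycle (8 FMAs on Haswell); turbo and AVX frequency effects are ignored:

#include <cstdio>

int main() {
    // All numbers below are assumptions for illustration, not measured values.
    const double N     = 6000;              // matrix dimension (from the comments)
    const double flops = 2.0 * N * N * N;   // one multiply + one add per inner iteration
    const double clock = 2.4e9;             // E5-2620 v3 nominal clock, ignoring turbo/AVX clocks
    const double fpc   = 16.0;              // DP FLOPs per core per cycle (8 FMAs)
    const double cores = 6.0;
    std::printf("compute-bound floor: %.2f s\n", flops / (clock * fpc * cores));  // ~1.9 s
    return 0;
}

Anything much slower than roughly 1.9 s for the full 6000x6000 multiplication is being lost to memory traffic, load imbalance, or the data-layout issues discussed in the answer below.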

1 Answer


Are you sure that you aren't timing the allocation of 4K vector<>'s...?

vector<vector<>> is not a suitable type for squeezing out optimal performance. Matrix multiplication is one of the best operations in terms of scalability and "computation density" with respect to memory accesses: the number of operations scales as O(N^3) while the amount of data scales only as O(N^2).
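As a purely illustrative sketch of that point (the name flat_mult is made up, not from the question): storing the matrix in a single contiguous, row-major buffer removes one level of pointer chasing and keeps the inner loop at unit stride.

#include <vector>

// Sketch: contiguous row-major storage instead of vector<vector<double>>.
std::vector<double> flat_mult(const std::vector<double>& m,
                              const std::vector<double>& n,
                              int rows, int size) {
    std::vector<double> r(rows * size, 0.0);
    for (int i = 0; i < rows; ++i)
        for (int k = 0; k < size; ++k) {
            const double mik = m[i * size + k];
            for (int j = 0; j < size; ++j)      // unit-stride access to n and r
                r[i * size + j] += mik * n[k * size + j];
        }
    return r;
}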

In fact it is used to benchmark the TOP500 fastest systems on earth: HPL stands for "High-Performance Linpack", LINPACK being a reference implementation for linear algebra. Guess what... the operation used in the benchmark is DGEMM, that is, "Double-precision GEneral Matrix-Matrix multiply".

DGEMM is the name of the operation in the BLAS library, the de-facto standard interface for linear algebra. Today there are many locally optimized BLAS libraries, both commercial (Intel MKL, IBM ESSL, ...) and open source (ATLAS), but all of them use the same original BLAS interface (originally Fortran, now C too). (Note: the original reference implementation is not optimized.)
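For illustration, a call through the C interface (CBLAS) of an optimized BLAS (OpenBLAS, MKL, ...) looks roughly like this; the sketch assumes square N x N matrices stored contiguously in row-major order:

#include <cblas.h>   // link against an optimized BLAS, e.g. -lopenblas

// C = 1.0 * A * B + 0.0 * C, all matrices N x N, row-major, contiguous.
void dgemm_example(const double* A, const double* B, double* C, int N) {
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                N, N, N,
                1.0, A, N,
                     B, N,
                0.0, C, N);
}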

On top of BLAS there are the LAPACK libraries: linear system solvers, eigensolvers, ... There are optimized LAPACK libraries too, but usually 90% of the performance is squeezed out simply by using an optimized BLAS library.

One powerful MPI-based parallel library that I know well (not the only one; HPL is another) is ScaLAPACK. It contains PBLAS (parallel BLAS) and, among other things, an optimized parallel version of DGEMM.

ScaLAPACK comes with the SLUG (the ScaLAPACK Users' Guide), where you can find an excellent explanation of the block-cyclic distribution, the data-distribution strategy used to squeeze out optimal performance when arranging linear algebra problems on parallel systems.
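The ownership rule behind the block-cyclic distribution fits in a few lines; the process-grid shape and block size below are made-up illustrative values:

#include <cstdio>

// Sketch of the 2D block-cyclic mapping used by ScaLAPACK/PBLAS:
// block (bi, bj) goes to process (bi mod Pr, bj mod Pc) of a Pr x Pc grid.
int main() {
    const int N = 8, nb = 2;      // global N x N matrix split into nb x nb blocks
    const int Pr = 2, Pc = 3;     // 2 x 3 process grid
    for (int bi = 0; bi < N / nb; ++bi)
        for (int bj = 0; bj < N / nb; ++bj)
            std::printf("block (%d,%d) -> process (%d,%d)\n",
                        bi, bj, bi % Pr, bj % Pc);
    return 0;
}

Each process ends up with blocks scattered over the whole matrix, which balances both the work and the memory traffic.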

To obtain optimal performance, however, you will need to link your MPI executable against a locally optimized BLAS library. Or write your own, but you are not alone, so don't reinvent the wheel.

Local optimization is obtained by accessing the matrices not by row, nor by column, but by block, with the block size tuned to optimize the use of the caches and/or of the TLB. (I recall libgoto, another BLAS library, which optimized for TLB misses and, some time ago, reached and surpassed Intel MKL on some systems.) Find more information, for example, in this ATLAS paper.
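A minimal sketch of that blocked access pattern, over contiguous row-major N x N matrices (the block size BS is an illustrative tuning parameter; a real BLAS also packs the blocks and vectorizes the inner kernel):

#include <algorithm>
#include <vector>

// Naive cache-blocked multiply: c += a * b, with c zero-initialized by the caller.
void blocked_mult(const std::vector<double>& a, const std::vector<double>& b,
                  std::vector<double>& c, int N, int BS = 64) {
    for (int ii = 0; ii < N; ii += BS)
        for (int kk = 0; kk < N; kk += BS)
            for (int jj = 0; jj < N; jj += BS)
                // Multiply one BS x BS block; it stays resident in cache.
                for (int i = ii; i < std::min(ii + BS, N); ++i)
                    for (int k = kk; k < std::min(kk + BS, N); ++k) {
                        const double aik = a[i * N + k];
                        for (int j = jj; j < std::min(jj + BS, N); ++j)
                            c[i * N + j] += aik * b[k * N + j];
                    }
}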

In any case, if you really want to do it yourself... I would start by analyzing how the other wheels have been forged before trying to make my own ;)

Sigi
  • Yes, I take the time on the next line, just after allocating the matrix: `auto v = mmi::gen_rand_matrix(size);` `double time = MPI_Wtime();`. But I don't understand how that matters; including it would affect all cases, I think. – KcFnMi Jul 27 '18 at 00:15
  • Allocations on SMP systems are possibly serialized, or at any rate not necessarily independent. In any case, they are not meant to be measured in your experiment: you are interested in timing the kernel only. There are also shared-memory (in-node, OpenMP) versions of linear algebra libraries; I remember essl_smp from years ago. – Sigi Jul 27 '18 at 17:36
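A sketch of what "timing the kernel only" could look like with MPI (the barrier, the reduction and the placeholder kernel call are illustrative, not the asker's actual code):

#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // ... allocate and fill the local matrices here, outside the timed region ...

    MPI_Barrier(MPI_COMM_WORLD);            // start all ranks together
    double t0 = MPI_Wtime();
    // local_result = mult(local_rows, n);  // the multiplication kernel under test (placeholder)
    double t1 = MPI_Wtime();

    double local = t1 - t0, slowest;
    MPI_Reduce(&local, &slowest, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0) std::printf("kernel time: %f s\n", slowest);

    MPI_Finalize();
    return 0;
}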