146

Out of curiosity I decided to benchmark my own matrix multiplication function against the BLAS implementation... I was, to say the least, surprised at the result:

Custom Implementation, 10 trials of 1000x1000 matrix multiplication:

Took: 15.76542 seconds.

BLAS Implementation, 10 trials of 1000x1000 matrix multiplication:

Took: 1.32432 seconds.

This is using single precision floating point numbers.

My Implementation:

    template<class ValT>
    void mmult(const ValT* A, int ADim1, int ADim2, const ValT* B, int BDim1, int BDim2, ValT* C)
    {
        if ( ADim2!=BDim1 )
            throw std::runtime_error("Error sizes off");

        memset((void*)C,0,sizeof(ValT)*ADim1*BDim2);
        int cc2,cc1,cr1;
        for ( cc2=0 ; cc2<BDim2 ; ++cc2 )
            for ( cc1=0 ; cc1<ADim2 ; ++cc1 )
                for ( cr1=0 ; cr1<ADim1 ; ++cr1 )
                    C[cc2*ADim2+cr1] += A[cc1*ADim1+cr1]*B[cc2*BDim1+cc1];
    }
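
For reference, the BLAS side of the comparison goes through the standard `sgemm` routine. A minimal sketch of such a call via the CBLAS interface (illustrative only; the timings above came from MKL's own interface):

    #include <cblas.h>   // CBLAS header; MKL exposes the same functions via mkl.h

    // C = A * B for column-major N x N single-precision matrices
    void blas_mmult(const float* A, const float* B, float* C, int N)
    {
        cblas_sgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    N, N, N,
                    1.0f, A, N,
                    B, N,
                    0.0f, C, N);
    }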

I have two questions:

  1. Given that a matrix-matrix multiplication, say nxm * mxn, requires n*n*m multiplications, in the case above that is 1000^3 or 1e9 operations. How is it possible on my 2.6 GHz processor for BLAS to do 10*1e9 operations in 1.32 seconds? Even if multiplications were a single operation and there was nothing else being done, it should take ~4 seconds.
  2. Why is my implementation so much slower?
DeusAduro
  • 27
    BLAS has been optimized up one side and down the other by specialists in the field. I assume it is taking advantage of the SIMD floating point unit on your chip and playing lots of tricks to improve the caching behavior as well... – dmckee --- ex-moderator kitten Aug 19 '09 at 23:36
  • 7
    Still how do you do 1E10 operations on a 2.63E9 cycles/second processor in 1.3 seconds? – DeusAduro Aug 19 '09 at 23:41
  • 15
    Multiple execution units, pipe-lining, and Single Instruction Multiple Data (SIMD, which means doing the same operation on more than one pair of operands at the same time). Some compilers can target the SIMD units on common chips, but you just about always have to explicitly turn it on, and it helps to know how it all works (http://en.wikipedia.org/wiki/SIMD). Insuring against cache misses is almost certainly the hard part. – dmckee --- ex-moderator kitten Aug 19 '09 at 23:45
  • 1
    Very interesting, good thing I have no goal of re-implementing BLAS. It was just really bothering me that they were doing more operations per second than my CPU clock speed. But the SIMD could definitely be the key; it's also the MKL library, which is optimized for Intel (what I have) systems. – DeusAduro Aug 20 '09 at 00:12
  • 16
    Assumption is wrong. There are better algorithms known, see Wikipedia. – MSalters Aug 20 '09 at 07:36
  • 1
    Or should the question be "How wasteful are _normal_ processor calls"? ;) – isomorphismes Apr 07 '15 at 04:12
  • 3
    @DeusAduro: In my answer for [How to write a matrix matrix product that can compete with Eigen?](http://stackoverflow.com/questions/35620853/how-to-write-a-matrix-matrix-product-that-can-compete-with-eigen/35637007#35637007) I posted a small example on how to implement a cache efficient matrix-matrix product. – Michael Lehn Mar 01 '16 at 20:11
  • 1
    Question about this SIMD-enhanced performance: Is this what people refer to as vectorization? Or is this just standard parallelization? – Ben Sandeen Sep 03 '16 at 19:59
  • 1
    Back in '09 (when you seem to have asked the question) it was common to have at least dual (2) core and sometimes quad (4) core CPUs. In an ideal situation the execution time can be divided by the number of cores if you write an application that can utilize all cores. So 4 seconds could become 1 second if what you calculated could be split up really nicely between the computing units and you had, say, 4 cores. – mathreadler Jan 25 '17 at 17:02

8 Answers

206

A good starting point is the great book The Science of Programming Matrix Computations by Robert A. van de Geijn and Enrique S. Quintana-Ortí. They provide a free download version.

BLAS is divided into three levels:

  • Level 1 defines a set of linear algebra functions that operate on vectors only. These functions benefit from vectorization (e.g. from using SIMD such as SSE).

  • Level 2 functions are matrix-vector operations, e.g. some matrix-vector product. These functions could be implemented in terms of Level 1 functions. However, you can boost the performance of these functions if you can provide a dedicated implementation that makes use of some multiprocessor architecture with shared memory.

  • Level 3 functions are operations like the matrix-matrix product. Again you could implement them in terms of Level 2 functions. But Level 3 functions perform O(N^3) operations on O(N^2) data. So if your platform has a cache hierarchy then you can boost performance if you provide a dedicated implementation that is cache optimized/cache friendly. This is nicely described in the book. The main boost of Level 3 functions comes from cache optimization. This boost significantly exceeds the second boost from parallelism and other hardware optimizations. (Example calls from each level are sketched after this list.)
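
To make the three levels concrete, here is roughly one call from each level through the CBLAS interface (a short sketch; these are the standard single-precision routines, not any particular vendor's extensions):

    #include <cblas.h>

    void blas_levels_demo(int n, float alpha,
                          const float *x, float *y,
                          const float *A, const float *B, float *C)
    {
        // Level 1: y = alpha*x + y   (vector-vector, O(n) work on O(n) data)
        cblas_saxpy(n, alpha, x, 1, y, 1);

        // Level 2: y = alpha*A*x + y (matrix-vector, O(n^2) work on O(n^2) data)
        cblas_sgemv(CblasColMajor, CblasNoTrans, n, n,
                    alpha, A, n, x, 1, 1.0f, y, 1);

        // Level 3: C = alpha*A*B + C (matrix-matrix, O(n^3) work on O(n^2) data)
        cblas_sgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, n, n, n,
                    alpha, A, n, B, n, 1.0f, C, n);
    }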

By the way, most (or even all) of the high performance BLAS implementations are NOT implemented in Fortran. ATLAS is implemented in C. GotoBLAS/OpenBLAS is implemented in C and its performance-critical parts in Assembler. Only the reference implementation of BLAS is implemented in Fortran. However, all these BLAS implementations provide a Fortran interface such that it can be linked against LAPACK (LAPACK gains all its performance from BLAS).

Optimized compilers play a minor role in this respect (and for GotoBLAS/OpenBLAS the compiler does not matter at all).

IMHO no BLAS implementation uses algorithms like the Coppersmith–Winograd algorithm or the Strassen algorithm. The likely reasons are:

  • Maybe it's not possible to provide a cache-optimized implementation of these algorithms (i.e. you would lose more than you would win)
  • These algorithms are numerically not stable. As BLAS is the computational kernel of LAPACK this is a no-go.
  • Although these algorithms have a nice time complexity on paper, the Big O notation hides a large constant, so it only starts to become viable for extremely large matrices.

Edit/Update:

The new and groundbreaking papers for this topic are the BLIS papers. They are exceptionally well written. For my lecture "Software Basics for High Performance Computing" I implemented the matrix-matrix product following their paper. Actually I implemented several variants of the matrix-matrix product. The simplest variant is entirely written in plain C and has less than 450 lines of code. All the other variants merely optimize these loops:

    /* Micro-kernel: AB (MR x NR, stored column-wise) += A * B, where the block
       of A (MR x kc) is packed column-wise and the block of B (kc x NR) is
       packed row-wise. The first loop zeroes AB; each iteration of the second
       loop adds one rank-1 update (a column of the A block times a row of the
       B block). */
    for (l=0; l<MR*NR; ++l) {
        AB[l] = 0;
    }
    for (l=0; l<kc; ++l) {
        for (j=0; j<NR; ++j) {
            for (i=0; i<MR; ++i) {
                AB[i+j*MR] += A[i]*B[j];
            }
        }
        A += MR;
        B += NR;
    }

The overall performance of the matrix-matrix product only depends on these loops. About 99.9% of the time is spent here. In the other variants I used intrinsics and assembler code to improve the performance. You can see the tutorial going through all the variants here:

ulmBLAS: Tutorial on GEMM (Matrix-Matrix Product)

Together with the BLIS papers it becomes fairly easy to understand how libraries like Intel MKL can gain such performance. And why it does not matter whether you use row or column major storage!

The final benchmarks are here (we called our project ulmBLAS):

Benchmarks for ulmBLAS, BLIS, MKL, openBLAS and Eigen

Another Edit/Update:

I also wrote some tutorials on how BLAS is used for numerical linear algebra problems like solving a system of linear equations:

High Performance LU Factorization

(This LU factorization is for example used by Matlab for solving a system of linear equations.)

I hope to find time to extend the tutorial to describe and demonstrate how to realise a highly scalable parallel implementation of the LU factorization like in PLASMA.

Ok, here you go: Coding a Cache Optimized Parallel LU Factorization

P.S.: I also did some experiments on improving the performance of uBLAS. It actually is pretty simple to boost (yeah, play on words :) ) the performance of uBLAS:

Experiments on uBLAS.

Here is a similar project with BLAZE:

Experiments on BLAZE.

Michael Lehn
  • 4
    New link to “Benchmarks for ulmBLAS, BLIS, MKL, openBLAS and Eigen”: http://apfel.mathematik.uni-ulm.de/~lehn/ulmBLAS/#toc3 – Ahmed Fasih Nov 30 '17 at 16:03
  • It turns out IBM's ESSL uses a variation of the Strassen algorithm - https://www.ibm.com/support/knowledgecenter/en/SSFHY8/essl_welcome.html – ben-albrecht Oct 01 '18 at 13:36
  • 2
    most of the links are dead – Aurélien Pierre Jan 26 '19 at 20:03
  • 1
    A PDF of TSoPMC can be found on the page of the author, at https://www.cs.utexas.edu/users/rvdg/tmp/TSoPMC.pdf – Alex Shpilkin Feb 03 '20 at 13:07
  • 1
    Although the Coppersmith-Winograd algorithm has a nice time complexity on paper, the Big O notation hides a very large constant, so it only starts to become viable for ridiculously large matrices. – Nihar Karve Jun 26 '20 at 05:03
  • The illustrating code example does not calculate matrix-matrix product. The code treats A like it is kc x NR matrix. However A should be NR x kc matrix for multiplying with kc x MR matrix B. That would give NR x MR matrix AB as product. – Öö Tiib Apr 06 '22 at 11:59
  • @ÖöTiib You mean the code snippet? There A is MR x kc (stored col wise), B is kc x NR (stored row wise) and AB is MR X NR (stored col wise). – Michael Lehn Apr 07 '22 at 12:19
  • @MichaelLehn Thanks, it was confusing. Each variant from 8 algorithms of row- or column major data layout has different optimal order of iterating. So choosing one with different layout of A and B felt unexpected. C++ has canonically row-major and Fortran column-major order of data layout. – Öö Tiib Apr 08 '22 at 07:43
  • @ÖöTiib Please understand that the algorithm computes the matrix product independent of the storage order of A, B or C. It's explained in more detail in the paper and also on the linked sites. The basic idea: it computes the product block-wise. Each block of A and B should fit into cache. Before it multiplies a block of A with a block of B it copies these blocks into a buffer. This copying is called packing. Hereby the storage order for the buffers can freely be chosen. And it gets chosen in favour of the actual cache hierarchy and AVX register sizes. – Michael Lehn Apr 08 '22 at 11:17
  • @ÖöTiib Also, it is a common misunderstanding that matrices have to be stored row-major in C/C++. Every high performance implementation of BLAS (defined through a Fortran reference implementation) is written in C or C++ or assembly. – Michael Lehn Apr 08 '22 at 11:24
30

So first of all BLAS is just an interface of about 50 functions. There are many competing implementations of the interface.

Firstly I will mention things that are largely unrelated:

  • Fortran vs C, makes no difference
  • Advanced matrix algorithms such as Strassen: implementations don't use them as they don't help in practice

Most implementations break each operation into small-dimension matrix or vector operations in the more or less obvious way. For example a large 1000x1000 matrix multiplication may be broken into a sequence of 50x50 matrix multiplications.

These fixed-size small-dimension operations (called kernels) are hardcoded in CPU-specific assembly code using several CPU features of their target:

  • SIMD-style instructions
  • Instruction Level Parallelism
  • Cache-awareness

Furthermore these kernels can be executed in parallel with respect to each other using multiple threads (CPU cores), in the typical map-reduce design pattern.
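
As an illustration of the SIMD point above, a tiny kernel written with SSE intrinsics might look like this (a sketch of the idea, not ATLAS's actual kernel): it updates a 4x4 block of C held entirely in registers, assuming the 4-row slice of A is stored column-wise and the 4-column slice of B row-wise, as packed buffers usually are.

    #include <xmmintrin.h>   // SSE intrinsics

    /* C (4x4, column-major, leading dimension ldc) += A * B, where the A block
       (4 x k) is packed column-wise and the B block (k x 4) is packed row-wise.
       The 4x4 accumulator stays in four SSE registers for the whole k loop. */
    static void kernel_4x4(int k, const float *A, const float *B, float *C, int ldc)
    {
        __m128 c0 = _mm_loadu_ps(&C[0*ldc]);
        __m128 c1 = _mm_loadu_ps(&C[1*ldc]);
        __m128 c2 = _mm_loadu_ps(&C[2*ldc]);
        __m128 c3 = _mm_loadu_ps(&C[3*ldc]);
        for (int p = 0; p < k; ++p) {
            __m128 a = _mm_loadu_ps(&A[4*p]);    // column p of the packed A block
            c0 = _mm_add_ps(c0, _mm_mul_ps(a, _mm_set1_ps(B[4*p + 0])));
            c1 = _mm_add_ps(c1, _mm_mul_ps(a, _mm_set1_ps(B[4*p + 1])));
            c2 = _mm_add_ps(c2, _mm_mul_ps(a, _mm_set1_ps(B[4*p + 2])));
            c3 = _mm_add_ps(c3, _mm_mul_ps(a, _mm_set1_ps(B[4*p + 3])));
        }
        _mm_storeu_ps(&C[0*ldc], c0);
        _mm_storeu_ps(&C[1*ldc], c1);
        _mm_storeu_ps(&C[2*ldc], c2);
        _mm_storeu_ps(&C[3*ldc], c3);
    }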

Take a look at ATLAS which is the most commonly used open source BLAS implementation. It has many different competing kernels, and during the ATLAS library build process it runs a competition among them (some are even parameterized, so the same kernel can have different settings). It tries different configurations and then selects the best for the particular target system.

(Tip: That is why, if you are using ATLAS, you are better off building and tuning the library by hand for your particular machine than using a prebuilt one.)

Andrew Tomazos
  • 1
    ATLAS is no longer the most commonly used open source BLAS implementation. It has been surpassed by OpenBLAS (a fork of the GotoBLAS) and BLIS (a refactoring of the GotoBLAS). – Robert van de Geijn Feb 09 '19 at 23:38
  • 1
    @ulaff.net: That may be. This was written 6 years ago. I think the fastest BLAS implementation currently (on Intel of course) is Intel MKL, but it isn't open source. – Andrew Tomazos Feb 11 '19 at 14:34
  • I agree with the spirit of your answer. Here's an academic link, but it shows that some have used Strassen-type/Winograd-type algorithms to have real world speedups https://www.ics.uci.edu/~paolo/FastMM/FMM-Reference/reference.html – creanion Nov 13 '21 at 13:35
17

First, there are more efficient algorithms for matrix multiplication than the one you're using.

Second, your CPU can do much more than one instruction at a time.

Your CPU executes 3-4 instructions per cycle, and if the SIMD units are used, each instruction processes 4 floats or 2 doubles. (of course this figure isn't accurate either, as the CPU can typically only process one SIMD instruction per cycle)
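
For a rough sense of scale (an estimate, not a measurement of your particular chip): if one 4-wide SSE multiply and one 4-wide SSE add can issue each cycle, a single 2.6 GHz core peaks at roughly 2.6e9 * 8 ≈ 20 single-precision GFLOP/s. The ten 1000x1000 products need about 2e10 floating-point operations (1e9 multiplies plus 1e9 adds per trial), so ~1.3 seconds is within reach of one well-tuned core, before even counting multiple cores.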

Third, your code is far from optimal:

  • You're using raw pointers, which means that the compiler has to assume they may alias. There are compiler-specific keywords or flags you can use to tell the compiler that they don't alias (see the sketch after this list). Alternatively, you could use types other than raw pointers, which take care of the problem.
  • You're thrashing the cache by performing a naive traversal of each row/column of the input matrices. You can use blocking to perform as much work as possible on a smaller block of the matrix, which fits in the CPU cache, before moving on to the next block.
  • For purely numerical tasks, Fortran is pretty much unbeatable, and C++ takes a lot of coaxing to get up to a similar speed. It can be done, and there are a few libraries demonstrating it (typically using expression templates), but it's not trivial, and it doesn't just happen.
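
A minimal sketch of the aliasing point from the first bullet, using the common `__restrict__` extension (GCC/Clang; MSVC spells it `__restrict`). ISO C++ has no `restrict` keyword, so this is compiler-specific; the same headers as the original snippet (`<cstring>`, `<stdexcept>`) are assumed. It is the original function with only the qualifiers added:

    template<class ValT>
    void mmult(const ValT* __restrict__ A, int ADim1, int ADim2,
               const ValT* __restrict__ B, int BDim1, int BDim2,
               ValT* __restrict__ C)
    {
        if ( ADim2!=BDim1 )
            throw std::runtime_error("Error sizes off");

        memset((void*)C,0,sizeof(ValT)*ADim1*BDim2);
        for ( int cc2=0 ; cc2<BDim2 ; ++cc2 )
            for ( int cc1=0 ; cc1<ADim2 ; ++cc1 )
                for ( int cr1=0 ; cr1<ADim1 ; ++cr1 )
                    // The qualifiers promise that writes through C never modify
                    // what A and B point to, so values loaded from A and B can
                    // stay in registers across iterations of the inner loop.
                    C[cc2*ADim2+cr1] += A[cc1*ADim1+cr1]*B[cc2*BDim1+cc1];
    }
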
jalf
  • Thanks, I've added restrict-correct code as per Justicle's suggestion, didn't see much improvement, I like the blockwise idea. Out of curiosity, without knowing the CPU's cache size how would one write optimal code? – DeusAduro Aug 21 '09 at 06:25
  • 2
    You don't. To get optimal code, you need to know the CPU's cache size. Of course the downside to this is that you're effectively hardcoding your code for best performance on *one* family of CPU's. – jalf Aug 21 '09 at 10:00
  • 2
    At least the inner loop here avoids strided loads. It looks like this is written for one matrix already being transposed. That's why it's "only" one order of magnitude slower than BLAS! But yeah, it's still thrashing because of the lack of cache-blocking. Are you sure Fortran would help much? I think all you'd gain here is that `restrict` (no aliasing) is the default, unlike in C / C++. (And unfortunately ISO C++ doesn't have a `restrict` keyword, so you have to use `__restrict__` on compilers that provide it as an extension). – Peter Cordes Oct 03 '17 at 21:35
12

I don't know specifically about the BLAS implementation, but there are more efficient algorithms for matrix multiplication that have better than O(n^3) complexity. A well known one is the Strassen algorithm.

softveda
  • 12
    The Strassen Algorithm is not used in numerics for two reasons: 1) It is not stable. 2) You save some computations, but that comes at the price that you can no longer exploit cache hierarchies well. In practice you even lose performance. – Michael Lehn Oct 12 '13 at 23:59
  • 7
    For the practical implementation of Strassen Algorithm tightly built upon BLAS library source code, there is a recent publication: "[Strassen Algorithm Reloaded](http://dl.acm.org/citation.cfm?id=3014983)" in SC16, which achieves higher performance than BLAS, even for the problem size 1000x1000. – Jianyu Huang Mar 04 '17 at 01:45
4

Most arguments for the second question -- assembler, splitting into blocks etc. (but not the sub-N^3 algorithms; they are really overkill here) -- play a role. But the slowness of your algorithm is caused essentially by matrix size and the unfortunate arrangement of the three nested loops. Your matrices are so large that they do not fit at once in cache memory. You can rearrange the loops such that as much as possible is done on a row while it is in cache, dramatically reducing cache refreshes (BTW, splitting into small blocks has an analogous effect, best if loops over the blocks are arranged similarly).

A model implementation for square matrices follows. On my computer its time consumption was about 1:10 compared to the standard implementation (like yours). In other words: never program a matrix multiplication along the "row times column" scheme that we learned in school. After having rearranged the loops, more improvements are obtained by unrolling loops, assembler code etc.

    /* Loop order: i over rows of C, j over the row of A / rows of B, and the
       innermost k runs along a row of B and the corresponding row of C, so
       memory is accessed contiguously: c[i][k] += a[i][j] * b[j][k]. */
    void vector(int m, double ** a, double ** b, double ** c) {
      int i, j, k;
      for (i=0; i<m; i++) {
        double * ci = c[i];
        for (k=0; k<m; k++) ci[k] = 0.;   /* clear row i of C */
        for (j=0; j<m; j++) {
          double aij = a[i][j];           /* hoist a[i][j] out of the inner loop */
          double * bj = b[j];
          for (k=0; k<m; k++)  ci[k] += aij*bj[k];
        }
      }
    }

One more remark: this implementation was even faster on my computer than replacing it all with the BLAS routine cblas_dgemm (try it on your computer!). But calling dgemm_ of the Fortran library directly is much faster (1:4). I think this routine is in fact not Fortran but assembler code (I do not know what is in the library, I don't have the sources). It is totally unclear to me why cblas_dgemm is not as fast, since to my knowledge it is merely a wrapper for dgemm_.
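
For reference, calling the Fortran routine directly from C++ looks roughly like this (a sketch; the trailing underscore and pass-by-pointer convention are the usual Fortran name mangling with gfortran/ifort on Linux, but this is platform dependent):

    // Declaration of the Fortran BLAS routine; no header, name mangling assumed.
    extern "C" void dgemm_(const char *transa, const char *transb,
                           const int *m, const int *n, const int *k,
                           const double *alpha, const double *a, const int *lda,
                           const double *b, const int *ldb,
                           const double *beta, double *c, const int *ldc);

    // C = A * B for column-major m x m double-precision matrices
    void fortran_mmult(const double *A, const double *B, double *C, int m)
    {
        const double one = 1.0, zero = 0.0;
        dgemm_("N", "N", &m, &m, &m, &one, A, &m, B, &m, &zero, C, &m);
    }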

3

This is a realistic speed-up. For an example of what can be done with SIMD assembler over C++ code, see some example iPhone matrix functions - these were over 8x faster than the C version, and aren't even "optimized" assembly - there's no pipe-lining yet and there are unnecessary stack operations.

Also your code is not "restrict correct" - how does the compiler know that when it modifies C, it isn't modifying A and B?

Justicle
  • Sure if you called the function like mmult(A..., A..., A); you would certainly not get the expected result. Again though I wasn't trying to beat/re-implement BLAS, just seeing how fast it really is, so error checking wasn't in mind, just the basic functionality. – DeusAduro Aug 20 '09 at 00:15
  • 3
    Sorry, to be clear, what I'm saying is that if you put "restrict" on your pointers, you'd get much faster code. This is because every time you modify C, the compiler doesn't have to reload A and B - dramatically speeding up the inner loop. If you don't believe me, check the disassembly. – Justicle Aug 20 '09 at 00:34
  • @DeusAduro: This isn't error checking - it's possible that the compiler is unable to optimize the accesses to the B[] array in the inner loop because it might not be able to figure out that the A and C pointers never alias the B array. If there were aliasing it would be possible for the value in the B array to change while the inner loop is executing. Hoisting the access to the B[] value out of the inner loop and putting it in a local variable might enable the compiler to avoid continual accesses to B[]. – Michael Burr Aug 20 '09 at 00:35
  • Thanks for the clarification I guess I really didn't understand what you are saying, I'll see if I can add that and update the timings. – DeusAduro Aug 20 '09 at 00:58
  • 1
    Hmmm, so I tried first using the '__restrict' keyword in VS 2008, applied to A, B, and C. This showed no change in the result. However moving the access to B, from the innermost loop to the loop outside improved the time by ~10%. – DeusAduro Aug 20 '09 at 01:13
  • 1
    Sorry, I'm not sure about VC, but with GCC you need to enable `-fstrict-aliasing`. There's also better explanation of "restrict" here: http://cellperformance.beyond3d.com/articles/2006/05/demystifying-the-restrict-keyword.html – Justicle Aug 20 '09 at 01:46
3

With respect to the original code in the MM multiply, memory references for most operations are the main cause of bad performance. Main memory runs 100-1000 times slower than cache.

Most of the speed-up comes from employing loop optimization techniques on this triple-loop function. Two main loop optimization techniques are used: unrolling and blocking. With respect to unrolling, we unroll the two outermost loops and block them for data reuse in cache. Outer loop unrolling helps optimize data access temporally by reducing the number of memory references to the same data at different times during the entire operation. Blocking the loop index at a specific number helps with retaining the data in cache. You can choose to optimize for the L2 cache or the L3 cache.

https://en.wikipedia.org/wiki/Loop_nest_optimization
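
A minimal sketch of the blocking (tiling) part, assuming square n x n row-major matrices and a hypothetical tile size BS chosen so that three BS x BS tiles fit in the targeted cache level:

    /* Cache-blocked C = A * B for row-major n x n matrices.  The three outer
       loops walk over BS x BS tiles; the three inner loops multiply one pair
       of tiles while they are resident in cache. */
    enum { BS = 64 };   /* hypothetical tile size; tune for the target cache */

    static int imin(int a, int b) { return a < b ? a : b; }

    void mmult_blocked(const double *A, const double *B, double *C, int n)
    {
        for (int i = 0; i < n * n; ++i) C[i] = 0.0;

        for (int ii = 0; ii < n; ii += BS)
            for (int kk = 0; kk < n; kk += BS)
                for (int jj = 0; jj < n; jj += BS)
                    for (int i = ii; i < imin(ii + BS, n); ++i)
                        for (int k = kk; k < imin(kk + BS, n); ++k) {
                            double aik = A[i * n + k];
                            for (int j = jj; j < imin(jj + BS, n); ++j)
                                C[i * n + j] += aik * B[k * n + j];
                        }
    }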

Pari Rajaram
-25

For many reasons.

First, Fortran compilers are highly optimized, and the language allows them to be so. C and C++ are very loose in terms of array handling (e.g. the case of pointers referring to the same memory area). This means that the compiler cannot know in advance what to do, and is forced to create generic code. In Fortran, your cases are more streamlined, and the compiler has better control of what happens, allowing it to optimize more (e.g. using registers).

Another thing is that Fortran stores data column-wise, while C stores data row-wise. I haven't checked your code, but be careful of how you perform the product. In C you must scan row-wise: this way you scan your array along contiguous memory, reducing the cache misses. Cache misses are the first source of inefficiency.

Third, it depends on the BLAS implementation you are using. Some implementations might be written in assembler, and optimized for the specific processor you are using. The netlib version is written in Fortran 77.

Also, you are doing a lot of operations, most of them repeated and redundant. All those multiplications to obtain the index are detrimental to performance. I don't really know how this is done in BLAS, but there are a lot of tricks to prevent expensive operations.

For example, you could rework your code this way:

    template<class ValT>
    void mmult(const ValT* A, int ADim1, int ADim2, const ValT* B, int BDim1, int BDim2, ValT* C)
    {
        if ( ADim2!=BDim1 ) throw std::runtime_error("Error sizes off");

        memset((void*)C,0,sizeof(ValT)*ADim1*BDim2);
        int cc2,cc1,cr1, a1,a2,a3;
        for ( cc2=0 ; cc2<BDim2 ; ++cc2 ) {
            a1 = cc2*ADim2;
            a3 = cc2*BDim1;
            for ( cc1=0 ; cc1<ADim2 ; ++cc1 ) {
                a2 = cc1*ADim1;
                ValT b = B[a3+cc1];
                for ( cr1=0 ; cr1<ADim1 ; ++cr1 ) {
                    C[a1+cr1] += A[a2+cr1]*b;
                }
            }
        }
    }

Try it, I am sure you will save something.

On your #1 question, the reason is that matrix multiplication scales as O(n^3) if you use a trivial algorithm. There are algorithms that scale much better.

Stefano Borini
  • In terms of the multiplications to find indices, I tried actually taking them out and only doing them the minimum, ie. C, B the mult can be moved below the first for loop. This showed no improvement, I believe my compiler optimized it somehow... I think register optimization would definitely improve my code, however any response to #1? – DeusAduro Aug 19 '09 at 23:42
  • did you try to crank the optimization flag to -O3 as well ? – Stefano Borini Aug 19 '09 at 23:49
  • Ya its got all optimizations on. Thanks for the link, although that one is 'never implemented' but the Strassen algorithm is interesting. – DeusAduro Aug 20 '09 at 00:09
  • about row vs columnwise, matrix multiplication is going to thrash the cache no matter what, since the left hand side gets traversed row-wise, and the right hand column-wise. – jalf Aug 20 '09 at 12:03
  • @Jalf: yes. in the case of matrix multiplication this happens. I wonder, however, if pretransposing the matrix could be a good strategy. – Stefano Borini Aug 20 '09 at 21:36
  • 39
    This answer is completely wrong, sorry. BLAS implementations are not written in Fortran. The performance-critical code is written in assembly, and the most common ones these days are written in C above that. Also BLAS specifies the row/column order as part of the interface, and implementations can handle any combination. – Andrew Tomazos Jun 26 '13 at 14:14
  • 10
    Yes, this answer *is* completely wrong. Unfortunately it is full of common nonsense, e.g. the claim that BLAS is faster because of Fortran. Having 20(!) positive ratings is a bad thing. Now this nonsense even spreads further because of the popularity of Stackoverflow! – Michael Lehn Sep 26 '13 at 11:32
  • @Michael Fair enough, I'll delete it as soon as you read this message. Let me know. Or maybe I should edit it ? – Stefano Borini Sep 26 '13 at 22:53
  • 1
    @Andrew: Are you really sure it's written in asm? I downloaded blas from netlib, and it's pure fortran. Maybe we are talking about two different things. While there might be implementations of the blas interface that are written in asm, the netlib version is in fortran, and I think it's the one used by the large majority of unoptimized vendors. – Stefano Borini Oct 04 '13 at 15:12
  • 1
    The ignorance amazes me: BLAS as everyone uses it is Fortran 77. GSL does provide CBLAS, but I have yet to see an assembly implementation of BLAS. Could @AndrewTomazos provide such a reference? – Kyle Kanos Oct 04 '13 at 15:39
  • 13
    I think you're confusing the unoptimized reference implementation with production implementations. The reference implementation is just for specifying the interface and behaviour of the library, and was written in Fortran for historical reasons. It isn't for production use. In production people use optimized implementations that exhibit the same behaviour as the reference implementation. I've studied the internals of ATLAS (which backs Octave - Linux "MATLAB") which I can confirm first hand is written in C/ASM internally. The commercial implementations are almost certainly as well. – Andrew Tomazos Oct 05 '13 at 13:53
  • 5
    @KyleKanos: Yes, here is the source of ATLAS: http://sourceforge.net/projects/math-atlas/files/Stable/3.10.1/ As far as I know it is the most commonly used open source portable BLAS implementation. It is written in C/ASM. High performance CPU manufacturers like Intel also provide BLAS implementations especially optimized for their chips. I guarantee the low-level parts of Intel's library are written in (duuh) x86 assembly, and I'm pretty sure the mid-level parts would be written in C or C++. – Andrew Tomazos Oct 05 '13 at 13:57
  • @AndrewTomazos: Every cluster I have ever seen uses netlib's implementation of BLAS (which means Fortran). In fact, I guarantee that **every** cluster uses that particular BLAS because the Top 500 is determined based on the results of LINPACK which uses (duuh) netlib's BLAS. Perhaps *some* people are using ATLAS, but **the** BLAS used is the Fortran-based one. – Kyle Kanos Oct 05 '13 at 14:10
  • 10
    @KyleKanos: You're confused. Netlib BLAS is the reference implementation. The reference implementation is much slower than optimized implementations (see [performance comparison](http://www.lr.tudelft.nl/fileadmin/Faculteit/LR/Organisatie/Afdelingen_en_Leerstoelen/Afdeling_RS/Physical_and_Space_Geodesy/Publications/Tobias_Wittwer/doc/blas_lapack.pdf)). When someone says they are using netlib BLAS on a cluster, it doesn't mean they are actually using the netlib reference implementation. That would be just silly. It just means they are using a lib with the same interface as the netlib blas. – Andrew Tomazos Oct 05 '13 at 15:18
  • 2
    @KyleKanos: Andrew is right. Think of BLAS as a standardized *interface* for a bunch of mathematical core functions. For this many different implementations exist. Some of these implementations are highly optimized (ATLAS, GotoBLAS) and are written in C/ASM. And there also exists the *reference* implementation from Netlib. Nobody uses it on a cluster. Also check out Wikipedia for a more complete list of implementations: http://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms – Michael Lehn Oct 12 '13 at 23:03
  • 1
    @KyleKanos: Also checkout this slide on the Netlib's LAPACK site itself: http://www.netlib.org/lapack/lug/node11.html From there also follow the link to the BLAS FAQ: http://www.netlib.org/blas/faq.html – Michael Lehn Oct 12 '13 at 23:21
  • 2
    Compilers can easily strength-reduce the multiplies to adds at compile time. Manually hoisting `ValT b = B[a3+cc1];` is useful if you don't have `__restrict__` in C++ (like C99's `restrict` keyword) to let the compiler know that the assignments into `C[]` won't change the value of `B[a3+cc1];`. Most of the rest of this answer is nonsense, though. (`restrict` is the default in Fortran, though: you pass arrays, not pointers, so the compiler knows they don't overlap. With `restrict`, naive C should be about as good as naive Fortran.) – Peter Cordes Oct 03 '17 at 21:29