3

I'm trying to implement SSE version of large matrix by matrix multiplication. I'm looking for an efficient algorithm based on SIMD implementations.

My desired method looks like:

A(n x m) * B(m x k) = C(n x k)

And all matrices are considered to be 16-byte aligned float array.

I searched the net and found some articles describing 8x8 multiplication and even smaller. I really need it as efficient as possible and I don't want to use Eigen library or similar libraries. (Only SSE3 to be more specific).

So I'd appreciate if anyone can help me find some articles or resources on how to start implementing this.

Hamid Bazargani
  • 847
  • 1
  • 15
  • 30

1 Answers1

9

The main challenge in implementation of arbitrary-size matrix-matrix multiplication is not the use of SIMD, but reuse of cached data. The paper Anatomy of High-Performance Matrix Multiplication by Goto and Van de Geijn is a must-read if you want to implement cache-friendly matrix-matrix multiplication, and it also discusses the choice of kernels to be SIMD-friendly. After reading this paper expect to achieve 50% of machine peak on matrix-matrix multiplication after two weeks of efforts.

However, if the purpose of this work is not pure learning, I strongly recommend to use a highly optimized library. On x86 your best options are OpenBLAS (BSD-licensed, supports dynamic CPU dispatching), BLIS (BSD-licensed, easily portable to new processors), and Intel MKL (commercial, supports dynamic CPU dispatching on Intel processors). For performance reasons it is better to avoid ATLAS unless you target a very exotic architecture which is not supported by other libraries.

Marat Dukhan
  • 11,993
  • 4
  • 27
  • 41
  • 1
    Thanks for a very useful article you provided – Hamid Bazargani Feb 02 '14 at 21:36
  • Good paper Marat. It took me much longer than two weeks to break 50% (using multiple threads as well) but I did not have that paper. I now get over 70% with AVX on Ivy Bridge and 55% with FMA3 on Haswell (still better than 100% compared to Ivy Bridge). – Z boson Feb 04 '14 at 13:01
  • 2
    @Zboson I also recommend to look at the papers on BLIS, especially if you are interested in multi-core optimizations for linear algebra. They are here: https://code.google.com/p/blis/#Citations – Marat Dukhan Feb 04 '14 at 17:14