Efficient SSE NxN matrix multiplication

Question

I'm trying to implement SSE version of large matrix by matrix multiplication. I'm looking for an efficient algorithm based on SIMD implementations.

My desired method looks like:

A(n x m) * B(m x k) = C(n x k)

And all matrices are considered to be 16-byte aligned float array.

I searched the net and found some articles describing 8x8 multiplication and even smaller. I really need it as efficient as possible and I don't want to use Eigen library or similar libraries. (Only SSE3 to be more specific).

So I'd appreciate if anyone can help me find some articles or resources on how to start implementing this.

Using an established library is probably by far the easiest and bang/buck-effective way. What's the problem with that? — , Feb 01 '14 at 21:41
I finally want to implement it on device. I'd prefer it to be independent of third party libs — Hamid Bazargani, Feb 01 '14 at 21:50
What is the data type ? `float`, `double` or something else ? (This makes a big difference when it comes to SIMD implementations.) — Paul R, Feb 02 '14 at 08:18
How about look at the source of a BLAS implementation and see what they do? These codes have lots of optimizations. ATLAS (http://math-atlas.sourceforge.net/) is BSD-licensed. — Adam, Feb 02 '14 at 08:39
Possible duplicate of [Efficient 4x4 matrix multiplication (C vs assembly)](http://stackoverflow.com/questions/18499971/efficient-4x4-matrix-multiplication-c-vs-assembly) — Ciro Santilli OurBigBook.com, Mar 14 '17 at 13:46

Marat Dukhan · Accepted Answer · 2017-02-23T05:24:37.760

9

The main challenge in implementation of arbitrary-size matrix-matrix multiplication is not the use of SIMD, but reuse of cached data. The paper Anatomy of High-Performance Matrix Multiplication by Goto and Van de Geijn is a must-read if you want to implement cache-friendly matrix-matrix multiplication, and it also discusses the choice of kernels to be SIMD-friendly. After reading this paper expect to achieve 50% of machine peak on matrix-matrix multiplication after two weeks of efforts.

However, if the purpose of this work is not pure learning, I strongly recommend to use a highly optimized library. On x86 your best options are OpenBLAS (BSD-licensed, supports dynamic CPU dispatching), BLIS (BSD-licensed, easily portable to new processors), and Intel MKL (commercial, supports dynamic CPU dispatching on Intel processors). For performance reasons it is better to avoid ATLAS unless you target a very exotic architecture which is not supported by other libraries.

edited Feb 23 '17 at 05:24

answered Feb 02 '14 at 09:14

Marat Dukhan

11,993
4
27
41

1

Thanks for a very useful article you provided – Hamid Bazargani Feb 02 '14 at 21:36
Good paper Marat. It took me much longer than two weeks to break 50% (using multiple threads as well) but I did not have that paper. I now get over 70% with AVX on Ivy Bridge and 55% with FMA3 on Haswell (still better than 100% compared to Ivy Bridge). – Z boson Feb 04 '14 at 13:01
2

@Zboson I also recommend to look at the papers on BLIS, especially if you are interested in multi-core optimizations for linear algebra. They are here: https://code.google.com/p/blis/#Citations – Marat Dukhan Feb 04 '14 at 17:14

Efficient SSE NxN matrix multiplication

1 Answers1

Linked

Related