
For personal interest and fun, I'm coding a geometry library using SSE (up to SSE4.1).

I've spent the last 12 hours trying to understand a performance issue when dealing with matrices stored in row-major versus column-major order.

I know DirectX/OpenGL matrices are stored row major, so it would be better for me to keep my matrices in row-major order too, so that no conversion is needed when storing/loading matrices to/from the GPU/shaders.

But after profiling, I get faster results with column major.

To transform a point by a transform matrix in row major, it's P' = P * M; in column major, it's P' = M * P. So in column major it's simply 4 dot products, i.e. only 4 SSE4.1 instructions (_mm_dp_ps), whereas in row major I must do those 4 dot products on the transposed matrix.
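Concretely, the column-major path is just this (a sketch: it assumes the matrix stores four contiguous __m128 vectors, called m_vec[i] here only for illustration, and uses the dot-product operator| shown below):

inline Vec4 operator*(const Mat4& m, const Vec4& v) // column major: P' = M * P
{
    // One _mm_dp_ps per output component, each against a contiguous __m128.
    return Vec4( m.m_vec[0] | v
               , m.m_vec[1] | v
               , m.m_vec[2] | v
               , m.m_vec[3] | v );
}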

Performance results on 10M vectors:

(30/05/2014@08:48:10) Log : [5] ( Vec.Mul.Matrix ) = 76.216653 ms ( row major transform )

(30/05/2014@08:48:10) Log : [6] ( Matrix.Mul.Vec ) = 61.554892 ms ( column major transform )

I tried several ways to do the Vec * Matrix operation, with and without _MM_TRANSPOSE4_PS, and the fastest way I found is this:

mssFloat Vec4::operator|(const Vec4& v) const //-- Dot Product
{
    // Mask 0xFF: multiply all four lanes, broadcast the sum to all lanes.
    // Note: .m128_f32[0] is MSVC-specific; _mm_cvtss_f32() is the portable
    // way to extract the low float.
    return _mm_dp_ps(m_val, v.m_val, 0xFF).m128_f32[0];
}
inline Vec4 operator*(const Vec4& vec, const Mat4& m)
{
    // Gather each column of m from scalars (via the Vec4 constructor),
    // then dot it with vec.
    return Vec4(    Vec4( m[0][0],m[1][0],m[2][0],m[3][0]) | vec
        ,   Vec4( m[0][1],m[1][1],m[2][1],m[3][1]) | vec
        ,   Vec4( m[0][2],m[1][2],m[2][2],m[3][2]) | vec
        ,   Vec4( m[0][3],m[1][3],m[2][3],m[3][3]) | vec
                );
}

My class Vec4 simply wraps a __m128 m_val; in optimized C++ builds the vector construction is done entirely in SSE registers.
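For context, the wrapper looks roughly like this (a sketch; m_val is the real member name, the rest is assumed):

#include <smmintrin.h> // SSE4.1, for _mm_dp_ps

class Vec4
{
public:
    // _mm_set_ps takes arguments high-to-low, hence (w, z, y, x).
    Vec4(float x, float y, float z, float w) : m_val(_mm_set_ps(w, z, y, x)) {}
    explicit Vec4(__m128 v) : m_val(v) {}

    __m128 m_val; // the entire state: one SSE register / 16 bytes
};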

My first guess is that this multiplication is not optimal. I'm new to SSE, so I'm a bit puzzled about how to optimize this. My intuition tells me to use shuffle instructions, but I'd like to understand why that would be faster. Would loading 4 shuffled __m128 values be faster than constructing them via __m128 m_val = _mm_set_ps(w, z, y, x);?
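For reference, the shuffle version I'm imagining looks like this (a sketch: it assumes Mat4 exposes its four contiguous __m128 rows as m.row[i], and that Vec4 has a __m128 constructor):

inline Vec4 mulShuffle(const Vec4& v, const Mat4& m)
{
    // Broadcast each component of v across a whole register...
    __m128 x = _mm_shuffle_ps(v.m_val, v.m_val, _MM_SHUFFLE(0,0,0,0));
    __m128 y = _mm_shuffle_ps(v.m_val, v.m_val, _MM_SHUFFLE(1,1,1,1));
    __m128 z = _mm_shuffle_ps(v.m_val, v.m_val, _MM_SHUFFLE(2,2,2,2));
    __m128 w = _mm_shuffle_ps(v.m_val, v.m_val, _MM_SHUFFLE(3,3,3,3));
    // ...then accumulate x*row0 + y*row1 + z*row2 + w*row3 vertically,
    // with no horizontal instructions at all.
    __m128 r = _mm_add_ps(_mm_add_ps(_mm_mul_ps(x, m.row[0]),
                                     _mm_mul_ps(y, m.row[1])),
                          _mm_add_ps(_mm_mul_ps(z, m.row[2]),
                                     _mm_mul_ps(w, m.row[3])));
    return Vec4(r);
}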

From https://software.intel.com/sites/landingpage/IntrinsicsGuide/ I couldn't find performance info on _mm_set_ps.

EDIT: I double-checked the profiling method; each test is run in the same manner, so there are no memory-cache differences between them. To avoid cache effects I operate on a randomized big vector array, with the same seed for each test, and I run only one test per execution, to avoid performance gains from a warm memory cache.
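For reference, the test harness is essentially this (a simplified sketch; the sizes, seed and distribution shown here are illustrative, not my exact code):

#include <chrono>
#include <random>
#include <vector>

double benchTransform(const Mat4& m)
{
    std::mt19937 rng(12345);                       // same seed every run
    std::uniform_real_distribution<float> dist(-100.f, 100.f);

    std::vector<Vec4> pts;                         // randomized big array
    pts.reserve(10000000);
    for (size_t i = 0; i < 10000000; ++i)
        pts.emplace_back(dist(rng), dist(rng), dist(rng), 1.0f);

    auto t0 = std::chrono::high_resolution_clock::now();
    for (auto& p : pts)
        p = p * m;                                 // the operation under test
    auto t1 = std::chrono::high_resolution_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}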

  • OpenGL uses column major format for matrices. Well, some entry points allow you to specify if your matrices are row major or column major, but traditionally matrices have been column major. – Reto Koradi May 30 '14 at 07:46
  • Hi, thanks for replying. I'm shocked :) I read on Stack Overflow (I must find the link again) that OpenGL and DirectX 11 matrices are row major, meaning the translation in a transform matrix is stored in the last 4 elements. http://www.opengl.org/archives/resources/faq/technical/transformations.htm "For programming purposes, OpenGL matrices are 16-value arrays with base vectors laid out contiguously in memory." – SebKun May 30 '14 at 07:55
  • Translation is in elements 13, 14, and 15. Which means that it's column major, because they would be in elements 3, 7, and 11 in a row major matrix. Well, this is if you multiply your vectors by writing them as column vectors to the right of the matrix. If you write them as row vectors to the left of the matrix, things switch around. So depending on if you're a row or column vector type of person, the answer changes. ;) – Reto Koradi May 30 '14 at 08:10
  • Well, sorry for the confusion, I wanted to talk about memory storage. I thought row major was when a row = 1 axis, so I guess I inverted everything. I'm no mathematician, so in memory there's no difference between a column vector and a row vector ;) (still, I know the difference ;) ). I'll edit my question to be extra clear; I won't talk about row/column major but about memory storage. Would that make my problem clearer? – SebKun May 30 '14 at 08:14
  • Row major means that if you read the elements in the order they are stored, the elements of the first row are the first 4 elements, the elements of the second row are the next 4, etc. Column major means that the 4 elements of the first column come first, then the 4 elements of the second column, etc. – Reto Koradi May 30 '14 at 08:23
  • You are confusing me ;) I have a transform matrix, represented by 3 axes and a position: axisX/Y/Z and Pos. The elements of an axis are X, Y, Z, W, so in memory X, Y, Z, W are contiguous. Now if my matrix elements AxisX, AxisY, AxisZ and Pos are contiguous in memory, is that row major or column major? – SebKun May 30 '14 at 08:28
  • @MagicFr You should ask only a single question per post, that helps to keep focus. I'll edit your question and remove the second part, but you can copy it (from the question revision history) and post it as a new question. Is that ok with you? If not, you can revert the edit... I left the "EDIT" because I'm not sure to which question it belongs. – sashoalm May 30 '14 at 08:46
  • Hmm, I thought both were related, that's why I put both in the same post. – SebKun May 30 '14 at 08:58

2 Answers

4

Don't use _mm_dp_ps for matrix multiplication! I already explained this in great detail at Efficient 4x4 matrix vector multiplication with SSE: horizontal add and dot product - what's the point? (incidentally, that was my first post on SO). In short, dpps is a multi-uop horizontal operation with high latency; a chain of vertical multiplies and adds fed by broadcasts pipelines much better.

You don't need anything more than SSE to do this efficiently (not even SSE2). Use this code to do 4x4 matrix multiplication efficiently. If the matrices are stored in row-major order, call gemm4x4_SSE(A,B,C). If the matrices are stored in column-major order, call gemm4x4_SSE(B,A,C).

void gemm4x4_SSE(float *A, float *B, float *C) {
    __m128 row[4], sum[4];
    // Load the four rows of B (assumes 16-byte-aligned pointers).
    for(int i=0; i<4; i++) row[i] = _mm_load_ps(&B[i*4]);
    for(int i=0; i<4; i++) {
        sum[i] = _mm_setzero_ps();
        for(int j=0; j<4; j++) {
            // Broadcast A(i,j) and accumulate A(i,j) * B.row(j) vertically.
            sum[i] = _mm_add_ps(_mm_mul_ps(_mm_set1_ps(A[i*4+j]), row[j]), sum[i]);
        }
    }
    // Store the four rows of C = A*B.
    for(int i=0; i<4; i++) _mm_store_ps(&C[i*4], sum[i]);
}
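For example, assuming C++11 alignas for the 16-byte alignment that _mm_load_ps/_mm_store_ps require:

// Usage sketch: C = A*B for row-major 4x4 matrices.
alignas(16) float A[16] = { /* ... */ };
alignas(16) float B[16] = { /* ... */ };
alignas(16) float C[16];
gemm4x4_SSE(A, B, C);   // row major; use gemm4x4_SSE(B, A, C) for column major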
  • Well, on my Intel i3 I experimented with 3 versions of the dot product, with SSE1, 2 and 4, and they all give the same performance result. – SebKun May 30 '14 at 09:36
  • @MagicFr. Compile in release mode. State how you measure your performance. – Z boson May 30 '14 at 10:03
  • Of course I compile in release with full optimization! I even take care of the memory cache so the comparison is not biased. I tested several dot products and several vec * mat methods, using shuffle, hadd and mm_dp_ps; on my Intel i3 (which natively supports SSE4) I get the same performance result for the 3 methods. – SebKun May 30 '14 at 10:06
  • @MagicFr, when I did my tests the vertical operations using broadcasts beat the horizontal operations `hadd` and `mm_dp_ps`. In any case, what is your question now? The ordering of the matrices should not matter. In my answer you just swap the inputs of the matrices depending on how the matrices are ordered in memory (gemm_SSE(A,B,C) for row-major order and gemm_SSE(B,A,C) for column-major order). – Z boson May 30 '14 at 11:24
  • Thanks for answering. First, my initial question was about matrix·vector, not matrix·matrix, but I did look at the link you posted and tried the 3 methods, and on my CPU (Intel i3, Sandy Bridge) the 3 methods give the same performance. So I don't see why I would not use mm_dp_ps. Also, my initial question was mostly about why initializing a row-memory matrix was slower than initializing a column-memory matrix. But I guess my question is not clear, as everybody answers me something different ;) – SebKun May 30 '14 at 12:46
0

We actually profiled 3x4 matrix pseudo-multiplication (as if it were a 4x4 affine matrix) and found that with both SSE3 and AVX there was very little difference (<10%) between the column-major and row-major layouts, as long as both are optimized to the limit.

The benchmark: https://github.com/buildaworldnet/IrrlichtBAW/blob/master/examples_tests/19.SIMDmatrixMultiplication/main.cpp

  • What hardware did you test on? Also, if `avx::doJob` compiles to the asm instructions exactly matching the intrinsics, I think you're using too many ALU shuffle uops, especially for an Intel CPU. I think you'd be better off doing `_mm256_broadcast_ps()` of 16 bytes straight from memory, not 256-bit load / `vextractf128` / `vinsertf128 same,same,1`. (`_mm256_broadcast_ps` takes a memory operand because there is no reg-reg version of the [asm `vbroadcast` instruction](http://felixcloutier.com/x86/VBROADCAST.html), so if MSVC doesn't optimize that, it might store to a tmp and reload!) – Peter Cordes Jun 16 '18 at 21:30
  • All mainstream Intel/AMD CPUs with AVX can do 2 loads per clock from L1d cache, so loading the same data twice (with 256-bit vectors vs. broadcast loads) is not a bad thing if it saves uops. You want to bottleneck on the FMA unit throughput, not front-end or shuffles. But with 2 FMAs per clock, half your total uops have to go to the FMA units to make that happen. – Peter Cordes Jun 16 '18 at 21:32
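For illustration, the broadcast-load pattern Peter Cordes describes might look like this (a sketch; the function and names are hypothetical, not taken from the benchmark; it transforms two row vectors, one per 128-bit lane, by a row-major 4x4 matrix):

#include <immintrin.h>

void transform2_AVX(const float* mat, const float* in2, float* out2)
{
    __m256 v = _mm256_loadu_ps(in2);   // [ v0.xyzw | v1.xyzw ]

    // Broadcast each vector's components within its own lane.
    __m256 x = _mm256_shuffle_ps(v, v, _MM_SHUFFLE(0,0,0,0));
    __m256 y = _mm256_shuffle_ps(v, v, _MM_SHUFFLE(1,1,1,1));
    __m256 z = _mm256_shuffle_ps(v, v, _MM_SHUFFLE(2,2,2,2));
    __m256 w = _mm256_shuffle_ps(v, v, _MM_SHUFFLE(3,3,3,3));

    // One 16-byte broadcast load (vbroadcastf128) per matrix row puts the
    // row in both lanes, with no vextractf128/vinsertf128 shuffle uops.
    __m256 r0 = _mm256_broadcast_ps((const __m128*)&mat[0]);
    __m256 r1 = _mm256_broadcast_ps((const __m128*)&mat[4]);
    __m256 r2 = _mm256_broadcast_ps((const __m128*)&mat[8]);
    __m256 r3 = _mm256_broadcast_ps((const __m128*)&mat[12]);

    // Vertical accumulation: out = x*row0 + y*row1 + z*row2 + w*row3.
    __m256 r = _mm256_add_ps(_mm256_add_ps(_mm256_mul_ps(x, r0),
                                           _mm256_mul_ps(y, r1)),
                             _mm256_add_ps(_mm256_mul_ps(z, r2),
                                           _mm256_mul_ps(w, r3)));
    _mm256_storeu_ps(out2, r);
}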