vector * matrix product efficiency issue

Question

Just as Z boson recommended, I am using a column-major matrix format in order to avoid having to use the dot product. I don't see a feasible way to avoid it when multiplying a vector with a matrix, though. The matrix multiplication trick requires efficient extraction of rows (or columns, if we transpose the product). To multiply a vector by a matrix, we therefore transpose:

(b * A)^T = A^T * b^T

A is a matrix, b a row vector, which, after being transposed, becomes a column vector. Its rows are just single scalars and the vector * matrix product implementation becomes an inefficient implementation of dot products of columns of (non-transposed) matrix A with b. Is there a way to avoid performing these dot products? The only way I see that could do it, would involve row extraction, which is inefficient with the column-major matrix format.

This question is more suited in http://programmers.stackexchange.com — AndersK, Sep 12 '14 at 10:10
@Claptrap: I'm not sure why programmers would be a better fit... — Oliver Charlesworth, Sep 12 '14 at 10:27
Couldn't you just interchange the loop nesting? (i.e. maintain N partial sums, rather than calculating each complete output value sequentially). — Oliver Charlesworth, Sep 12 '14 at 10:29
@OliCharlesworth Just don't say mathematicians. They'd probably just roll their eyes and say something incomprehensible to us poor laymen/handymen. — user1095108, Sep 12 '14 at 10:30

score 1 · Accepted Answer · answered Sep 12 '14 at 17:33

This can be understood from original post on this (my first on SO) efficient-4x4-matrix-vector-multiplication-with-sse-horizontal-add-and-dot-prod . The rest of the discussion applies to 4x4 matrices.

Here are two methods to do do matrix times vector (v = Mu where v and u are column vectors)

method 1) v1 = dot(row1, u), v2 = dot(row2, u), v3 = dot(row3, u), v4 = dot(row4, u)
method 2) v = u1*col1 + u2*col2 + u3*col3 + u4*col4.

The first method is more familiar from math class while the second is more efficient for a SIMD computer. The second method uses vectorized math (like numpy) e.g.

u1*col1 = (u1x*col1x, u1y*col1y, u1z*col1z, u1w*col1w).

Now let's look at vector times matrix (v = uM where v and u are row vectors)

method 1) v1 = dot(col1, u), v2 = dot(col2, u), v3 = dot(col3, u), v4 = dot(col4, u)
method 2) v = u1*row1 + u2*row2 + u3*row3 + u4*row4.

Now the roles of columns and rows have swapped but method 2 is still the efficient method to use on a SIMD computer.

To do matrix times vector efficiently on a SIMD computer the matrix should be stored in column-major order. To do vector times matrix efficient on a SIMD computer the matrix should be stored in row-major order.

As far as I understand OpenGL uses column major ordering and does matrix times vector and DirectX uses row-major ordering and does vector times matrix. If you have three matrix transformations that you do in order M1 first then M2 then M3 with matrix times vector you write it as

v = M3*M2*M1*u //u and v are column vectors - OpenGL form

With vector times matrix you write

v = u*M1*M2*M3 //u and v are row vectors - DirectX form

Neither form is better than the other in terms of efficiency. It's just a question of notation (and causing confusion which is useful when you have competition).

It's important to note that for matrix*matrix row-major versus column-major storage is irrelevant.

If you want to know why the vertical SIMD instructions are faster than the horizontal ones that's a separate question which should be asked but in short the horizontal ones really act in serial rather than parallel and are broken up into several micro-ops (which is why ironically dppd is faster than dpps).

in short, the trick can't be applied, if the matrix is in column-major order, without transposing it, extracting rows or doing the dot product? — user1095108, Sep 12 '14 at 21:45
Doing more than 1 dot products at a time (say, 4 at a time), i.e. what you do in your matrix multiplication code, so as to avoid dot product instructions. — user1095108, Sep 12 '14 at 22:00
@user1095108, yes, that's correct. If you can't change the ordering of the matrix then you either must do a transpose or take the dot products. In this case method 1 is faster. Are you talking about 4x4 matrices? What's your goal? — Z boson, Sep 12 '14 at 22:04
You wrote "You almost never want to do a dot product of two vectors with SSE. Instead you do do four dot products at once" and I thought this rule could be applied every time effectively (apparently not). I tried to apply your rule to vector-matrix (b * A) multiplications with matrices 2x2 to 4x4 (I did notice, it could be applied w/o problems to col*matrix and matrix*matrix multiplications), but couldn't. But vector*matrix products also need to be supported in a generic matrix library, even if they are rarely calculated. — user1095108, Sep 14 '14 at 11:05
@user1095108, you can use "my rule" just take the transpose first. But you're welcome to try the horizontal instructions and see if they make a difference for this special case. — Z boson, Sep 14 '14 at 11:58
I am aware of this (as I wrote in my previous comment), but I believe, doing the dot products might be better/faster (just as apparently you do as well). — user1095108, Sep 14 '14 at 13:15
@user1095108, well let me know what you find. When I tested this I wrote "Otherwise use method 1 with horizontal addition. Don't use the dot production instruction for matrix multiplication." in the comments so for this special case the `hadd` instruction was faster but `dpps` was still slower than taking the transpose and using "the trick". — Z boson, Sep 14 '14 at 15:25

vector * matrix product efficiency issue

1 Answers1

Linked