The problem with this example is, that it only works for square matrices. If matrices are not square you cannot calculate A^t*B^t
because of dimension missmatch (assuming the dimensions were right for A*B
).
I don't have a working cuBLAS-installation at hand, so it is kind of a shot in the dark, but I would be really surprised if cuBLAS would work differently than the usual BLAS. BLAS expects matrices to be in column-major-order (aka Fortran-order) but can also be used for matrices in row-major-order (aka C-order).
In my opinion, which might be completely wrong, gemm_v2
is not the usual/best way to handle multiplication of two C-order matrices, for example because if one multiplies two C-order matrices one would also have a C-order matrix as answer.
The trick to calculate product of two C-order-matrices with help of gemm
would work as follows:
Even if it is probably known to you, I would like first to elaborate on the row-major-order (c-memory-layout) and the column-major-order (fortran-memory-layout), in order to flesh out my answer.
So if we have a 2x3
(i.e. 2 rows and 3 columns) matrix A
, and store it in some continuous memory we get:
row-major-order(A) = A11, A12, A13, A21, A22, A23
col-major-order(A) = A11, A21, A12, A22, A13, A33
That means if we get a continuous memory, which represents a matrix in the row-major-order, and interpret it as a matrix in column-major-order we will get quite a different matrix!
However, if we take a look at the transposed matrix A^t
we can easily see:
row-major-order(A) = col-major-order(A^t)
col-major-order(A) = row-major-order(A^t)
That means, if we would like to get the matrix C
in row-major-order as result, the blas-routine should write the transposed matrix C
in column-major-order (after all this we cannot change) into this very memory. However, C^t=(AB)^t=B^t*A^t
and B^t
an A^t
are the original matrices reinterpreted in column-major-order.
Now, let A
be a n x k
-matrix and B
a k x m
-matrix, the call of gemm routine should be as follows:
gemm('N', 'N', m, n, k, 1.0, B, m, A, k, 0.0, C, m)
Please note:
- We don't have to transpose matrices
A
and B
, because it is handled by reinterpreting C-order as Fortran-order.
- We have to swap the places of matrices
A
and B
in order to get C^t
in Fortran-order as result.
- The resulting matrix
C
is in C-order (by reinterpreting it from Fortran-order to C-order we get rid of ^t
).