
I'm trying to use the cuBLAS functions in Anaconda's Numba package and having an issue. I need the input matrices to be in C-order. The output can be in Fortran order.

I can run the example script provided with the package, here. The script has two functions, gemm_v1 and gemm_v2. In gemm_v1, the user has to create the input matrices in Fortran order. In gemm_v2, they can be passed to the cuda implementation of GEMM and transposed on the device. I can get these examples to work with square matrices. However, I can't figure out how to get gemm_v2 to work with non-square input matrices. Is there a way to work with C-order input matrices that are non-square?

Note:
Ideally, both the input and output matrices would stay on the device after the call to GEMM to be used in other calculations ( this is part of an iterative method ).

user1554752
  • In the call to BLAS, you specify gemm(transa, transb, m, n, k, alpha, A:r, B:r, beta, C:w), where transa and transb are operations applied to the matrices. In the gemm_v1 example this is the identity operation; in the gemm_v2 example it is a transpose. Then you specify m, n and k: the number of rows of op(A) (m), the number of columns of op(B) (n), and the number of columns of op(A) / rows of op(B) (k). The example hard-codes these for square matrices, so this is where to change it. Be sure that the shapes of your matrices match the declaration. – Uvar Aug 07 '17 at 11:51

1 Answer


The problem with this example is that it only works for square matrices. If the matrices are not square, you cannot compute A^t*B^t because of a dimension mismatch (assuming the dimensions were right for A*B).

I don't have a working cuBLAS installation at hand, so this is something of a shot in the dark, but I would be really surprised if cuBLAS worked differently than the usual BLAS. BLAS expects matrices in column-major order (a.k.a. Fortran order), but it can also be used for matrices in row-major order (a.k.a. C order).

In my opinion (which might be completely wrong), gemm_v2 is not the usual/best way to multiply two C-order matrices, not least because multiplying two C-order matrices should also yield a C-order matrix as the answer.

The trick for computing the product of two C-order matrices with gemm works as follows:

Even if it is probably known to you, I would first like to elaborate on row-major order (C memory layout) and column-major order (Fortran memory layout), in order to flesh out my answer.

So if we have a 2x3 matrix A (i.e. 2 rows and 3 columns) and store it in some contiguous memory, we get:

row-major-order(A) = A11, A12, A13, A21, A22, A23
col-major-order(A) = A11, A21, A12, A22, A13, A23
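
As a quick sanity check (a minimal numpy sketch, not part of the original example), flattening the same array in both orders reproduces exactly these two sequences:

```python
import numpy as np

# the 2x3 matrix A from above, entries named after their (row, column) index
A = np.array([[11, 12, 13],
              [21, 22, 23]])

print(A.ravel(order='C'))  # row-major:    [11 12 13 21 22 23]
print(A.ravel(order='F'))  # column-major: [11 21 12 22 13 23]
```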

That means if we take a contiguous block of memory which represents a matrix in row-major order, and interpret it as a matrix in column-major order, we get quite a different matrix!

However, if we take a look at the transposed matrix A^t we can easily see:

row-major-order(A) = col-major-order(A^t)
col-major-order(A) = row-major-order(A^t)
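
These two identities are easy to verify with numpy (a small illustrative check; `ravel` simply reads the elements out of memory in the requested order):

```python
import numpy as np

A = np.arange(6).reshape(2, 3)  # any 2x3 matrix will do

# row-major bytes of A are exactly the column-major bytes of A^t, and vice versa
assert (A.ravel(order='C') == A.T.ravel(order='F')).all()
assert (A.ravel(order='F') == A.T.ravel(order='C')).all()
```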

That means: if we would like to get the matrix C=A*B in row-major order as the result, the BLAS routine has to write the transposed matrix C^t in column-major order (this we cannot change) into this very memory. However, C^t=(A*B)^t=B^t*A^t, and B^t and A^t are just the original matrices reinterpreted in column-major order.

Now, let A be an n x k matrix and B a k x m matrix; the call to the gemm routine should then be as follows:

gemm('N', 'N', m, n, k, 1.0, B, m, A, k, 0.0, C, m)

Please note:

  1. We don't have to transpose matrices A and B explicitly, because reinterpreting C order as Fortran order already transposes them.
  2. We have to swap the places of A and B in order to get C^t in Fortran order as the result.
  3. The resulting matrix C is in C order (reinterpreting it from Fortran order to C order gets rid of the ^t).
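
For reference, here is the whole trick spelled out using scipy's CPU BLAS wrapper `scipy.linalg.blas.dgemm` as a stand-in for cuBLAS (an illustrative sketch: the cuBLAS call takes explicit m/n/k and leading dimensions, while scipy's wrapper infers them from the arrays):

```python
import numpy as np
from scipy.linalg.blas import dgemm  # CPU BLAS stand-in for cuBLAS

n, k, m = 2, 3, 4
A = np.arange(n * k, dtype=np.float64).reshape(n, k)  # C-order n x k
B = np.arange(k * m, dtype=np.float64).reshape(k, m)  # C-order k x m

# A.T and B.T are Fortran-order *views* of the same memory (no copy),
# i.e. exactly the "reinterpretation" described above. Swapping the
# operands makes BLAS compute B^t * A^t = (A*B)^t in Fortran order ...
Ct = dgemm(1.0, B.T, A.T)

# ... and reinterpreting that result as C-order removes the ^t again:
C = Ct.T
assert np.allclose(C, A @ B)
```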
– ead