
I'm doing some calculations and analyzing the strengths and weaknesses of different BLAS implementations. However, I have come across a problem.

I'm testing cuBLAS. Doing linear algebra on the GPU would seem like a good idea, but there is one problem.

The cuBLAS implementation uses column-major storage, and since this is not what I need in the end, I'm curious: is there a way to make BLAS do a matrix transpose?

talonmies
Martin Kristiansen
    "....and since this is not what in the end.....". Missing a couple of words? – talonmies Oct 16 '11 at 14:39
    Starting from CUDA 5.0, cuBLAS has `cublas<t>geam`, which is a very efficient routine for performing matrix transposition. For a full code implementing this solution and comparing its performance with matrix transposition using Thrust, see [What is the most efficient way to transpose a matrix in CUDA?](http://stackoverflow.com/questions/15458552/what-is-the-most-efficient-way-to-transpose-a-matrix-in-cuda/21803459#21803459). – Vitality Feb 15 '14 at 20:56

1 Answer


BLAS doesn't have a built-in matrix transpose routine. The CUDA SDK includes a matrix transpose example with a paper which discusses optimal strategies for performing a transpose. Your best strategy is probably to pass your row-major inputs to CUBLAS using the transposed-input versions of the calls, perform the intermediate calculations in column-major order, and then perform a final transpose afterwards using the SDK transpose kernel.


Edited to add that CUBLAS version 5 added a transpose routine, `geam`, which can perform matrix transposition in GPU memory and should be regarded as optimal for whatever architecture you are using.
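A minimal sketch of a `geam`-based transpose, assuming single precision (the actual routine is `cublasSgeam` from `cublas_v2.h`); the function name `transpose_on_gpu` is hypothetical, and error checking is omitted for brevity:

```cpp
#include <cublas_v2.h>

// Hypothetical helper: transpose a column-major m x n matrix d_A
// (device pointer) into the n x m matrix d_AT, entirely in GPU
// memory, using cublasSgeam, which computes
//   C = alpha * op(A) + beta * op(B)
void transpose_on_gpu(cublasHandle_t handle,
                      const float *d_A, float *d_AT, int m, int n)
{
    const float alpha = 1.0f;
    const float beta  = 0.0f;  // B is not referenced when beta == 0

    cublasSgeam(handle,
                CUBLAS_OP_T, CUBLAS_OP_N,
                n, m,             // dimensions of the result C = A^T
                &alpha, d_A, m,   // op(A) = A^T; A has leading dim m
                &beta, d_AT, n,   // B unused since beta == 0
                d_AT, n);         // C is n x m with leading dim n
}
```

Since `beta` is zero, the B operand is never read, so the call amounts to `C = A^T` computed entirely on the device.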

talonmies