1

I am using a commercial simulation software on Linux that does intensive matrix manipulation. The software uses Intel MKL by default, but it allows me to replace it with a custom BLAS/LAPACK library. This library must be a shared object (.so) library and must export both BLAS and LAPACK standard routines. The software requires the standard Fortran interface for all of them.

To verify that I can use a custom library, I compiled ATLAS and linked LAPACK (from netlib) inside it. The software was able to use my compiled ATLAS version without any problems.

Now, I want to make the software use cuBLAS in order to speed up the simulation. I ran into the problem that cuBLAS doesn't export the standard BLAS function names (they carry a cublas prefix). Moreover, the cuBLAS library doesn't include any LAPACK routines. I used readelf -a to check the exported functions.

On the other hand, I tried to use MAGMA to solve this problem. I succeeded in compiling and linking it against ATLAS, LAPACK and cuBLAS, but it still doesn't export the correct function names and doesn't include LAPACK in the final shared object. I am not sure whether this is how it is supposed to be or whether I did something wrong during the build process.

I have also found CULA, but I am not sure if this will solve the problem or not.

Has anybody tried to get cuBLAS/LAPACK (or a proper wrapper) linked into a single .so exporting the standard Fortran interface with the correct function names? I believe it is conceptually possible, but I don't know how to do it!

Bichoy
  • 351
  • 2
  • 9
  • See http://stackoverflow.com/q/11576073/681865 for a discussion about why what you want to do isn't a very good idea. – talonmies Sep 16 '13 at 06:03
  • 1
    For a limited example (replacing a single BLAS function -- Dgemm) of using the cublas thunking interface in an application (Octave) that uses the BLAS library, see [here](http://stackoverflow.com/questions/17493270/converting-octave-to-use-cublas/17493698#17493698). In this particular case, for large matrix multiplies, the overhead/cost of transferring the data to the GPU is more than offset by the reduced computation time. – Robert Crovella Sep 17 '13 at 17:02

1 Answer

2

Updated

As indicated by @talonmies, CUDA provides a Fortran thunking wrapper interface:

http://docs.nvidia.com/cuda/cublas/index.html#appendix-b-cublas-fortran-bindings

You should be able to run your application with it, but you probably will not see any performance improvement because of the memory allocation/copy overhead described below.

Old

It may not be easy. cuBLAS and the other CUDA library interfaces assume all the data are already in device memory, whereas in your case the data are still in CPU RAM at the time of the call.

You may have to write your own wrapper to deal with it, along these lines:

void dgemm_(...) {                       /* Fortran-mangled name the software expects */
  copy_data_from_cpu_ram_to_gpu_mem();   /* cudaMalloc + cudaMemcpy */
  cublasDgemm(...);
  copy_data_from_gpu_mem_to_cpu_ram();   /* cudaMemcpy back + cudaFree */
}

On the other hand, you have probably noticed that every single BLAS call then requires two data copies. This can introduce huge overhead and slow down the overall performance, unless most of your calls are BLAS 3 operations.

kangshiyin
  • 9,681
  • 1
  • 17
  • 29
  • CUBLAS already comes with a built-in set of Fortran bindings and a "thunking" wrapper interface - see http://docs.nvidia.com/cuda/cublas/index.html#appendix-b-cublas-fortran-bindings – talonmies Sep 16 '13 at 06:08
  • Great. Then the only problem is mem alloc/copy, which probably will make this kind of wrapper slower than MKL. – kangshiyin Sep 16 '13 at 06:21
  • the thunking interface includes memory allocation and transfers - and the documentation contains a warning about the negative performance effects that entails – talonmies Sep 16 '13 at 07:00