
I'd like to convert Octave to use cuBLAS for matrix multiplication. This video seems to indicate it's as simple as typing 28 characters:

Using CUDA Library to Accelerate Applications

In practice it's a bit more complex than this. Does anyone know what additional work must be done to make the modifications made in this video compile?

UPDATE

Here's the method I'm trying

in dMatrix.cc add

#include <cublas.h>

in dMatrix.cc change all occurrences of (preserving case)

dgemm

to

cublas_dgemm
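The case-preserving substitution can be scripted; here is a hypothetical GNU sed sketch (the `F77_XFCN` line is an assumed example of what the source contains — the same two expressions can be applied in place with `sed -i.bak` after backing the file up):

```shell
# Hypothetical sketch: rewrite dgemm -> cublas_dgemm and
# DGEMM -> CUBLAS_DGEMM, preserving case (GNU sed word boundaries)
echo 'F77_XFCN (dgemm, DGEMM, (transa, transb, ...));' \
  | sed -e 's/\bdgemm\b/cublas_dgemm/g' \
        -e 's/\bDGEMM\b/CUBLAS_DGEMM/g'
# prints: F77_XFCN (cublas_dgemm, CUBLAS_DGEMM, (transa, transb, ...));
```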

in my build terminal set

export CC=nvcc
export CFLAGS="-lcublas -lcudart"
export CPPFLAGS="-I/usr/local/cuda/include"
export LDFLAGS="-L/usr/local/cuda/lib64"

the error I receive is:

libtool: link: g++ -I/usr/include/freetype2 -Wall -W -Wshadow -Wold-style-cast 
-Wformat -Wpointer-arith -Wwrite-strings -Wcast-align -Wcast-qual -g -O2
-o .libs/octave octave-main.o  -L/usr/local/cuda/lib64 
../libgui/.libs/liboctgui.so ../libinterp/.libs/liboctinterp.so 
../liboctave/.libs/liboctave.so -lutil -lm -lpthread -Wl,-rpath
-Wl,/usr/local/lib/octave/3.7.5

../liboctave/.libs/liboctave.so: undefined reference to `cublas_dgemm_'
Waylon Flinn
  • Actually (sorry) there's more to it than this. The presenter is using the [fortran bindings](http://docs.nvidia.com/cuda/cublas/index.html#topic_11) Specifically we need to use the "thunking wrapper" version of these bindings. Apologies for the misinformation in my answer, there is more to it than that. I will try to update my answer with a new set of instructions. The modifications you made to the source file are still valid, but the modifications to the build sequence will be different. – Robert Crovella Jul 05 '13 at 23:35
  • @RobertCrovella Thank you very much. I'm really looking forward to having this. – Waylon Flinn Jul 06 '13 at 01:20
  • I've made another (hopefully the last) edit to my answer. I tested this and it seems to work for me. – Robert Crovella Jul 06 '13 at 16:48

3 Answers


EDIT2: The method described in this video requires the use of the fortran "thunking library" bindings for cublas. These steps worked for me:

  1. Download octave 3.6.3 from here:

    wget ftp://ftp.gnu.org/gnu/octave/octave-3.6.3.tar.gz
    
  2. extract all files from the archive:

    tar -xzvf octave-3.6.3.tar.gz
    
  3. change into the octave directory just created:

    cd octave-3.6.3
    
  4. make a directory for your "thunking cublas library"

    mkdir mycublas
    
  5. change into that directory

    cd mycublas
    
  6. build the "thunking cublas library"

    g++ -c -fPIC -I/usr/local/cuda/include -I/usr/local/cuda/src -DCUBLAS_GFORTRAN -o fortran_thunking.o /usr/local/cuda/src/fortran_thunking.c
    ar rvs libmycublas.a fortran_thunking.o
    
  7. switch back to the main build directory

    cd ..
    
  8. run octave's configure with additional options:

    ./configure --disable-docs LDFLAGS="-L/usr/local/cuda/lib64 -lcublas -lcudart -L/home/user2/octave/octave-3.6.3/mycublas -lmycublas"
    

    Note that in the above command line, you will need to change the directory in the second -L switch to match the path of the mycublas directory you created in step 4

  9. Now edit octave-3.6.3/liboctave/dMatrix.cc according to the instructions given in the video. It should be sufficient to replace every instance of dgemm with cublas_dgemm and every instance of DGEMM with CUBLAS_DGEMM. In the octave 3.6.3 version I used, there were 3 such instances of each (lower case and upper case).

  10. Now you can build octave:

    make
    

    (make sure you are in the octave-3.6.3 directory)

At this point, for me, Octave built successfully. I did not pursue make install although I assume that would work. I simply ran octave using the ./run-octave script in the octave-3.6.3 directory.

The above steps assume a proper and standard CUDA 5.0 install. I will try to respond to CUDA-specific questions or issues, but there are any number of problems that may arise with a general Octave install on your platform. I'm not an octave expert and I won't be able to respond to those. I used CentOS 6.2 for this test.

This method, as indicated, involves modification of the C++ source files of Octave.

Another method was covered in some detail in the S3527 session at the GTC 2013 GPU Tech Conference. This session was actually a hands-on laboratory exercise. Unfortunately the materials on that are not conveniently available. However the method there did not involve any modification of GNU Octave source, but instead uses the LD_PRELOAD capability of Linux to intercept the BLAS library calls and re-direct (the appropriate ones) to the cublas library.

A newer, better method (using the NVBLAS intercept library) is discussed in this blog article

Robert Crovella
  • Thank you very much. I'll give this another shot and report back. – Waylon Flinn Jul 05 '13 at 17:04
  • @Joshua Fleming leave your answer edits as comments rather than editing my answer please. My answer is tested and correct for the OS (and gcc) and Octave versions I specified. If you want to add information for another version of gcc please do it in the comments. – Robert Crovella Dec 30 '13 at 23:54
  • This still works great for Octave version 4.0.0, although the file to edit is now located at "octave-4.0.0/liboctave/array/dMatrix.cc". Distro specific notes: Gentoo puts everything cuda related into /opt/cuda, so adjust paths accordingly. Debian 8 puts cuda headers & libraries in standard /usr locations, so no paths necessary; however "fortran_thunking.c" is hidden in "/usr/share/doc/nvidia-cuda-doc/examples/". – jlh Oct 09 '15 at 17:50
  • Another important comment: The above modification of "dMatrix.cc" only handles matrices of type 'double'. However, it's the 'float' type that GPUs really excel at, therefore you really want to also edit "fMatrix.cc" and replace "sgemm" with "cublas_sgemm". This is of course only useful if your octave code uses the 'float' type. You may also want to look at "ssyrk"/"dsyrk" and other BLAS calls that may or may not be faster on the GPU. – jlh Oct 10 '15 at 13:35
  • I tried compiling Octave 4.0.0 with these modifications, but it fails to link. At the final step I just get `../liboctave/.libs/liboctave.so: undefined reference to 'cublasCsyr2k'` as well as a lot of other packages that the linker can't find references to. I followed the steps exactly as shown in this post and also changed my LDFLAGS to include cuda, but I'm still having issues with linking. Any ideas? – John Corbett Dec 07 '15 at 01:49
  • 1. Did you install CUDA in the default location (i.e. is libcublas.so located at /usr/local/cuda/lib64 ?) 2. I'm not sure what you mean by "other packages". `cublasCsyr2k` is not a package, it's an entry point to a library. If all the "other packages" you are referring to are actually `cublas` calls, then they probably all point to the same problem. Did you modify any files in the Octave distribution besides `dMatrix.cc` ? jlh seems to think it works with Octave 4.0.0 – Robert Crovella Dec 07 '15 at 01:58
  • I did install CUDA in the default location, although I am using CUDA-7.5, so I don't know if that would affect anything. And yes, that's the only file I modified. Everything compiles fine, it just doesn't want to link against cublas for some reason. – John Corbett Dec 07 '15 at 02:16
  • It might be a linking order issue. It might be cumbersome trying to sort it out in the comments here. – Robert Crovella Dec 07 '15 at 02:19
  • I understand. Thanks for replying anyway. How might one go about debugging this issue then? – John Corbett Dec 07 '15 at 02:32
  • Well, you could post a new question, giving a complete description of the problem, as well as your setup, software versions, changes you made, etc. If you want to focus on the linking-order theory, you could try to inspect the actual link command line being issued by make that is failing, and see if you can re-issue that command line only, manually, and [modify the location within that command line](http://stackoverflow.com/questions/45135/why-does-the-order-in-which-libraries-are-linked-sometimes-cause-errors-in-gcc) of the `-L...` and `-l...` switches, and see if it makes a difference. – Robert Crovella Dec 07 '15 at 02:37
  • For example, one possible test on the linking order theory would be to change this: `./configure --disable-docs LDFLAGS="-L/usr/local/cuda/lib64 -lcublas -lcudart -L/home/user2/octave/octave-3.6.3/mycublas -lmycublas"` to this: `./configure --disable-docs LDFLAGS="-L/home/user2/octave/octave-3.6.3/mycublas -lmycublas -L/usr/local/cuda/lib64 -lcublas -lcudart"` in step 8 above. (Make modifications to the paths specified by the `-L` switches if necessary). – Robert Crovella Dec 07 '15 at 02:50

I was able to produce a compiled executable using the information supplied. It's a horrible hack, but it works.

The process looks like this:

First produce an object file for fortran_thunking.c

sudo /usr/local/cuda-5.0/bin/nvcc -O3 -c -DCUBLAS_GFORTRAN fortran_thunking.c

Then copy that object file to the src subdirectory in octave

cp /usr/local/cuda-5.0/src/fortran_thunking.o ./octave/src

Run make. The build will fail on the final link step. Change to the src directory.

cd src

Then execute the failing final line with the addition of ./fortran_thunking.o -lcudart -lcublas just after octave-main.o. This produces the following command

g++ -I/usr/include/freetype2 -Wall -W -Wshadow -Wold-style-cast -Wformat
 -Wpointer-arith -Wwrite-strings -Wcast-align -Wcast-qual
 -I/usr/local/cuda/include -o .libs/octave octave-main.o 
./fortran_thunking.o -lcudart -lcublas  -L/usr/local/cuda/lib64 
../libgui/.libs/liboctgui.so ../libinterp/.libs/liboctinterp.so 
../liboctave/.libs/liboctave.so -lutil -lm -lpthread -Wl,-rpath 
-Wl,/usr/local/lib/octave/3.7.5

An octave binary will be created in the src/.libs directory. This is your octave executable.

Waylon Flinn
  • Good job! I'm still building octave. If you don't mind, I'm going to gather together what you've done plus what's been discussed already into a single set of coherent instructions in my answer. But I upvoted yours. – Robert Crovella Jul 06 '13 at 05:24

In recent versions of CUDA you don't have to recompile anything, at least as I found on Debian. First, create a config file for NVBLAS (a cuBLAS wrapper). It won't work without it, at all.

tee nvblas.conf <<EOF
NVBLAS_CPU_BLAS_LIB $(dpkg -L libopenblas-base | grep libblas)
NVBLAS_GPU_LIST ALL
EOF

Then use Octave as you usually would, running it with:

LD_PRELOAD=libnvblas.so octave

NVBLAS will do what it can on a GPU while relaying everything else to OpenBLAS.


Worth noting that you may not enjoy all the benefits of GPU computing, depending on the CPU and GPU used: OpenBLAS is quite fast on current multi-core processors. So fast that the time spent copying data to the GPU, working on it, and copying the result back can come close to the time needed to do the job right on the CPU. Check for yourself. GPUs are usually more energy efficient, though.
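The break-even point can be estimated: an N×N dgemm performs 2N³ flops but moves only 3N²·8 bytes over the bus, so arithmetic intensity grows linearly with N. A back-of-envelope sketch (all throughput figures below are assumptions for illustration — substitute your own hardware's numbers):

```shell
# Hypothetical throughput figures: 100 GFLOPS CPU, 1000 GFLOPS GPU,
# 6 GB/s effective PCIe bandwidth. Times printed in milliseconds.
for N in 128 1024 8192; do
  awk -v N="$N" 'BEGIN {
    flops = 2 * N^3                        # dgemm flop count
    bytes = 3 * N^2 * 8                    # A, B in; C out (doubles)
    cpu   = flops / 100e9
    gpu   = flops / 1000e9 + bytes / 6e9   # compute + transfer
    printf "N=%5d  cpu=%10.3f ms  gpu+copy=%10.3f ms\n", N, cpu*1000, gpu*1000
  }'
done
```

With these made-up numbers the CPU wins at N=128, where the PCIe transfer alone costs more than doing the multiply locally, while the GPU wins handily at N=8192 once the copy cost is amortized over O(N³) work.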

sanmai
  • Confirmed this still works after all these years on octave 5.2.0. The hard part is making sure the proprietary nvidia device drivers are installed and running correctly. The `libnvblas` is installed correctly and `nvblas.conf` is configured correctly. Then it's as advertised: 1.3 teraflops on a 1080TI GPU instead of 3 gigaflops on a single cpu core. Prepare for a metal-melting 95C heat coming off your GPU if your matrix is big, or you don't monitor GPU temperature to throttle back on overheating. – Eric Leschinski Jan 03 '22 at 03:12
  • I found that Octave 5.2 needs to work with a pthreads-compiled version of libopenblas. Ubuntu now has a package that includes an OpenMP version as well, but that didn't seem to work. – Brian Borchers Dec 24 '22 at 19:18