
I seem to be missing something when attempting to use NVBLAS with the Intel Fortran compilers.

I appear to be linking and using nvblas.conf correctly as I see feedback from the initialization of NVBLAS at runtime. However, NVBLAS does not seem to be intercepting the calls to DGEMM as only the CPU implementation is executed. This is despite using:

NVBLAS_CPU_RATIO_CGEMM 0.0 

in nvblas.conf (or removing it entirely).

If I disable access to the CPU BLAS implementation by removing:

NVBLAS_CPU_BLAS_LIB  /ccc/home/wilkinson/EMPIRE-2064/src/dynamiclibs/libmkl_rt.so

the program crashes at runtime, as I would expect.
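For reference, a minimal nvblas.conf has the following shape (the CPU BLAS path below is illustrative, not my actual setup; all keywords are from the NVBLAS documentation):

```
# CPU fallback BLAS (required)
NVBLAS_CPU_BLAS_LIB  /path/to/libmkl_rt.so
# Which GPUs NVBLAS may use
NVBLAS_GPU_LIST ALL
# Where [NVBLAS] messages are written
NVBLAS_LOGFILE nvblas.log
# Pin host memory for faster transfers
NVBLAS_AUTOPIN_MEM_ENABLED
```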

The compiler options I am currently using are shown below. I have also tried manually linking MKL, but with the same results.

# Compiler options
FFLAGS=-O3 -axAVX,SSE4.2 -msse3 -align array32byte -fpe1 -fno-alias -openmp -mkl=parallel -heap-arrays 32

# Linker options
LDFLAGS= -L/ccc/home/wilkinson/EMPIRE-2064/src/dynamiclibs -lnvblas

# List of libraries used
LIBS= -L/ccc/home/wilkinson/EMPIRE-2064/src/dynamiclibs -lnvblas
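One way to check whether DGEMM is actually being resolved from libnvblas at runtime (assuming a glibc system; the executable name here is a placeholder) is the dynamic loader's binding trace:

```shell
# Ask glibc's dynamic loader to report each symbol binding as it
# happens, then filter for the BLAS routine of interest. If dgemm_
# binds to the CPU BLAS library rather than libnvblas.so, the
# interception is being bypassed by the link order.
LD_DEBUG=bindings ./myprog 2>&1 | grep -i 'dgemm'
```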

An example of a call to DGEMM is as follows:

call dgemm('N','T',nCols2,nCols1,nOcc(s),2.0d0/dble(nSpins),C2,nRowsP,C(:,:,s),nRowsP,0.0d0,P(i21,i11,s),nOrbsP)

Unfortunately I am currently limited to using the Intel compilers, but that restriction will be lifted shortly (at which point I will use CUDA Fortran to optimize data movement).

talonmies
  • Have you tried to use NVBLAS_LOGFILE to see whether the library is emitting any errors you aren't getting to see to stdout/stderr? – talonmies Jan 20 '16 at 09:39
  • Yes, all I see in the logfile/stdout is the following: [NVBLAS] Using devices :0 [NVBLAS] Config parsed – Karl Wilkinson Jan 20 '16 at 10:35
  • OK. Are you really certain that the GPU isn't being used? what happens if you run your compiled app using `nvprof`? do you see any kernels being launched? Also, you are changing NVBLAS_CPU_RATIO_CGEMM, but using DGEMM. Is that a typo or mistake? – talonmies Jan 20 '16 at 10:40
  • @talonmies: Yes, I've double checked with nvprof and the dgemm kernels do not appear to be running. I corrected CGEMM > DGEMM but that doesn't change matters either (I understand it should default to 100% GPU anyway). – Karl Wilkinson Jan 20 '16 at 12:23
  • On a related note (and perhaps a solution to the whole issue), is it necessary to use interfaces when you are using either NVBLAS or the thunking approach? The documentation and examples I have found would suggest not, but it's the first time I have tried either NVBLAS or thunking. I'm implementing the standard approach with interfaces etc at the moment, hopefully I will have access to the PGI compiler later today and I can get back to a familiar set of tools! – Karl Wilkinson Jan 21 '16 at 08:33
  • NVBLAS doesn't require interfaces to work. It works at the linker level. Have your tried just running a "plain" example which call without openMP or anything else in the host fortran? I have a standard example I can post in an answer which "just works" if you want to try it – talonmies Jan 21 '16 at 08:51

1 Answer


I am not sure what is going on here. If I take a very simple DGEMM example (cribbed directly from the MKL fortran guide):

      PROGRAM   MAIN

      IMPLICIT NONE

      DOUBLE PRECISION ALPHA, BETA
      INTEGER          M, K, N, I, J
      PARAMETER        (M=8000, K=8000, N=8000)
      DOUBLE PRECISION A(M,K), B(K,N), C(M,N)


      PRINT *, "Initializing data for matrix multiplication C=A*B for "
      PRINT 10, " matrix A(",M," x",K, ") and matrix B(", K," x", N, ")"
10    FORMAT(a,I5,a,I5,a,I5,a,I5,a)
      PRINT *, ""
      ALPHA = 1.0 
      BETA = 0.0

      PRINT *, "Initializing matrix data"
      PRINT *, ""
      DO I = 1, M
        DO J = 1, K
          A(I,J) = (I-1) * K + J
        END DO
      END DO

      DO I = 1, K
        DO J = 1, N
          B(I,J) = -((I-1) * N + J)
        END DO
      END DO

      DO I = 1, M
        DO J = 1, N
          C(I,J) = 0.0
        END DO
      END DO

      PRINT *, "Computing matrix product using DGEMM subroutine"
      CALL DGEMM('N','N',M,N,K,ALPHA,A,M,B,K,BETA,C,M)
      PRINT *, "Computations completed."
      PRINT *, ""

      PRINT *, "Top left corner of matrix A:"
      PRINT 20, ((A(I,J), J = 1,MIN(K,6)), I = 1,MIN(M,6))
      PRINT *, ""

      PRINT *, "Top left corner of matrix B:"
      PRINT 20, ((B(I,J),J = 1,MIN(N,6)), I = 1,MIN(K,6))
      PRINT *, ""

 20   FORMAT(6(F12.0,1x))

      PRINT *, "Top left corner of matrix C:"
      PRINT 30, ((C(I,J), J = 1,MIN(N,6)), I = 1,MIN(M,6))
      PRINT *, ""

 30   FORMAT(6(ES12.4,1x))

      PRINT *, "Example completed."
      STOP 

      END

If I build the code with the Intel compiler (12.1) and run it under nvprof (note I don't have access to MKL at the moment so I am using OpenBLAS built with ifort):

$ ifort -o nvblas_test nvblas_test.f -L/opt/cuda-7.5/lib64 -lnvblas
$ echo -e "NVBLAS_CPU_BLAS_LIB  /opt/openblas/lib/libopenblas.so\nNVBLAS_AUTOPIN_MEM_ENABLED\n" > nvblas.conf

$ nvprof --print-gpu-summary ./nvblas_test
==23978== NVPROF is profiling process 23978, command: ./nvblas_test
[NVBLAS] Config parsed
 Initializing data for matrix multiplication C=A*B for 
 matrix A( 8000 x 8000) and matrix B( 8000 x 8000)

 Initializing matrix data

 Computing matrix product using DGEMM subroutine
 Computations completed.

 Top left corner of matrix A:
          1.           2.           3.           4.           5.           6.
       8001.        8002.        8003.        8004.        8005.        8006.
      16001.       16002.       16003.       16004.       16005.       16006.
      24001.       24002.       24003.       24004.       24005.       24006.
      32001.       32002.       32003.       32004.       32005.       32006.
      40001.       40002.       40003.       40004.       40005.       40006.

 Top left corner of matrix B:
         -1.          -2.          -3.          -4.          -5.          -6.
      -8001.       -8002.       -8003.       -8004.       -8005.       -8006.
     -16001.      -16002.      -16003.      -16004.      -16005.      -16006.
     -24001.      -24002.      -24003.      -24004.      -24005.      -24006.
     -32001.      -32002.      -32003.      -32004.      -32005.      -32006.
     -40001.      -40002.      -40003.      -40004.      -40005.      -40006.

 Top left corner of matrix C:
 -1.3653E+15  -1.3653E+15  -1.3653E+15  -1.3653E+15  -1.3653E+15  -1.3653E+15
 -3.4131E+15  -3.4131E+15  -3.4131E+15  -3.4131E+15  -3.4131E+15  -3.4131E+15
 -5.4608E+15  -5.4608E+15  -5.4608E+15  -5.4608E+15  -5.4608E+15  -5.4608E+15
 -7.5086E+15  -7.5086E+15  -7.5086E+15  -7.5086E+15  -7.5086E+15  -7.5086E+15
 -9.5563E+15  -9.5563E+15  -9.5563E+15  -9.5563E+15  -9.5563E+15  -9.5563E+15
 -1.1604E+16  -1.1604E+16  -1.1604E+16  -1.1604E+16  -1.1604E+16  -1.1604E+16

 Example completed.
==23978== Profiling application: ./nvblas_test
==23978== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 92.15%  8.56855s       512  16.736ms  9.6488ms  21.520ms  void magma_lds128_dgemm_kernel<bool=0, bool=0, int=5, int=5, int=3, int=3, int=3>(int, int, int, double const *, int, double const *, int, double*, int, int, int, double const *, double const *, double, double, int)
  7.38%  685.77ms      1025  669.04us     896ns  820.55us  [CUDA memcpy HtoD]
  0.47%  44.017ms        64  687.77us  504.56us  763.05us  [CUDA memcpy DtoH]

I get what I expect - offload of the DGEMM call to the GPU. When I do this:

$ echo "NVBLAS_GPU_DISABLED_DGEMM" >> nvblas.conf 
$ nvprof --print-gpu-summary ./nvblas_test
==23991== NVPROF is profiling process 23991, command: ./nvblas_test
[NVBLAS] Config parsed
 Initializing data for matrix multiplication C=A*B for 
 matrix A( 8000 x 8000) and matrix B( 8000 x 8000)

 Initializing matrix data

 Computing matrix product using DGEMM subroutine
 Computations completed.

 Top left corner of matrix A:
          1.           2.           3.           4.           5.           6.
       8001.        8002.        8003.        8004.        8005.        8006.
      16001.       16002.       16003.       16004.       16005.       16006.
      24001.       24002.       24003.       24004.       24005.       24006.
      32001.       32002.       32003.       32004.       32005.       32006.
      40001.       40002.       40003.       40004.       40005.       40006.

 Top left corner of matrix B:
         -1.          -2.          -3.          -4.          -5.          -6.
      -8001.       -8002.       -8003.       -8004.       -8005.       -8006.
     -16001.      -16002.      -16003.      -16004.      -16005.      -16006.
     -24001.      -24002.      -24003.      -24004.      -24005.      -24006.
     -32001.      -32002.      -32003.      -32004.      -32005.      -32006.
     -40001.      -40002.      -40003.      -40004.      -40005.      -40006.

 Top left corner of matrix C:
 -1.3653E+15  -1.3653E+15  -1.3653E+15  -1.3653E+15  -1.3653E+15  -1.3653E+15
 -3.4131E+15  -3.4131E+15  -3.4131E+15  -3.4131E+15  -3.4131E+15  -3.4131E+15
 -5.4608E+15  -5.4608E+15  -5.4608E+15  -5.4608E+15  -5.4608E+15  -5.4608E+15
 -7.5086E+15  -7.5086E+15  -7.5086E+15  -7.5086E+15  -7.5086E+15  -7.5086E+15
 -9.5563E+15  -9.5563E+15  -9.5563E+15  -9.5563E+15  -9.5563E+15  -9.5563E+15
 -1.1604E+16  -1.1604E+16  -1.1604E+16  -1.1604E+16  -1.1604E+16  -1.1604E+16

 Example completed.
==23991== Profiling application: ./nvblas_test
==23991== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
100.00%     768ns         1     768ns     768ns     768ns  [CUDA memcpy HtoD]

I get no offload to the GPU. If you can't reproduce this, the problem is likely your compiler version (you haven't said which one you are using); if you can reproduce it, then perhaps the somewhat fancier build options you are using are interacting with NVBLAS in an unexpected way.
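One more diagnostic worth trying: NVBLAS can log every BLAS call it intercepts. Adding the following to nvblas.conf enables that trace (the keyword is documented; the log path is your choice):

```
NVBLAS_TRACE_LOG_ENABLED
NVBLAS_LOGFILE nvblas_trace.log
```

If DGEMM never appears in the trace, the call is not reaching libnvblas at all, which points to a link-order problem rather than a GPU/CPU routing decision.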

talonmies
  • Thanks for that, I can replicate your example with MKL. I am using ifort v16.0.1 by the way. However, in my code it is wrapped by the HDF5 compiler. I will try increasing the complexity of this simple example to try and break it. – Karl Wilkinson Jan 21 '16 at 12:50