I am not sure what is going on here. If I take a very simple DGEMM example (cribbed directly from the MKL Fortran guide):
      PROGRAM MAIN
      IMPLICIT NONE
      DOUBLE PRECISION ALPHA, BETA
      INTEGER M, K, N, I, J
      PARAMETER (M=8000, K=8000, N=8000)
      DOUBLE PRECISION A(M,K), B(K,N), C(M,N)
      PRINT *, "Initializing data for matrix multiplication C=A*B for "
      PRINT 10, " matrix A(",M," x",K, ") and matrix B(", K," x", N, ")"
   10 FORMAT(a,I5,a,I5,a,I5,a,I5,a)
      PRINT *, ""
      ALPHA = 1.0
      BETA = 0.0
      PRINT *, "Intializing matrix data"
      PRINT *, ""
      DO I = 1, M
         DO J = 1, K
            A(I,J) = (I-1) * K + J
         END DO
      END DO
      DO I = 1, K
         DO J = 1, N
            B(I,J) = -((I-1) * N + J)
         END DO
      END DO
      DO I = 1, M
         DO J = 1, N
            C(I,J) = 0.0
         END DO
      END DO
      PRINT *, "Computing matrix product using DGEMM subroutine"
      CALL DGEMM('N','N',M,N,K,ALPHA,A,M,B,K,BETA,C,M)
      PRINT *, "Computations completed."
      PRINT *, ""
      PRINT *, "Top left corner of matrix A:"
      PRINT 20, ((A(I,J), J = 1,MIN(K,6)), I = 1,MIN(M,6))
      PRINT *, ""
      PRINT *, "Top left corner of matrix B:"
      PRINT 20, ((B(I,J),J = 1,MIN(N,6)), I = 1,MIN(K,6))
      PRINT *, ""
   20 FORMAT(6(F12.0,1x))
      PRINT *, "Top left corner of matrix C:"
      PRINT 30, ((C(I,J), J = 1,MIN(N,6)), I = 1,MIN(M,6))
      PRINT *, ""
   30 FORMAT(6(ES12.4,1x))
      PRINT *, "Example completed."
      STOP
      END
and build it with the Intel compiler (12.1), then run it under nvprof (note that I don't have access to MKL at the moment, so I am using OpenBLAS built with ifort):
$ ifort -o nvblas_test nvblas_test.f -L/opt/cuda-7.5/lib64 -lnvblas
$ echo -e "NVBLAS_CPU_BLAS_LIB /opt/openblas/lib/libopenblas.so\nNVBLAS_AUTOPIN_MEM_ENABLED\n" > nvblas.conf
$ nvprof --print-gpu-summary ./nvblas_test
==23978== NVPROF is profiling process 23978, command: ./nvblas_test
[NVBLAS] Config parsed
Initializing data for matrix multiplication C=A*B for
matrix A( 8000 x 8000) and matrix B( 8000 x 8000)
Intializing matrix data
Computing matrix product using DGEMM subroutine
Computations completed.
Top left corner of matrix A:
1. 2. 3. 4. 5. 6.
8001. 8002. 8003. 8004. 8005. 8006.
16001. 16002. 16003. 16004. 16005. 16006.
24001. 24002. 24003. 24004. 24005. 24006.
32001. 32002. 32003. 32004. 32005. 32006.
40001. 40002. 40003. 40004. 40005. 40006.
Top left corner of matrix B:
-1. -2. -3. -4. -5. -6.
-8001. -8002. -8003. -8004. -8005. -8006.
-16001. -16002. -16003. -16004. -16005. -16006.
-24001. -24002. -24003. -24004. -24005. -24006.
-32001. -32002. -32003. -32004. -32005. -32006.
-40001. -40002. -40003. -40004. -40005. -40006.
Top left corner of matrix C:
-1.3653E+15 -1.3653E+15 -1.3653E+15 -1.3653E+15 -1.3653E+15 -1.3653E+15
-3.4131E+15 -3.4131E+15 -3.4131E+15 -3.4131E+15 -3.4131E+15 -3.4131E+15
-5.4608E+15 -5.4608E+15 -5.4608E+15 -5.4608E+15 -5.4608E+15 -5.4608E+15
-7.5086E+15 -7.5086E+15 -7.5086E+15 -7.5086E+15 -7.5086E+15 -7.5086E+15
-9.5563E+15 -9.5563E+15 -9.5563E+15 -9.5563E+15 -9.5563E+15 -9.5563E+15
-1.1604E+16 -1.1604E+16 -1.1604E+16 -1.1604E+16 -1.1604E+16 -1.1604E+16
Example completed.
==23978== Profiling application: ./nvblas_test
==23978== Profiling result:
Time(%) Time Calls Avg Min Max Name
92.15% 8.56855s 512 16.736ms 9.6488ms 21.520ms void magma_lds128_dgemm_kernel<bool=0, bool=0, int=5, int=5, int=3, int=3, int=3>(int, int, int, double const *, int, double const *, int, double*, int, int, int, double const *, double const *, double, double, int)
7.38% 685.77ms 1025 669.04us 896ns 820.55us [CUDA memcpy HtoD]
0.47% 44.017ms 64 687.77us 504.56us 763.05us [CUDA memcpy DtoH]
I get what I expect: the DGEMM call is offloaded to the GPU. When I instead disable GPU DGEMM in the NVBLAS configuration:
$ echo "NVBLAS_GPU_DISABLED_DGEMM" >> nvblas.conf
$ nvprof --print-gpu-summary ./nvblas_test
==23991== NVPROF is profiling process 23991, command: ./nvblas_test
[NVBLAS] Config parsed
Initializing data for matrix multiplication C=A*B for
matrix A( 8000 x 8000) and matrix B( 8000 x 8000)
Intializing matrix data
Computing matrix product using DGEMM subroutine
Computations completed.
Top left corner of matrix A:
1. 2. 3. 4. 5. 6.
8001. 8002. 8003. 8004. 8005. 8006.
16001. 16002. 16003. 16004. 16005. 16006.
24001. 24002. 24003. 24004. 24005. 24006.
32001. 32002. 32003. 32004. 32005. 32006.
40001. 40002. 40003. 40004. 40005. 40006.
Top left corner of matrix B:
-1. -2. -3. -4. -5. -6.
-8001. -8002. -8003. -8004. -8005. -8006.
-16001. -16002. -16003. -16004. -16005. -16006.
-24001. -24002. -24003. -24004. -24005. -24006.
-32001. -32002. -32003. -32004. -32005. -32006.
-40001. -40002. -40003. -40004. -40005. -40006.
Top left corner of matrix C:
-1.3653E+15 -1.3653E+15 -1.3653E+15 -1.3653E+15 -1.3653E+15 -1.3653E+15
-3.4131E+15 -3.4131E+15 -3.4131E+15 -3.4131E+15 -3.4131E+15 -3.4131E+15
-5.4608E+15 -5.4608E+15 -5.4608E+15 -5.4608E+15 -5.4608E+15 -5.4608E+15
-7.5086E+15 -7.5086E+15 -7.5086E+15 -7.5086E+15 -7.5086E+15 -7.5086E+15
-9.5563E+15 -9.5563E+15 -9.5563E+15 -9.5563E+15 -9.5563E+15 -9.5563E+15
-1.1604E+16 -1.1604E+16 -1.1604E+16 -1.1604E+16 -1.1604E+16 -1.1604E+16
Example completed.
==23991== Profiling application: ./nvblas_test
==23991== Profiling result:
Time(%) Time Calls Avg Min Max Name
100.00% 768ns 1 768ns 768ns 768ns [CUDA memcpy HtoD]
I get no offload to the GPU. If you can't reproduce this, then the problem may lie with your compiler version (you haven't said which one you are using). If you can reproduce it, then perhaps the somewhat fancier build options you are using are interacting with NVBLAS in an unexpected way.
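If you want to repeat the comparison without editing nvblas.conf by hand each time, something along these lines might help. It is only a sketch, and I did not run it as part of the test above. The file names nvblas_gpu.conf and nvblas_nogpu.conf and the helper dgemm_offloaded are my own inventions for illustration. The grep pattern assumes the GPU kernel name contains "dgemm" followed by "kernel", as it does in the nvprof output above; other CUDA versions or GPUs may use different kernel names. Selecting a file through the NVBLAS_CONFIG_FILE environment variable is how I understand the NVBLAS documentation, so check that against your CUDA version.

```shell
#!/bin/sh
# Sketch: keep the two NVBLAS configurations from the test in separate files.
# The file names are hypothetical; the keywords are the ones used above.

cat > nvblas_gpu.conf <<'EOF'
NVBLAS_CPU_BLAS_LIB /opt/openblas/lib/libopenblas.so
NVBLAS_AUTOPIN_MEM_ENABLED
EOF

# Same settings, plus the keyword that keeps DGEMM on the CPU BLAS.
cp nvblas_gpu.conf nvblas_nogpu.conf
echo "NVBLAS_GPU_DISABLED_DGEMM" >> nvblas_nogpu.conf

# Hypothetical helper: succeeds if a saved nvprof GPU summary lists a
# DGEMM kernel. Matching "dgemm...kernel" avoids a false hit on the
# program's own "Computing matrix product using DGEMM subroutine" line.
dgemm_offloaded() {
    grep -qi 'dgemm.*kernel' "$1"
}

# Usage (needs nvprof and the test binary, so it is left commented out):
#   NVBLAS_CONFIG_FILE=nvblas_nogpu.conf \
#       nvprof --print-gpu-summary ./nvblas_test > run.log 2>&1
#   if dgemm_offloaded run.log; then echo "offloaded"; else echo "CPU only"; fi
```

With the nvblas_gpu.conf variant I would expect the helper to succeed on output like the first profile above, and with nvblas_nogpu.conf to fail as in the second, where only a single HtoD memcpy appears.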