
I am using a remote workstation with an NVIDIA GeForce GPU. After compiling and executing the program, this is what appears on screen when I try to profile it with the `nvprof` command (the warnings are quoted below the code).

This is the output when I run `nvidia-smi`: [screenshot of the nvidia-smi output]

#include <stdio.h>
#include <cuda.h>
#include <math.h>

__global__ void matrixInit(double *matrix, int width, int height, double value){
    for(int i = (threadIdx.x + blockIdx.x * blockDim.x); i<width; i+=(blockDim.x * gridDim.x)){
        for(int j = (threadIdx.y + blockIdx.y * blockDim.y); j<height; j+=(blockDim.y * gridDim.y)){
            matrix[j * width +i] = value;
        }
    }
}

__global__ void matrixAdd(double *d_A, double *d_B, double *d_C, int width, int height){
    int ix = threadIdx.x + blockIdx.x * blockDim.x;
    int iy = threadIdx.y + blockIdx.y * blockDim.y;

    int stride_x = blockDim.x * gridDim.x;
    int stride_y = blockDim.y * gridDim.y;

    for(int j=iy; j<height; j+=stride_y){
        for(int i=ix; i<width; i+=stride_x){
            int index = j * width +i;
            d_C[index] = d_A[index-1] + d_B[index];
        }
    }
}

int main(){
    int Nx = 1<<12;
    int Ny = 1<<15;


    size_t size = Nx*Ny*sizeof(double);

 // host memory pointers
    double *A, *B, *C;

 // device memory pointers
    double *d_A, *d_B, *d_C;

    // allocate host memory
    A = (double*)malloc(size);
    B = (double*)malloc(size);
    C = (double*)malloc(size);

    // kernel call
    int thread = 32;
    int block_x = ceil(Nx + thread -1)/thread;
    int block_y = ceil(Ny + thread -1)/thread;

    dim3 THREADS(thread,thread);
    dim3 BLOCKS(block_y,block_x);

    // initialize variables
    matrixInit<<<BLOCKS,THREADS>>>(A, Nx, Ny, 1.0);
    matrixInit<<<BLOCKS,THREADS>>>(B, Nx, Ny, 2.0);
    matrixInit<<<BLOCKS,THREADS>>>(C, Nx, Ny, 0.0);

    //allocated device memory

    cudaMalloc(&d_A, size);
    cudaMalloc(&d_B, size);
    cudaMalloc(&d_C, size);


//copy to device
    cudaMemcpy(d_A, A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, B, size, cudaMemcpyHostToDevice);


// Add matrix at GPU
    matrixAdd<<<BLOCKS,THREADS>>>(A, B, C, Nx, Ny);

//copy back to host
    cudaMemcpy(C, d_C, size, cudaMemcpyDeviceToHost);

    cudaFree(A);
    cudaFree(B);
    cudaFree(C);

    return 0;

}

This is my code. In summary, the profiler output shows these two warning messages:

==525867== Warning: 4 records have invalid timestamps due to insufficient device buffer space. You can configure the buffer space using the option --device-buffer-size.                
==525867== Warning: 1 records have invalid timestamps due to insufficient semaphore pool size. You can configure the pool size using the option --profiling-semaphore-pool-size. 
==525867== Profiling result: No kernels were profiled.
  • On [your previous question](https://stackoverflow.com/questions/69634659/using-matrix-addition-in-cuda-c-code-executes-but-when-profiling-it-with-nvprof) I asked for the output from `nvidia-smi` (and I did not mean just the first 15 characters of that output) and I also suggested that you add [proper CUDA error checking](https://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api). – Robert Crovella Oct 20 '21 at 20:30
  • Oh yeah, thank you, fixed. I have added a screenshot of the nvidia-smi command output. I have also improved the code. – Fasil Oct 20 '21 at 20:57
  • You don't have proper CUDA error checking in your code. And I'm not able to determine what kind of GPU you have from the nvidia-smi output. Please run the following command instead: `nvidia-smi -q |grep -i name` and provide the result of that command. – Robert Crovella Oct 20 '21 at 21:00
  • I apologize, I am new to CUDA programming. Your command returns: NVIDIA GeForce GTX 980 – Fasil Oct 20 '21 at 21:07

1 Answer

matrixInit<<<BLOCKS,THREADS>>>(A, Nx, Ny, 1.0);
matrixInit<<<BLOCKS,THREADS>>>(B, Nx, Ny, 2.0);
matrixInit<<<BLOCKS,THREADS>>>(C, Nx, Ny, 0.0);

You are writing to host memory here (A, B and C were allocated with malloc), which is not allowed from device code.

Instead you can run matrixInit() directly on the device arrays d_A, d_B and d_C, after the cudaMalloc() allocations.

Another mistake here:

cudaFree(A);
cudaFree(B);
cudaFree(C);

Those should be d_A, d_B and d_C. Use regular free() for A, B and C.

Your kernels are also not doing what you want: you launch them with one thread per matrix entry, so the grid-stride for() loops in the kernels are unnecessary.
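Putting those fixes together, a corrected version of the program might look like the sketch below. Note that it also goes slightly beyond the text of this answer: it drops the stray index-1 offset in matrixAdd, maps the grid's x dimension to the matrix width (the original passed block_y first), launches matrixAdd on the device pointers rather than the host ones, and adds the error-checking macro suggested in the comments. The CHECK macro name is my own; it is a common idiom, not part of the original code.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Common error-checking idiom (name "CHECK" is illustrative)
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// One thread per entry: bounds check instead of grid-stride loops
__global__ void matrixInit(double *matrix, int width, int height, double value){
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int j = threadIdx.y + blockIdx.y * blockDim.y;
    if (i < width && j < height)
        matrix[j * width + i] = value;
}

__global__ void matrixAdd(double *d_A, double *d_B, double *d_C, int width, int height){
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int j = threadIdx.y + blockIdx.y * blockDim.y;
    if (i < width && j < height){
        int index = j * width + i;
        d_C[index] = d_A[index] + d_B[index];  // was d_A[index-1] in the question
    }
}

int main(){
    int Nx = 1<<12;
    int Ny = 1<<15;
    size_t size = (size_t)Nx * Ny * sizeof(double);

    double *C = (double*)malloc(size);   // host buffer for the result
    double *d_A, *d_B, *d_C;
    CHECK(cudaMalloc(&d_A, size));
    CHECK(cudaMalloc(&d_B, size));
    CHECK(cudaMalloc(&d_C, size));

    int thread = 32;
    dim3 THREADS(thread, thread);
    // x dimension covers the width, y dimension the height
    dim3 BLOCKS((Nx + thread - 1) / thread, (Ny + thread - 1) / thread);

    // initialize directly on the device arrays, not the host ones
    matrixInit<<<BLOCKS, THREADS>>>(d_A, Nx, Ny, 1.0);
    matrixInit<<<BLOCKS, THREADS>>>(d_B, Nx, Ny, 2.0);
    matrixAdd<<<BLOCKS, THREADS>>>(d_A, d_B, d_C, Nx, Ny);
    CHECK(cudaGetLastError());           // catch launch-configuration errors

    CHECK(cudaMemcpy(C, d_C, size, cudaMemcpyDeviceToHost));
    printf("C[0] = %f\n", C[0]);         // each entry should be 1.0 + 2.0

    CHECK(cudaFree(d_A));                // device memory: cudaFree()
    CHECK(cudaFree(d_B));
    CHECK(cudaFree(d_C));
    free(C);                             // host memory: free()
    return 0;
}
```

With the kernels actually running on device memory, nvprof should now have kernel records to report.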

Hugo Maxwell