Matrix addition in CUDA C executes, but nvprof reports "No kernels were profiled"

Question

nvprof profiles the API calls just fine, but it reports that no kernels were profiled. It shows these two warning messages, followed by the profiling result:

==525867== Warning: 4 records have invalid timestamps due to insufficient device buffer space. You can configure the buffer space using the option --device-buffer-size.
==525867== Warning: 1 records have invalid timestamps due to insufficient semaphore pool size. You can configure the pool size using the option --profiling-semaphore-pool-size.
==525867== Profiling result: No kernels were profiled.

I am using an NVIDIA GeForce GPU.

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <time.h>
#include <cuda_profiler_api.h>



__global__ void matrixInit(float *m, int N_1, int N_2, float value){
    unsigned int ix = threadIdx.x + blockIdx.x * blockDim.x;
    unsigned int iy = threadIdx.y + blockIdx.y * blockDim.y;
    unsigned int strideX = blockDim.x * gridDim.x;
    unsigned int strideY = blockDim.y * gridDim.y;

    for(int j=iy; j<N_2; j+=strideY){
        for(int i=ix; i<N_1; i+=strideX){
            m[j*N_1+i] = value;
        }
    }
}


__global__ void matrixAdd(float *d_A, float *d_B, float *d_C, int N_1, int N_2){
    // indexes and strides in 2d

    unsigned int ix = threadIdx.x + blockIdx.x * blockDim.x;
    unsigned int iy = threadIdx.y + blockIdx.y * blockDim.y;
    unsigned int strideX = blockDim.x * gridDim.x;
    unsigned int strideY = blockDim.y * gridDim.y;

    for(int j=iy; j<N_2; j+=strideY){
        for(int i=ix; i<N_1; i+=strideX){
            d_C[j*N_1+i] = d_A[j*N_1+i]+d_B[j*N_1+i];
        }
    }
}

int main() {
    int N_1 = 1 << 12;
    int N_2 = 1 << 15;

    // total number of elements
    int N_1_2 = N_1 * N_2;

    // host memory pointer for the result
    float *C;

    // device memory pointers
    float *d_A, *d_B, *d_C;

    clock_t t = clock();

    size_t bytes = N_1_2 * sizeof(float);

    // allocate host memory
    C = (float*)malloc(bytes);

    // set dimensions for a 2D launch
    int threadsPerBlock = 32;
    dim3 threads(threadsPerBlock, threadsPerBlock);
    dim3 numBlocks(N_1 / threads.x, N_2 / threads.y);
    printf("Grid size of X: %u Grid size of Y: %u\n", numBlocks.x, numBlocks.y);

    // allocate device memory
    cudaMalloc(&d_A, bytes);
    cudaMalloc(&d_B, bytes);
    cudaMalloc(&d_C, bytes);

    // initialize on the device (kernels must be given device pointers)
    matrixInit<<<numBlocks, threads>>>(d_A, N_1, N_2, 1.0f);
    matrixInit<<<numBlocks, threads>>>(d_B, N_1, N_2, 2.0f);
    matrixInit<<<numBlocks, threads>>>(d_C, N_1, N_2, 0.0f);

    matrixAdd<<<numBlocks, threads>>>(d_A, d_B, d_C, N_1, N_2);

    // copy back to host (this call also waits for the kernels to finish)
    cudaMemcpy(C, d_C, bytes, cudaMemcpyDeviceToHost);

    t = clock() - t;

    printf("Program executed at %f seconds", ((float)t) / CLOCKS_PER_SEC);

    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    free(C);

    cudaProfilerStop();

    return 0;
}

What GPU are you running on? Please paste the exact output from the nvprof command in your question. — Robert Crovella, Oct 19 '21 at 17:18
I am running this on a remote workstation that has an NVIDIA GPU. The exact output is: `root@DESKTOP-OSH9DG0:~/assign# nvprof ./tester_16 ==455== NVPROF is profiling process 455, command: ./tester_16 Program executed at 0.093750 seconds==455== Profiling application: ./tester_16 ==455== Profiling result: No kernels were profiled. ==455== API calls: No API activities were profiled. root@DESKTOP-OSH9DG0:~/assign#` — Fasil, Oct 19 '21 at 17:29
Need to know what NVIDIA GPU is in that remote workstation. Try running `nvidia-smi` there and pasting the result into your question. — Robert Crovella, Oct 19 '21 at 17:31
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running. — Fasil, Oct 19 '21 at 17:37
So that machine is broken, then: it has a broken CUDA install. That also means the profiler will not work, and your code is not executing correctly. Your posted code has no CUDA error checking and no results checking, so even though you said "code executes", it is quite possible that it is not actually executing correctly. — Robert Crovella, Oct 19 '21 at 17:40
Robert Crovella, thanks for the input, brother. From what you said, if `nvidia-smi` returns that result and the machine is broken, any issues within the code will remain unclear. — Fasil, Oct 19 '21 at 19:55
Correct. You won't be able to determine anything useful. Anytime anyone is having trouble with a CUDA code, I always suggest that they use [proper CUDA error checking](https://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api), also. But that isn't going to help much here. — Robert Crovella, Oct 19 '21 at 20:06
I reinstalled my driver, and nvprof now profiles the API just fine, but it still reports that no kernels were profiled. It shows these two warning messages: "==414064== Warning: 7 records have invalid timestamps due to insufficient device buffer space. You can configure the buffer space using the option --device-buffer-size. ==414064== Warning: 3 records have invalid timestamps due to insufficient semaphore pool size. You can configure the pool size using the option --profiling-semaphore-pool-size." — Fasil, Oct 20 '21 at 09:16
Back to question 1: what GPU are you running this on? I'm aware that it is an NVIDIA GPU. I want to know which type of GPU. Paste the output of `nvidia-smi` into your question. You can edit your question with these details. — Robert Crovella, Oct 20 '21 at 13:01
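
The error checking and results checking suggested in the comments could look roughly like the sketch below. It is a simplified 1D version of the addition, not the asker's code: `CUDA_CHECK`, `fill`, `add`, and `countMismatches` are hypothetical names introduced for illustration, and the array size and launch configuration are arbitrary. It also prints the name and compute capability of device 0, which is one way to report the GPU type from inside a program.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro (an assumption, not from the question):
// abort with a message if a CUDA runtime call does not return cudaSuccess.
#define CUDA_CHECK(call)                                                \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            fprintf(stderr, "CUDA error %s at %s:%d: %s\n",             \
                    cudaGetErrorName(err_), __FILE__, __LINE__,         \
                    cudaGetErrorString(err_));                          \
            exit(EXIT_FAILURE);                                         \
        }                                                               \
    } while (0)

// Fill a 1D device array with a constant using a grid-stride loop.
__global__ void fill(float *m, size_t n, float value) {
    size_t stride = (size_t)blockDim.x * gridDim.x;
    for (size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x; i < n; i += stride)
        m[i] = value;
}

// Element-wise c = a + b using a grid-stride loop.
__global__ void add(const float *a, const float *b, float *c, size_t n) {
    size_t stride = (size_t)blockDim.x * gridDim.x;
    for (size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x; i < n; i += stride)
        c[i] = a[i] + b[i];
}

// Host-side results check: count elements that differ from the expected value.
size_t countMismatches(const float *c, size_t n, float expected) {
    size_t bad = 0;
    for (size_t i = 0; i < n; ++i)
        if (c[i] != expected) ++bad;
    return bad;
}

int main() {
    // Query the device first; these calls fail early if the driver is broken.
    int count = 0;
    CUDA_CHECK(cudaGetDeviceCount(&count));
    cudaDeviceProp prop;
    CUDA_CHECK(cudaGetDeviceProperties(&prop, 0));
    printf("GPU 0: %s (compute capability %d.%d)\n", prop.name, prop.major, prop.minor);

    const size_t n = 1 << 20;  // illustrative size, smaller than the question's
    const size_t bytes = n * sizeof(float);

    float *d_A, *d_B, *d_C;
    CUDA_CHECK(cudaMalloc(&d_A, bytes));
    CUDA_CHECK(cudaMalloc(&d_B, bytes));
    CUDA_CHECK(cudaMalloc(&d_C, bytes));

    fill<<<256, 256>>>(d_A, n, 1.0f);
    CUDA_CHECK(cudaGetLastError());       // reports launch errors
    fill<<<256, 256>>>(d_B, n, 2.0f);
    CUDA_CHECK(cudaGetLastError());
    add<<<256, 256>>>(d_A, d_B, d_C, n);
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaDeviceSynchronize());  // reports errors raised while the kernels ran

    float *C = (float *)malloc(bytes);
    if (C == NULL) {
        fprintf(stderr, "host allocation failed\n");
        return EXIT_FAILURE;
    }
    CUDA_CHECK(cudaMemcpy(C, d_C, bytes, cudaMemcpyDeviceToHost));

    // 1.0f + 2.0f is exactly 3.0f, so an exact comparison is safe here.
    printf("mismatches: %zu\n", countMismatches(C, n, 3.0f));

    free(C);
    CUDA_CHECK(cudaFree(d_A));
    CUDA_CHECK(cudaFree(d_B));
    CUDA_CHECK(cudaFree(d_C));
    return 0;
}
```

With checks like these, a program running on a machine with a broken driver install is expected to stop at the first `CUDA_CHECK` with an error message instead of appearing to execute normally.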
