Below is the code where I get a segmentation fault when I try to print the matrix d_A, which is copied from the host matrix h_A. Printing h_A just before cudaMalloc works, but trying to print d_A (the device matrix) after cudaMemcpy gives me the error.

I compile with: nvcc -arch=sm_20 Trial.cu -o out

  #include <stdio.h>
  #include <sstream> 
  #include <stdlib.h> 
  #include <time.h> 
  #include <math.h> 
  #include <unistd.h> 
  #include <sys/time.h> 
  #include <stdint.h>
  #include <cuda.h> 
  #include <time.h> 
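  // Wrap each CUDA runtime call so a failure reports its file and line.
  #define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
  inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
  {
     if (code != cudaSuccess)
     {
       fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
       if (abort) exit(code);
     }
  }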

  void LUdecomposition(float *h_A,float *A_,int dim,unsigned int size_A,int row_A)
  { 
    float *d_A;int k;
    gpuErrchk(cudaMalloc(&d_A, size_A*sizeof(float)));
    gpuErrchk(cudaMemcpy(d_A, h_A, size_A*sizeof(float), cudaMemcpyHostToDevice));

    printf("\n D_A");

    gpuErrchk(cudaMemcpy(A_,d_A,size_A*sizeof(float), cudaMemcpyDeviceToHost));

    for(int i=0; i<size_A; i++)
    {

            if (i % row_A == 0) printf("\n");
            printf("%f ", A_[i]);

    }
    printf("\n D_A");      
  }
  void input_matrix_generation_A(float *Matrix, unsigned int row, unsigned int column,  unsigned int size)
  {

    for (int i=0; i<size; i++)
    {
            Matrix[i] = rand()%5+1;
            if (i % column == 0) printf("\n");
    }       

  }       
  int main(int argc, char *argv[])
  {
    int m=4;int dim=2;

    int size_A=m*m;
    float *A, *A_;

    A = (float*)malloc(sizeof(float)*size_A);
    input_matrix_generation_A(A,m,m,size_A);

    A_ = (float*)malloc(sizeof(float)*size_A);
    LUdecomposition(A,A_,dim,size_A,m);
     for(int i=0; i<size_A; i++)
    {

            if (i % m == 0) printf("\n");
            printf("%f ", A_[i]);

    }

    return 0;
   }
Usuwi
  • In addition to the answers already provided below, do not forget to do the [cuda error checking](http://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api). – Sagar Masuti Dec 02 '13 at 06:50

2 Answers

You are dereferencing a device pointer on the host, which is undefined behavior and causes the segmentation fault. The following line of code is therefore invalid:

printf("%f ", d_A[i]);

Also, you are copying back more bytes than the buffers hold, because the size is computed with sizeof(double) although the matrices are float:

cudaMemcpy(A_,d_A,size_A*sizeof(double), cudaMemcpyDeviceToHost);

It should be

cudaMemcpy(A_,d_A,size_A*sizeof(float), cudaMemcpyDeviceToHost);
sgarizvi
  • Actually, I want to print the output matrix in the host code. Should I do it in main and print A_[i]? Even that gives an error. Where can I print the final matrix, and is there any way to print both the device matrix d_A and the final matrix A_ that I am trying to print on the host side? I appreciate your help. – Usuwi Dec 02 '13 at 07:10
  • @Ankit... If you want to print the output matrix, you just have to copy `d_A` to the host and print it, just as you are doing now. But make sure there are no statements that can cause undefined behavior. Also add error checking to all the CUDA API calls, as Sagar Masuti specified in the comment. – sgarizvi Dec 02 '13 at 07:19
  • Considering your device's compute capability (2.0), you can do it in device code with regular `printf`. See [this post](http://stackoverflow.com/a/6586329/2386951). – Farzad Dec 02 '13 at 07:22
  • Actually, I am trying to print the host matrix A_ after copying it from the device with cudaMemcpy(A_,d_A,size_A*sizeof(float), cudaMemcpyDeviceToHost); but it prints 0.0 for every position in the array. Printing A_ in the main function (which is host code) also gives 0.0 for every position in A_[]. Why am I getting 0.0? Or am I printing it wrong? – Usuwi Dec 02 '13 at 07:55
  • @Ankit... Have you removed the part where you access device memory on the host, `printf("%f ", d_A[i]);`? I ran the given code, and after removing this line it works fine. – sgarizvi Dec 02 '13 at 08:09
  • I have edited the code I originally posted above and am running that same code. I print A_ inside LUdecomposition after cudaMemcpy(A_,d_A,size_A*sizeof(float), cudaMemcpyDeviceToHost); and inside main after the call to LUdecomposition, but it still prints 0.0 for all the values of the matrix. Are you printing inside LUdecomposition or main? – Usuwi Dec 02 '13 at 08:28
  • @Ankit... Most probably one or more of the CUDA API calls are failing. You still haven't added [CUDA error checking](http://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api) despite the recommendation. I am sure you will get to the root cause after adding error checks. – sgarizvi Dec 02 '13 at 08:38
  • Actually, I have made some changes (I am not sure I did it correctly) and wrapped the API calls in gpuErrchk in the code above, but it gives this error: GPUassert: CUDA driver version is insufficient for CUDA runtime version Trial.cu 23 – Usuwi Dec 02 '13 at 08:51
  • @Ankit... See, you got the error. Update the graphics driver, because the driver installed on the system is incompatible with the CUDA version. It is recommended to install the driver shipped with the CUDA toolkit. – sgarizvi Dec 02 '13 at 10:21
  • I am running my jobs on a cluster, and the CUDA version it shows is cuda/4.2.9. I just load CUDA with the shell command module load cuda and run the shell script directly with sbatch on the cluster. Please correct me if I am wrong. Do I need some other drivers? – Usuwi Dec 02 '13 at 13:08
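
One way to narrow down a "CUDA driver version is insufficient for CUDA runtime version" error is to print both versions on the node where the job actually runs. The following is only a sketch using the runtime API calls `cudaDriverGetVersion` and `cudaRuntimeGetVersion`; the message wording and the program layout are illustrative, not taken from the thread.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int driverVersion = 0;
    int runtimeVersion = 0;

    // Latest CUDA version the installed driver supports (0 if no driver is installed).
    cudaError_t err = cudaDriverGetVersion(&driverVersion);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaDriverGetVersion: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Version of the CUDA runtime this program was built against.
    err = cudaRuntimeGetVersion(&runtimeVersion);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaRuntimeGetVersion: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Both values are encoded as 1000 * major + 10 * minor.
    printf("driver supports CUDA %d.%d\n", driverVersion / 1000, (driverVersion % 100) / 10);
    printf("runtime is CUDA %d.%d\n", runtimeVersion / 1000, (runtimeVersion % 100) / 10);

    if (driverVersion < runtimeVersion)
        printf("the driver is older than the runtime; calls such as cudaMalloc will fail\n");

    return 0;
}
```

If the driver value is lower than the runtime value (or 0), the node needs a newer driver or the program needs to be built against an older toolkit; on a cluster that is usually something to raise with the administrators.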

In your code, at about line 23, you write:

for(int i=0; i<size_A; i++)
{
    if (i % row_A == 0) printf("\n");
    printf("%f ", d_A[i]);
}

and this is the part that triggers the segmentation fault.

Note that the device pointer d_A refers to global memory on the GPU and must never be dereferenced directly on the CPU side.
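
As an illustration of that rule, here is a minimal sketch of the usual pattern: copy the device buffer into a host buffer and print only through the host pointer. The names `h_in` and `h_out`, the 4x4 size, and the fill values are assumptions made for the example, not part of the original code.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main()
{
    const int n = 4;                        // assumed 4x4 matrix
    const int count = n * n;
    const size_t bytes = count * sizeof(float);

    float *h_in  = (float*)malloc(bytes);   // host source
    float *h_out = (float*)malloc(bytes);   // host destination for the copy back
    for (int i = 0; i < count; i++) h_in[i] = (float)(i + 1);

    float *d_A = NULL;                      // device pointer: only valid on the GPU
    cudaError_t err = cudaMalloc(&d_A, bytes);
    if (err == cudaSuccess) err = cudaMemcpy(d_A, h_in, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) err = cudaMemcpy(h_out, d_A, bytes, cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Print through the host pointer h_out; d_A is never dereferenced here.
    for (int i = 0; i < count; i++) {
        if (i % n == 0) printf("\n");
        printf("%f ", h_out[i]);
    }
    printf("\n");

    cudaFree(d_A);
    free(h_in);
    free(h_out);
    return 0;
}
```

If the printed values are all 0.0, the copies probably failed, which is why the sketch checks the return code of every call before printing.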

starrify
  • Actually, I want to print the output matrix in the host code. Should I do it in main and print A_[i]? Even that gives an error. Where can I print the final matrix, and is there any way to print both the device matrix d_A and the final matrix A_ that I am trying to print on the host side? I appreciate your help. – Usuwi Dec 02 '13 at 07:04
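
On the question of printing the device matrix itself: devices of compute capability 2.0 and later support `printf` inside kernels, so d_A can be printed from device code without copying it back. The sketch below assumes such a device; the kernel name `printMatrix`, the 4x4 size, and the fill values are hypothetical.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Runs on the GPU, so dereferencing d_A is valid here.
__global__ void printMatrix(const float *d_A, int rows, int cols)
{
    // Let a single thread print so the rows come out in order.
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++)
                printf("%f ", d_A[r * cols + c]);
            printf("\n");
        }
    }
}

int main()
{
    const int n = 4;                          // assumed 4x4 matrix
    float h_A[n * n];
    for (int i = 0; i < n * n; i++) h_A[i] = (float)(i + 1);

    float *d_A = NULL;
    cudaError_t err = cudaMalloc(&d_A, sizeof(h_A));
    if (err == cudaSuccess) err = cudaMemcpy(d_A, h_A, sizeof(h_A), cudaMemcpyHostToDevice);
    if (err != cudaSuccess) {
        fprintf(stderr, "setup failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printMatrix<<<1, 1>>>(d_A, n, n);
    err = cudaGetLastError();                 // reports launch errors
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();        // waits for the kernel and flushes device printf output
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaFree(d_A);
    return 0;
}
```

Device-side output only appears once the host synchronizes, so without the `cudaDeviceSynchronize` call the program may exit before anything is printed.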