
Seems like there are a lot of questions on here about moving double (or int, or float, etc) 2d arrays from host to device. This is NOT my question.

I have already moved all of the data onto the GPU, and the __global__ kernel calls several __device__ functions.

In these device functions, I have tried the following:

To allocate:

__device__ double** matrixCreate(int rows, int cols, double initialValue)
{
    double** temp = (double**)malloc(rows * sizeof(double*));
    if (temp == NULL) return NULL;              // heap exhausted
    for (int j = 0; j < rows; j++)
    {
        temp[j] = (double*)malloc(cols * sizeof(double));
        if (temp[j] == NULL)                    // clean up partial allocation
        {
            for (int k = 0; k < j; k++) { free(temp[k]); }
            free(temp);
            return NULL;
        }
    }

    // Set initial values
    for (int i = 0; i < rows; i++)
    {
        for (int j = 0; j < cols; j++)
        {
            temp[i][j] = initialValue;
        }
    }

    return temp;
}

To deallocate:

__device__ void matrixDestroy(double** temp, int rows)
{
    for (int j = 0; j < rows; j++) { free(temp[j]); }
    free(temp);
}

For single-dimension arrays the __device__ mallocs work great, but I can't seem to keep things stable in the multidimensional case. By the way, the variables are sometimes used like this:

double** z=matrixCreate(2,2,0);
double* x=z[0];

However, care is always taken to ensure no calls to free are made on active data. The code is actually an adaptation of CPU-only code, so I know nothing funny is going on with the pointers or memory. Basically I'm just re-defining the allocators and adding __device__ to the serial portions. I just want to run the whole serial bit 10000 times, and the GPU seems like a good way to do it.

++++++++++++++ UPDATE ++++++++++++++

Problem solved by Vyas. Per the CUDA documentation, the device heap size is initially 8 MB; if your mallocs exceed this, NSIGHT will not launch and the kernel crashes. Use the following in host code:

float increaseHeap=10;
cudaDeviceSetLimit(cudaLimitMallocHeapSize, size[0]*increaseHeap);

Worked for me!

  • The `malloc` and `free` are OK, assuming you included `<stdlib.h>`, and you check that each `malloc` doesn't return `NULL`. Perhaps there is memory corruption happening elsewhere in your code that didn't trigger any visible effect in the CPU version. Can you describe your symptoms better and/or post a testcase that reproduces a failure? – M.M Apr 24 '14 at 04:57
  • 1
    @MattMcNabb let's hope very strong this is C. `malloc()` in C++ is harmful. – The Paramagnetic Croissant Apr 24 '14 at 04:58
  • @OP: for starters, [don't cast the return value of standard allocators](http://stackoverflow.com/questions/605845/do-i-cast-the-result-of-malloc/605858#605858). – The Paramagnetic Croissant Apr 24 '14 at 04:59
  • What does "can't seem to keep it stable in the multidimensional case" mean? What exactly is the problem? Why aren't you checking the return value of the `malloc` calls for validity? – talonmies Apr 24 '14 at 05:00
  • @user3477950, `malloc` in C++ is OK for `double`. CUDA code for GPU is a subset of C++ (but NOT a subset or superset of C), so I think C++ tag is appropriate. As such, he needs the casts. – M.M Apr 24 '14 at 05:02
  • 2
    Did you try running your code through `cuda-memcheck`? Also did you try adding checks on the returned pointers from `malloc`? The GPU side `malloc()` is a suballocator from a limited heap. Depending on the number of allocations, it is possible the heap is being exhausted. You can change the size of the backing heap using `cudaDeviceSetLimit(cudaLimitMallocHeapSize, size_t size)`. For more info see : [link](http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#dynamic-global-memory-allocation-and-operations) – Vyas Apr 24 '14 at 05:04
  • Can't seem to keep it stable means that it crashes the computer. If I try to debug via NSIGHT, the executable crashes (but the computer doesn't). If I comment out the double** allocations, it still crashes NSIGHT, but will run through the program. Of course, by run through the program, I mean it skips everything useful. Also, device is cc5.0 (using VS2012, cuda toolkit 6.0, compiler set to compute_50,sm_50). – PointerFail Apr 24 '14 at 05:09
  • 1
    If it runs fine for the single dimensional case, have you considered linearising the matrix and addressing it such as `i + j*M`? That way whenever you deal with memory you can just treat it as a single block of memory. – Cramer Apr 24 '14 at 05:09
  • Like I said, this is actually part of a much larger piece of code which was originally designed using double** as storage. Though I could switch to a flat matrix, it is highly undesirable. – PointerFail Apr 24 '14 at 05:12
  • I'd like to try cuda-memcheck but NSIGHT crashes immediately after starting. Interestingly, NSIGHT does work on simpler pieces of code I've put together - so it's not the NSIGHT install. Also, the main function takes no arguments, so that's not it. – PointerFail Apr 24 '14 at 05:15
  • Have you checked the `malloc` results are not NULL as suggested by Vyas? – M.M Apr 24 '14 at 05:20
  • Unfortunately since the NSIGHT debugger crashes, I can't step into the kernel execution. That kind of limits my options for checking malloc results. If I could get NSIGHT working with this code, I'm sure I could trace down what is going on. Initially, I was thinking that if CUDA limited malloc usage, maybe that was causing my NSIGHT crashes. – PointerFail Apr 24 '14 at 05:22
  • Thanks Vyas, it was the heap size. Good call! – PointerFail Apr 24 '14 at 05:27
  • 2
    @Vyas: You would like a add a short answer to get this off the unanswered list? – talonmies Apr 24 '14 at 05:50

1 Answer


The GPU side `malloc()` is a suballocator from a limited heap. Depending on the number of allocations, it is possible the heap is being exhausted. You can change the size of the backing heap using `cudaDeviceSetLimit(cudaLimitMallocHeapSize, size_t size)`. For more info see the CUDA programming guide.
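A minimal host-side sketch of the fix (the 32 MB figure and the tiny demo kernel are illustrative assumptions, not from the question). The limit must be set before the first kernel that uses device-side malloc is launched:

```
#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernelUsingDeviceMalloc()
{
    // Device-side malloc draws from the runtime heap being sized below.
    double* p = (double*)malloc(256 * sizeof(double));
    if (p == NULL) { printf("device malloc failed\n"); return; }
    free(p);
}

int main()
{
    // Default heap is 8 MB; raise it to 32 MB (illustrative value).
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 32 * 1024 * 1024);

    // Read the limit back to confirm it took effect.
    size_t heap = 0;
    cudaDeviceGetLimit(&heap, cudaLimitMallocHeapSize);
    printf("malloc heap: %zu bytes\n", heap);

    kernelUsingDeviceMalloc<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}
```

Checking the pointer returned by the device-side `malloc()` turns a silent heap exhaustion into a visible diagnostic.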

Vyas