2

I am having a memory allocation problem which I can't understand. I am trying to allocate a char array in GPU (I am guessing it is probably a memory fragmentation issue).

Here is my code:

#include<stdio.h>
#include<stdlib.h>
#include<string.h>
#include<cuda.h>

// Error-check helper: prints the CUDA error string with the call site
// (file/line) and exits the process when `abort` is nonzero.
// NOTE(review): `file` should be `const char *` — the gpuErrchk macro
// passes the string literal __FILE__, and literal-to-`char*` conversion
// is deprecated (and rejected by modern C++ compilers).
inline void gpuAssert(cudaError_t code, char *file, int line, 
                 int abort=1)
{  
   if (code != cudaSuccess) {
      printf("GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
      if (abort) exit(code);
   }
}
// Wraps a CUDA runtime call so failures report the exact call site.
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }

// Copies a single char from *k to *i in device code.
// NOTE(review): the launch below passes the HOST string literal "arun"
// as `k`; device code cannot dereference a host pointer, so this kernel
// faults — the error only surfaces at the next synchronizing API call.
__global__ void calc(char *k,char *i)
{
       *i=*k;
}
int main()
{
        char *dev_o=0;
        char *i;
        // Host-side destination buffer for the string copied back.
        i = (char*)malloc(10*sizeof(char));

        // NOTE(review): return value unchecked — a failure (or an earlier
        // sticky error) goes unnoticed until a checked call later on.
        cudaMalloc((void**)&dev_o,10*sizeof(char)); //Line 31


        // BUG: "arun" is a host pointer; the kernel cannot dereference it.
        // The launch itself returns no error — the fault surfaces at the
        // next synchronizing call (the cudaMemcpy below), which is why the
        // reported "out of memory" appears to come from a later line.
        calc<<<1,1>>>("arun",dev_o);

        // BUG: `&i` is the address of the pointer variable itself; the
        // destination should be `i`, the 10-byte buffer malloc'd above.
        gpuErrchk(cudaMemcpy(&i,dev_o,10*sizeof(char),cudaMemcpyDeviceToHost));

        cudaFree(dev_o);

        printf("string : %s \n",i);

        return 0;
}

but I'm getting output as,

GPUassert: out of memory sample2.cu 31

In the same program, allocating an integer on the GPU works properly.

My GPU device information is given as,

--- General Information for device 0 ---
Name:GeForce GTX 460 SE
Compute capability:2.1
Clock rate:1296000
Device copy overlap:Enabled
Kernel execition timeout :Enabled
--- Memory Information for device 0 ---
Total global mem:1073283072
Total constant Mem:65536
Max mem pitch:2147483647
Texture Alignment:512
--- MP Information for device 0 ---
Multiprocessor count:6
Shared mem per mp:49152
Registers per mp:32768
Threads in warp:32
Max threads per block:1024
Max thread dimensions:(1024, 1024, 64)
Max grid dimensions:(65535, 65535, 65535)

Can anyone tell me what is the problem and how I can overcome it?

ParleBoy
  • 23
  • 4
  • 1
    One of your errors is the `&i` in `cudaMemcpy()`. It should be `i`. – BenC May 16 '13 at 10:07
  • 1
    Also, you are not checking the possible error created by your kernel call. The error appears in there, you only catch it later. – BenC May 16 '13 at 10:10
  • 1
    Another error is passing host-pointer to char array as first argument to kernel call. And check return value of `cudaMalloc` and call [`cudaDeviceSynchronize` after kernel to pinpoint error location](http://stackoverflow.com/tags/cuda/info). – aland May 16 '13 at 10:12
  • can I use cudaDeviceReset() to reset the device memory? – ParleBoy May 16 '13 at 10:16
  • I'd suggest reading first the programming guide and then trying to write code. http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html – KiaMorot May 16 '13 at 11:46

1 Answers1

2

Several things were wrong in your code.

  1. cudaMemcpy(&i, ...) should be cudaMemcpy(i, ...).
  2. Check the return error of your kernel call as explained in this post. If you don't, the error will seem to appear later in your code.

    gpuErrchk(cudaPeekAtLastError());
    gpuErrchk(cudaDeviceSynchronize());
    
  3. Your char *k argument to your kernel is a host pointer. You should create another device array and copy your data to the device before calling your kernel.
  4. You were also not doing any parallel work on your threads in your calc() kernel since you were not using the thread indices, threadIdx.x. This was probably for testing though.

Here is what you would get if you fix these issues:

#include<stdio.h>
#include<stdlib.h>
#include<string.h>
#include<cuda.h>

// Error-check helper: prints the CUDA error string together with the
// call site (file/line) and terminates the process when `abort` is
// nonzero (the default).
//
// `file` is `const char *` because the gpuErrchk macro passes the string
// literal __FILE__; converting a literal to `char *` is deprecated in
// C++ and rejected by modern compilers.  Diagnostics go to stderr so
// they are not lost if stdout is redirected or block-buffered.
inline void gpuAssert(cudaError_t code, const char *file, int line,
                      int abort = 1)
{
   if (code != cudaSuccess) {
      fprintf(stderr, "GPUassert: %s %s %d\n",
              cudaGetErrorString(code), file, line);
      if (abort) exit(code);
   }
}
// Wraps a CUDA runtime call so failures report the exact call site.
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }

// Device kernel: each thread copies the byte at its own index from the
// source buffer `k` into the destination buffer `i`.  The host launch
// must supply exactly one thread per byte to copy (a single block, no
// bounds check — the caller controls the thread count).
__global__ void calc(char* k, char *i)
{
       const int idx = threadIdx.x;  // one byte per thread
       i[idx] = k[idx];
}

// Host driver: stages "arun" in a host buffer, copies it to the device,
// runs one kernel thread per byte (including the terminating NUL), and
// copies the result back for printing.  Every CUDA call is checked.
int main()
{
        const char* msg = "arun";
        const size_t BUF = 10;                  // bytes in every buffer

        char *dev_i, *dev_k;                    // device source/destination
        char *i, *k;                            // host destination/staging

        k = (char*)malloc(BUF*sizeof(char));
        i = (char*)malloc(BUF*sizeof(char));

        // snprintf with an explicit "%s" avoids treating `msg` itself as a
        // format string (undefined behavior if the text ever contained a
        // '%') and can never overflow the 10-byte buffer.
        snprintf(k, BUF, "%s", msg);

        // Check allocations too — a failed cudaMalloc otherwise surfaces
        // as a confusing error at a later, unrelated call.
        gpuErrchk(cudaMalloc((void**)&dev_i, BUF*sizeof(char)));
        gpuErrchk(cudaMalloc((void**)&dev_k, BUF*sizeof(char)));

        gpuErrchk(cudaMemcpy(dev_k, k, BUF*sizeof(char), cudaMemcpyHostToDevice));

        // One thread per byte of "arun" plus its terminating NUL (5 total),
        // derived from the message instead of hard-coding 5.
        calc<<<1, (unsigned)strlen(msg) + 1>>>(dev_k, dev_i);

        gpuErrchk(cudaPeekAtLastError());
        // Synchronization will be done in the next synchronous cudaMemcpy
        // call; uncomment the line below to pinpoint execution errors.
        //gpuErrchk(cudaDeviceSynchronize());

        gpuErrchk(cudaMemcpy(i, dev_i, BUF*sizeof(char), cudaMemcpyDeviceToHost));

        printf("string : %s\n", i);

        cudaFree(dev_i);
        cudaFree(dev_k);
        free(i);
        free(k);

        return 0;
}
Community
  • 1
  • 1
BenC
  • 8,729
  • 3
  • 49
  • 68
  • The *cudaDeviceSynchronize()* is not necesary as the *cudaMemcpy()* is synchronizing as you are using the default stream 0. – KiaMorot May 16 '13 at 11:39
  • Indeed, I will add some more comments in the code. But he seems to be a new CUDA user, and as @talonmies said in his post, "This can be confusing for the beginner, and I would recommend using explicit synchronisation after a kernel launch during debugging to make it easier to understand where problems might be arising." – BenC May 16 '13 at 11:45
  • I'm getting an "out of memory" error when I run the program. Can you help me resolve this problem? – ParleBoy May 16 '13 at 13:02
  • Are you currently using this GPU for rendering or are you using Optimus? – BenC May 16 '13 at 13:06