
I wrote a cuda kernel like this

#include <iostream>
using std::cout;

__global__ void mykernel(int size, int *h){
  // allocations come from the device heap (in-kernel new)
  double **x = new double*[size];
  for(int i = 0; i < size; i++){
    x[i] = new double[2];
  }
  h[0] = 20;
}

int main(){
  int size = 2.5 * 100000; // or 10,000
  int *h = new int[size];
  int *u;
  size_t sizee = size * sizeof(int);
  cudaMalloc(&u, sizee);
  mykernel<<<size, 1>>>(size, u);
  cudaMemcpy(h, u, sizee, cudaMemcpyDeviceToHost);
  cout << h[0];
}

I have some other code in the kernel too, but I have commented it out. The code above it also allocates some more memory.

Now when I run this with size = 2.5*10^5, I get an h[0] value of 0.

When I run this with size = 100*100, I get an h[0] value of 20.

So I am guessing that my kernels are crashing because I am running out of memory. I am using a Tesla C2075 card, which has 2 GB of RAM! I even tried this after shutting down the X server. What I am working on is not even 100 MB of data.

How can I allocate more memory to each block?

Aditya
  • Also tried running the kernel per thread basis but to no avail. – Aditya Nov 10 '14 at 08:17
  • The code has some errors and does not compile. In addition, doing [proper error checking](http://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api) will help you (and us) for sure. – pQB Nov 10 '14 at 08:18
  • `2.5 * 10^5` is `2.5 times 10 xor 5` - is that REALLY what you want? – Mats Petersson Nov 10 '14 at 08:23
  • I am sorry, I am new to CUDA. The above code doesn't compile because it is mock code. The actual code is over 700 lines, so I identified the problem part from the whole code. I am pretty sure that if you debug and run this code you will not get the same problem. – Aditya Nov 10 '14 at 08:24
  • @MatsPetersson, I am sorry, it's not xor; it's 250,000. Wait, I will change that. – Aditya Nov 10 '14 at 08:25
  • Welcome to the GPU computing world :) As you are new to CUDA, error checking should be **mandatory** in every piece of code you write, until you understand the basic errors related to kernel launches. When you become a more experienced programmer, you won't remove the error checking, because you will have noticed you cannot work without it. That said, without code to reproduce the error and without any error checking, people can only guess what is happening... For example, could your problem be related to the stack size? – pQB Nov 10 '14 at 08:32
  • @pQB I read through that link you gave. I used `cudaError_t error = cudaGetLastError(); printf(cudaGetErrorString(error));` and I got no error as the message. – Aditya Nov 10 '14 at 08:41
  • How many threads are you using? – Mats Petersson Nov 10 '14 at 08:42
  • 500 threads. My supervisor told me that there are 900+ CUDA cores on the setup. @MatsPetersson – Aditya Nov 10 '14 at 08:42
  • You probably have the same problem as the [wrong results in cuda](http://stackoverflow.com/questions/26790825/wrong-results-in-cuda/26806164#26806164) question, and probably for the __same reason__. – pQB Nov 10 '14 at 08:48
  • @pQB, I have 14 streaming multiprocessors, and I tried sm_14 but it says unsupported architecture. My GPU has compute capability 2.0. – Aditya Nov 10 '14 at 09:03

1 Answer

Now when I run this with size = 2.5*10^5 I get h[0] value to be 0;

When I run this with size = 100*100 I get h[0] value to be 20;

In your kernel launch, you are using this size variable also:

mykernel<<<size, 1>>>(size, u);
           ^^^^

On a cc2.0 device (Tesla C2075), this particular parameter in the 1D case is limited to 65535. So 2.5*10^5 exceeds 65535, but 100*100 does not. Therefore, your kernel may be running if you specify size of 100*100, but is probably not running if you specify size of 2.5*10^5.
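As a minimal sketch (device index 0 assumed), you can confirm this limit by querying the device properties before launching:

```cuda
#include <cstdio>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // query device 0

    // maxGridSize[0] is the limit on the grid's x-dimension;
    // on a cc2.0 device such as the C2075 this is 65535.
    printf("max grid x-dimension: %d\n", prop.maxGridSize[0]);

    int size = 250000;
    if (size > prop.maxGridSize[0])
        printf("a <<<%d, 1>>> launch would fail on this device\n", size);
    return 0;
}
```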

As already suggested to you, proper CUDA error checking should point this error out to you, and in general will probably result in you needing to ask far fewer questions on SO, as well as posting higher-quality questions on SO. Take advantage of the CUDA runtime's ability to let you know when things have gone wrong and when you are making a mistake. Then you won't be in a quandary, thinking you have a memory allocation problem when in fact you probably have a kernel launch configuration problem.
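A common error-checking pattern looks like the sketch below (the `CUDA_CHECK` macro name is illustrative, not part of the CUDA API):

```cuda
#include <cstdio>
#include <cstdlib>

// Illustrative macro: wraps a runtime API call and aborts with a
// readable message if it did not return cudaSuccess.
#define CUDA_CHECK(call)                                           \
    do {                                                           \
        cudaError_t err = (call);                                  \
        if (err != cudaSuccess) {                                  \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",           \
                    cudaGetErrorString(err), __FILE__, __LINE__);  \
            exit(EXIT_FAILURE);                                    \
        }                                                          \
    } while (0)

// After a kernel launch, check both the launch and its completion:
//   mykernel<<<blocks, threads>>>(size, u);
//   CUDA_CHECK(cudaGetLastError());       // catches launch-configuration errors
//   CUDA_CHECK(cudaDeviceSynchronize());  // catches asynchronous execution errors
```

An invalid grid dimension would be reported by the `cudaGetLastError()` check immediately after the launch.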

How can I allocate more memory to each block?

Although it is probably not your main issue (as indicated above), in-kernel new and malloc are limited to the size of the device heap. Once this has been exhausted, further calls to new or malloc will return a null pointer. If you use this null pointer anyway, your kernel code will begin to perform unspecified behavior, and will likely crash.

When using new and malloc, especially when you're having trouble, it's good practice to check for a null return value. This applies to both host (at least for malloc) and device code.
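A minimal sketch of such a check in device code (the kernel and parameter names are illustrative):

```cuda
__global__ void alloc_kernel(int n, int *ok) {
    // In-kernel new draws from the device heap; it returns nullptr
    // (it does not throw) when the heap is exhausted.
    double *p = new double[n];
    if (p == nullptr) {
        *ok = 0;     // report failure instead of dereferencing nullptr
        return;
    }
    p[0] = 1.0;
    *ok = 1;
    delete[] p;      // release the device-heap allocation when done
}
```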

The size of the device heap is pretty small to begin with (8MB), but it can be modified.

Referring to the documentation:

The device memory heap has a fixed size that must be specified before any program using malloc() or free() is loaded into the context. A default heap of eight megabytes is allocated if any program uses malloc() without explicitly specifying the heap size.

The following API functions get and set the heap size:

•cudaDeviceGetLimit(size_t* size, cudaLimitMallocHeapSize)

•cudaDeviceSetLimit(cudaLimitMallocHeapSize, size_t size)

The heap size granted will be at least size bytes. cuCtxGetLimit() and cudaDeviceGetLimit() return the currently requested heap size.
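For example, to raise the heap to 64 MB before any kernel that uses in-kernel new/malloc runs (the 64 MB figure is just an illustration; size it to your workload):

```cuda
#include <cstdio>

int main() {
    // Must be called before any kernel using malloc()/new runs
    // in this context, or the 8 MB default applies.
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 64 * 1024 * 1024);

    size_t heap = 0;
    cudaDeviceGetLimit(&heap, cudaLimitMallocHeapSize);
    printf("device heap size: %zu bytes\n", heap);  // at least 64 MB
    return 0;
}
```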

Robert Crovella
  • I came back from the office by the time you wrote this answer, sir, so I will have to try this out tomorrow. But I wanted to ask you: why didn't it throw any runtime errors when I tried `cudaError_t status = cudaGetLastError(); cout << cudaGetErrorString(status);`? – Aditya Nov 10 '14 at 16:55
  • Well, according to your own statements, you haven't provided the real code, right? On a Tesla C2075, this could not possibly be correct: `mykernel<<<size, 1>>>(size, u);` for `size = 2.5 * 100000`. That first kernel launch parameter is limited to 65535 on cc2.0 devices. And I don't know where you are putting that error-checking line of code. If you put it after the kernel call, it should report a failure. Why don't you [provide an MCVE](http://stackoverflow.com/help/mcve)? I'm not asking for *your whole code*, but something like what you have posted that demonstrates the error. – Robert Crovella Nov 10 '14 at 17:21
  • I am working on a parallel K-means algorithm. My kernel call has 10 parameters and 5 cudaMallocs and cudaMemcpys before and after. The error-checking line is exactly after the kernel call. How can I provide a minimal version of this code? – Aditya Nov 10 '14 at 18:05
  • Keep deleting elements of your code until the problem goes away. This is a pretty standard debugging technique: reduce a problem down to its necessary elements. Did you read the MCVE link? It's pretty instructive. And you could simply take what you've shown in this question and, on your own, convert it into a complete, compilable code that demonstrates the error. After all, if your claim is that the code you've posted is representative of the problem, this should be pretty straightforward. I could do it (but *you* are supposed to), and would end up with the `size` issue already mentioned. – Robert Crovella Nov 10 '14 at 18:34