Now when I run this with size = 2.5*10^5 I get h[0] value to be 0;
When I run this with size = 100*100 I get h[0] value to be 20;
In your kernel launch, you are using this size
variable also:
mykernel<<<size, 1>>>(size, u);
^^^^
On a cc2.0 device (Tesla C2075), this particular parameter in the 1D case is limited to 65535. So 2.5*10^5 exceeds 65535, but 100*100 does not. Therefore, your kernel may be running if you specify size
of 100*100, but is probably not running if you specify size
of 2.5*10^5.
As already suggested to you, proper cuda error checking should point this error out to you, and in general will probably result in you needing to ask far fewer questions on SO, as well as posting higher-quality questions on SO. Take advantage of the CUDA runtime's ability to let you know when things have gone wrong and when you are making a mistake. Then you won't be in a quandary, thinking you have a memory allocation problem when in fact you probably have a kernel launch configuration problem.
How can I allocate more memory to each block?
Although it is probably not your main issue (as indicated above), in-kernel new
and malloc
are limited to the size of the device heap. Once this has been exhausted, further calls to new
or malloc
will return a null pointer. If you use this null pointer anyway, your kernel code will begin to perform unspecified behavior, and will likely crash.
When using new
and malloc
, especially when you're having trouble, it's good practice to check for a null return value. This applies to both host (at least for malloc
) and device code.
The size of the device heap is pretty small to begin with (8MB), but it can be modified.
Referring to the documentation:
The device memory heap has a fixed size that must be specified before any program using malloc() or free() is loaded into the context. A default heap of eight megabytes is allocated if any program uses malloc() without explicitly specifying the heap size.
The following API functions get and set the heap size:
•cudaDeviceGetLimit(size_t* size, cudaLimitMallocHeapSize)
•cudaDeviceSetLimit(cudaLimitMallocHeapSize, size_t size)
The heap size granted will be at least size bytes. cuCtxGetLimit()and cudaDeviceGetLimit() return the currently requested heap size.