I am new to CUDA programming and have access to a "Tesla K10" device. I am working on a complex problem that needs about 20 KB of memory per problem instance. To exploit the parallelism CUDA provides, I have decided to use 96 threads per block (a multiple of the warp size, 32) to solve one instance of the problem. The issue is that I have a very large number of such instances to solve (more than 1,600,000). Together they need about 1,600,000 × 20 KB ≈ 32 GB, which will not fit even in global memory (3.5 GB in my case, as shown in the deviceQuery output below), so I have to solve a limited number of problems at a time.
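To put numbers on the batching described above, here is a minimal host-side sketch. The helper names `problemsPerBatch` and `batchCount` and the 10% headroom are assumptions made for illustration, not part of my actual code. It asks the runtime how much global memory is free and derives how many 20 KB problems could be resident at once.

```cuda
#include <cstdio>
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical helper: how many problems fit into usableBytes of global memory.
static size_t problemsPerBatch(size_t usableBytes, size_t bytesPerProblem) {
    return usableBytes / bytesPerProblem;
}

// Hypothetical helper: number of batches needed (ceiling division).
static size_t batchCount(size_t totalProblems, size_t perBatch) {
    return (totalProblems + perBatch - 1) / perBatch;
}

int main() {
    const size_t bytesPerProblem = 20 * 1024;  // 20 KB per problem instance
    const size_t totalProblems   = 1600000;    // figure taken from the question

    size_t freeBytes = 0, totalBytes = 0;
    cudaError_t err = cudaMemGetInfo(&freeBytes, &totalBytes);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMemGetInfo failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Assumption: leave 10% of the free memory for outputs and other allocations.
    const size_t usable   = freeBytes / 10 * 9;
    const size_t perBatch = problemsPerBatch(usable, bytesPerProblem);
    if (perBatch == 0) {
        fprintf(stderr, "not enough free global memory for a single problem\n");
        return 1;
    }

    printf("free global memory : %zu bytes\n", freeBytes);
    printf("problems per batch : %zu\n", perBatch);
    printf("batches needed     : %zu\n", batchCount(totalProblems, perBatch));
    return 0;
}
```

The real batch size would also have to account for any output buffers, so the figure printed here is only an upper bound under the stated assumptions.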
Each problem instance is mapped to one block, so one block solves one instance.
At present I am able to solve a large number of problems with the data in global memory. Since shared memory is faster than global memory, I am planning to keep each problem's 20 KB in shared memory.
1) My confusion is that this seems to permit only 2 problems to be loaded into shared memory and solved at a time (i.e., 40 KB < 48 KB of shared memory). This is based on my understanding of CUDA; please correct me if I am wrong.
2) If I declare a 20 KB array in the kernel, does that mean the shared memory use will be (20 KB * number_of_blocks)? By number_of_blocks I mean the number of problems to be solved. My launch configuration is problem<<<number_of_blocks, 96>>>(...)
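To make the two questions concrete, here is a minimal sketch of the setup being described. It assumes the data are `float`s (so 20 KB is 5120 elements), uses a placeholder computation instead of the real solver, and assumes a toolkit recent enough (CUDA 6.5 or later) to provide `cudaOccupancyMaxActiveBlocksPerMultiprocessor`. The helper `blocksPerSmBySharedMem` is hypothetical. A `__shared__` array is allocated once per block, and the sketch asks the runtime how much shared memory one block uses and how many blocks can be resident on one multiprocessor at a time.

```cuda
#include <cstdio>
#include <cstddef>
#include <cuda_runtime.h>

#define THREADS_PER_BLOCK 96
#define BYTES_PER_PROBLEM (20 * 1024)
// Assumption: the problem data are floats, so 20 KB is 5120 elements.
#define ELEMS_PER_PROBLEM ((int)(BYTES_PER_PROBLEM / sizeof(float)))

// Hypothetical helper: upper bound on resident blocks per multiprocessor
// when shared memory is the only limiting resource.
static size_t blocksPerSmBySharedMem(size_t smemPerSm, size_t smemPerBlock) {
    return smemPerSm / smemPerBlock;
}

// One block handles one problem; the 20 KB array exists once per block.
__global__ void problem(const float* in, float* out) {
    __shared__ float work[ELEMS_PER_PROBLEM];
    const size_t base = (size_t)blockIdx.x * ELEMS_PER_PROBLEM;

    // Stage this block's problem from global into shared memory.
    for (int i = threadIdx.x; i < ELEMS_PER_PROBLEM; i += blockDim.x)
        work[i] = in[base + i];
    __syncthreads();

    // Placeholder for the real computation.
    for (int i = threadIdx.x; i < ELEMS_PER_PROBLEM; i += blockDim.x)
        out[base + i] = 2.0f * work[i];
}

int main() {
    const int numBlocks = 64;  // small illustrative batch, one block per problem
    const size_t bytes = (size_t)numBlocks * BYTES_PER_PROBLEM;

    float *dIn = NULL, *dOut = NULL;
    if (cudaMalloc(&dIn, bytes) != cudaSuccess ||
        cudaMalloc(&dOut, bytes) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(dIn, 0, bytes);

    problem<<<numBlocks, THREADS_PER_BLOCK>>>(dIn, dOut);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Static shared memory used by ONE block of this kernel.
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, problem);

    // Blocks of this kernel that can be resident on one multiprocessor at once.
    int resident = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&resident, problem,
                                                  THREADS_PER_BLOCK, 0);

    printf("shared memory per block : %zu bytes\n", attr.sharedSizeBytes);
    printf("resident blocks per SM  : %d\n", resident);
    printf("bound from 48 KB per SM : %zu\n",
           blocksPerSmBySharedMem(48 * 1024, BYTES_PER_PROBLEM));

    cudaFree(dIn);
    cudaFree(dOut);
    return 0;
}
```

The grid here has more blocks than can be resident at once. The launch is still valid, because blocks that do not fit are scheduled as earlier ones finish, so the per-multiprocessor limit bounds concurrency rather than the grid size.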
Any help in this regard will be highly appreciated. Thank you in advance.
***My partial Device Query***
Device : "Tesla K10.G1.8GB"
CUDA Driver Version / Runtime Version 6.5 / 5.0
CUDA Capability Major/Minor version number: 3.0
Total amount of global memory: 3584 MBytes (3757637632 bytes)
( 8) Multiprocessors, (192) CUDA Cores/MP: 1536 CUDA Cores
GPU Clock rate: 745 MHz (0.75 GHz)
Memory Clock rate: 524 Mhz
Memory Bus Width: 2048-bit
L2 Cache Size: 4204060 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 2046), 65536 layers
Total amount of constant memory: 65536 bytes
**Total amount of shared memory per block: 49152 bytes**
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 0
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
...