How many blocks can be allocated if I use shared memory?

Question

I am new to CUDA programming. I have access to a "Tesla K10" device. I am working on a complex problem that needs about 20 KBytes of memory per instance of the problem. Since CUDA provides parallelism, I have decided to use 96 threads per block (keeping warps in mind) to solve one instance of the problem. The issue is that I have a very large number of such problems to solve (say, more than 1,600,000). I am aware that such a large memory requirement will not fit even in global memory (which in my case is 3.5 GBytes, as shown in the DeviceQuery output below), so I have to solve a limited number of problems at a time.

Also, I have mapped each problem to one block, so that each block solves one instance of the problem.

At present I am able to solve a large number of problems with the data in global memory. But since shared memory is faster than global memory, I am planning to use 20 KBytes of shared memory per problem.

1) My confusion is that this will permit only 2 problems to be loaded into shared memory and solved at a time (i.e., 40 KBytes < 48 KBytes of shared memory). This is based on my understanding of CUDA; please correct me if I am wrong.

2) If I declare an array of 20 KBytes in the kernel, does it mean that (20 KBytes * number_of_blocks) will be the shared memory use? By number_of_blocks I mean the number of problems to be solved. My launch configuration is problem<>>(...)

All your help in this regard will be highly appreciated. Thank you in advance.

***My partial Device Query***

Device : "Tesla K10.G1.8GB"
  CUDA Driver Version / Runtime Version          6.5 / 5.0
  CUDA Capability Major/Minor version number:    3.0
  Total amount of global memory:                 3584 MBytes (3757637632 bytes)
  ( 8) Multiprocessors, (192) CUDA Cores/MP:     1536 CUDA Cores
  GPU Clock rate:                                745 MHz (0.75 GHz)
  Memory Clock rate:                             524 Mhz
  Memory Bus Width:                              2048-bit
  L2 Cache Size:                                 4204060 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 2046), 65536 layers
  Total amount of constant memory:               65536 bytes
  **Total amount of shared memory per block:       49152 bytes**
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  0
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
...

Sorry, I was missing the launch configuration: problem<<<number_of_blocks, 96>>>( ... ) — Amit, May 13 '15 at 08:35

score 3 · Accepted Answer · edited May 23 '17 at 11:45

First, a quick summary to check I understand correctly:

- You have ~1.5M problems to solve; these are completely independent, i.e. embarrassingly parallel
- Each problem has a data set of ~20 KB

Holding the whole problem set at once would require >30 GB of memory, so it's clear that you will need to split the set of problems into batches. With your 4 GB card (~3.5 GB usable with ECC on etc.) you can fit about 150,000 problems at any time. If you were to double buffer these to allow concurrent transfer of the next batch with the computation of the current batch, then you're looking at 75K problems in a batch (maybe fewer if you need space for output etc.).
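The batch arithmetic above can be sketched in host code. This is only an illustration: `problems_per_batch` and `suggest_batch_size` are hypothetical helper names, the 20 KB footprint is taken from the question, and the reserve fraction is an assumed knob for output buffers and other allocations. `cudaMemGetInfo` is the standard runtime call for querying free device memory.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Per-problem footprint taken from the question (~20 KB per instance).
constexpr size_t kBytesPerProblem = 20 * 1024;

// Number of problems that fit when `usable_bytes` is split into
// `num_buffers` equally sized buffers (1 = single buffer, 2 = double buffering).
inline size_t problems_per_batch(size_t usable_bytes,
                                 size_t bytes_per_problem,
                                 size_t num_buffers) {
    if (bytes_per_problem == 0 || num_buffers == 0) return 0;
    return (usable_bytes / num_buffers) / bytes_per_problem;
}

// Hypothetical helper: query the free memory on the current device and
// suggest a batch size, holding back `reserve_fraction` (0.0 to 1.0) of it
// for results and other allocations. Returns 0 if the query fails.
inline size_t suggest_batch_size(double reserve_fraction, size_t num_buffers) {
    size_t free_bytes = 0, total_bytes = 0;
    if (cudaMemGetInfo(&free_bytes, &total_bytes) != cudaSuccess) return 0;
    size_t usable = static_cast<size_t>(free_bytes * (1.0 - reserve_fraction));
    return problems_per_batch(usable, kBytesPerProblem, num_buffers);
}
```

Feeding in the 3757637632 bytes reported by the device query gives upper bounds of 183478 problems for a single buffer and 91739 for two buffers; the rounder figures quoted above leave headroom for outputs and whatever else lives on the device.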

The first important thing to consider is whether you can parallelise each problem, i.e. is there a way to assign multiple threads to a single problem? If so, you should look at assigning a block of threads to solve an individual problem. Using shared memory may then be worth considering, although at 20 KB per block you would be limiting your occupancy to two blocks per SM, which may hurt performance.

If you cannot parallelise within a problem, then you should not be considering shared memory since, as you say, you would be limiting yourself to two threads per SM (fundamentally eliminating the benefit of GPU computing). Instead you would need to ensure that the data layout in global memory is such that you can achieve coalesced accesses - this most likely means using an SoA (struct of arrays) layout instead of AoS (array of structs).
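To make the SoA suggestion concrete, here is a minimal sketch assuming one thread per problem and `float` data. The helper names `soa_index` and `aos_index` and the kernel `sum_per_problem_soa` are illustrative only, and the per-problem "work" is just a sum standing in for the real solver.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// SoA: element `e` of every problem is stored contiguously, so threads
// handling problems p, p+1, p+2, ... read neighbouring addresses.
__host__ __device__ inline size_t soa_index(size_t e, size_t p,
                                            size_t num_problems) {
    return e * num_problems + p;
}

// AoS, for comparison: each problem's elements are contiguous, so
// neighbouring threads are `elems_per_problem` elements apart.
__host__ __device__ inline size_t aos_index(size_t e, size_t p,
                                            size_t elems_per_problem) {
    return p * elems_per_problem + e;
}

// One thread per problem; the sum is a placeholder for the real work.
__global__ void sum_per_problem_soa(const float* data, float* out,
                                    size_t num_problems,
                                    size_t elems_per_problem) {
    size_t p = blockIdx.x * static_cast<size_t>(blockDim.x) + threadIdx.x;
    if (p >= num_problems) return;
    float acc = 0.0f;
    for (size_t e = 0; e < elems_per_problem; ++e) {
        // In each loop iteration a warp touches 32 consecutive floats.
        acc += data[soa_index(e, p, num_problems)];
    }
    out[p] = acc;
}
```

With the SoA index, adjacent threads are one element apart in memory for any given `e`; with the AoS index they would be a whole problem (20 KB here) apart, which is what defeats coalescing.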

Your second question is a little confusing; it's not clear whether you mean "block" in the GPU context or in the problem context. Fundamentally, however, if you declare a __shared__ array of 20 KB in your kernel code, then that array will be allocated once per block, and each block will see it at the same base address.
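As a rough illustration of the per-block allocation, the sketch below maps one problem to one block of 96 threads, as in the question. It assumes the 20 KB is stored as 5120 floats; the kernel name `solve_problems`, the wrapper `launch_solve`, and the sum used as the "solver" are placeholders, not the OP's code.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

constexpr int kThreadsPerBlock  = 96;    // three warps, as in the question
constexpr int kFloatsPerProblem = 5120;  // 5120 * 4 bytes = 20 KB (assumes float data)

__global__ void solve_problems(const float* problems, float* results) {
    // One copy of this array exists per resident block. It is not per
    // thread, and it is not multiplied by the total number of blocks in
    // the grid: blocks that are not currently running hold no shared memory.
    __shared__ float work[kFloatsPerProblem];

    const float* my_problem =
        problems + static_cast<size_t>(blockIdx.x) * kFloatsPerProblem;

    // The block's threads cooperatively stage the problem into shared memory.
    for (int i = threadIdx.x; i < kFloatsPerProblem; i += blockDim.x) {
        work[i] = my_problem[i];
    }
    __syncthreads();

    // Placeholder for the real solver: thread 0 sums the staged data.
    if (threadIdx.x == 0) {
        float acc = 0.0f;
        for (int i = 0; i < kFloatsPerProblem; ++i) acc += work[i];
        results[blockIdx.x] = acc;
    }
}

// Hypothetical launch wrapper: one block per problem in the current batch.
inline cudaError_t launch_solve(const float* d_problems, float* d_results,
                                int number_of_blocks) {
    solve_problems<<<number_of_blocks, kThreadsPerBlock>>>(d_problems, d_results);
    return cudaGetLastError();
}
```

Staging data into shared memory only pays off if each element is reused several times by the block; if every value is read once, reading it directly from global memory is likely just as fast.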

Update following OP's comments

The GPU contains a number of SMs, and each SM has a small physical memory which is used both for the L1 and shared memory. In your case, K10, each SM has 48 KB available for use as shared memory, meaning that all the blocks executing on the SM at any time can use up to 48 KB between them. Since you need 20 KB per block, you can have a maximum of two blocks executing on the SM at any time. This doesn't affect how many blocks you can set in your launch configuration, it merely affects how they are scheduled. This answer talks in a bit more detail (albeit for a device with 16 KB per SM) and this (very old) answer explains a little more, although probably the most helpful (and up-to-date) info is on the CUDA education pages.
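The scheduling limit described here can be checked in two ways, sketched below under the same assumptions as before (20 KB stored as 5120 floats, 96 threads per block). `blocks_per_sm_by_shared` is a hypothetical helper that only does the division; `query_resident_blocks` asks the runtime via `cudaOccupancyMaxActiveBlocksPerMultiprocessor`, which is available from CUDA 6.5 onwards, so it would need a newer runtime than the 5.0 shown in the device query.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

constexpr int kThreadsPerBlock  = 96;
constexpr int kFloatsPerProblem = 5120;  // 20 KB, assuming float data

// Stand-in kernel that statically declares 20 KB of shared memory.
__global__ void uses_20kb_shared(float* out) {
    __shared__ float work[kFloatsPerProblem];
    for (int i = threadIdx.x; i < kFloatsPerProblem; i += blockDim.x) {
        work[i] = 1.0f;
    }
    __syncthreads();
    if (threadIdx.x == 0) out[blockIdx.x] = work[kFloatsPerProblem - 1];
}

// Upper bound on resident blocks per SM from shared memory alone.
// Registers, thread count and hardware block limits can lower it further.
// Returns 0 when smem_per_block is 0 (no shared-memory bound computed).
inline int blocks_per_sm_by_shared(size_t smem_per_sm, size_t smem_per_block) {
    if (smem_per_block == 0) return 0;
    return static_cast<int>(smem_per_sm / smem_per_block);
}

// Ask the runtime how many blocks of the kernel can be resident on one SM.
// Returns -1 if the query fails.
inline int query_resident_blocks() {
    int blocks = 0;
    cudaError_t err = cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocks, uses_20kb_shared, kThreadsPerBlock, 0 /* dynamic smem */);
    return err == cudaSuccess ? blocks : -1;
}
```

For 48 KB per SM and 20 KB per block the division gives 2, matching the reasoning above. The grid size passed at launch does not enter into either calculation; the remaining blocks simply wait until a resident block finishes.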

Thank you for the quick response, Tom, and sorry I was not clear in putting my query. Yes, the problem can be partially parallelised, and wherever possible (to my knowledge) I have parallelised it using the 96 threads per block. You suggested concurrent transfer of the next batch; I think I can use the remaining devices that I have (I have 8 devices in all, so I can use a multi-GPU approach, which I will look at later). Yes, performance with regard to timing is definitely my main concern. — Amit, May 13 '15 at 10:26
My second point of confusion: since the device query shows "total shared memory is 48KB per block", does this mean that I can have any number of blocks in a particular launch configuration, with every block limited to 48 KB? Or, for example, if the launch configuration is Problem<<<100, 96>>>( ... ) with 20 KB of shared memory declared in the kernel, will it require 2000 KB in all? — Amit, May 13 '15 at 10:28
Thank you, Tom. Your links ("This answer" and "this (very old) answer") did help to clear up my confusion. — Amit, May 14 '15 at 06:45
