
Say block 0 uses

__shared__ int a[VERY_LARGE];
__shared__ char b[VERY_LARGE];

and block 1 uses

__shared__ double c[VERY_LARGE];
__shared__ int d[VERY_LARGE];

Say a and b together occupy all 48 KB of shared memory on one SM, and c and d likewise occupy all 48 KB on one SM.

Is it possible to allocate a and b on one SM, and c and d on another SM?
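For reference, the 48 KB figure can be confirmed with a device query; a minimal sketch, assuming device 0 and omitting error checking:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // device 0 assumed

    // On Fermi-class devices in the default cache configuration this
    // prints 49152, i.e. the 48 KB mentioned above.
    printf("shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    return 0;
}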

yidiyidawu
  • Assigning thread blocks to Streaming Multiprocessors is the scheduler's job, not the programmer's. So you have no guarantee that the scheduler will decide to run blocks `0` and `1` on different Streaming Multiprocessors. Be warned also that, to get the best performance out of a GPU, _roughly speaking_, all the threads should execute the same instruction "at the same time". To achieve what you are saying in your post you would need conditional shared memory allocation, which makes me think that you will end up having other conditional statements. This may impact performance. – Vitality May 23 '14 at 07:22
  • @JackOLantern: That would be a perfect answer to a question whose short answer is "NO". Could you add it? I'll make sure it is upvoted to get this off the unanswered list. – talonmies May 23 '14 at 10:08

1 Answer


Short answer: NO.

Assigning thread blocks to Streaming Multiprocessors is the scheduler's job, not the programmer's. So you have no guarantee that the scheduler will decide to run blocks 0 and 1 on different Streaming Multiprocessors. The Stack Overflow thread

How CUDA Blocks/Warps/Threads map onto CUDA Cores?

will help you understand how this scheduling works. The whitepaper

NVIDIA’s Next Generation CUDA Compute Architecture: Fermi

although specific to Fermi, will give you deeper insight.

Be warned also that, to get the best performance out of a GPU, roughly speaking, all the threads should execute the same instruction "at the same time". To achieve what you describe in your post, you would need conditional shared memory allocation, which makes me think that you will end up with other conditional statements as well. This may impact performance.
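As an illustration of that point, a minimal sketch of what such conditional allocation might look like. The kernel name, the even/odd block split, and the size N are hypothetical; one dynamic allocation is reinterpreted per block, since static __shared__ declarations are sized identically for every block at compile time:

#define N 4096  // hypothetical size: double c[N] + int d[N] fills 48 KB

__global__ void kernel()
{
    extern __shared__ char smem[];  // 48 KB requested at launch time

    if (blockIdx.x % 2 == 0) {
        // Even blocks carve the buffer into int a[N] followed by char b[N].
        int  *a = reinterpret_cast<int *>(smem);
        char *b = smem + N * sizeof(int);
        a[threadIdx.x] = threadIdx.x;  // placeholder work
        b[threadIdx.x] = 0;
    } else {
        // Odd blocks carve the same buffer into double c[N] followed by int d[N].
        double *c = reinterpret_cast<double *>(smem);
        int    *d = reinterpret_cast<int *>(smem + N * sizeof(double));
        c[threadIdx.x] = 0.0;          // placeholder work
        d[threadIdx.x] = threadIdx.x;
    }
}

// Launched with the full 48 KB of dynamic shared memory per block, e.g.
// kernel<<<numBlocks, 256, 48 * 1024>>>();

The branch on blockIdx.x is exactly the kind of conditional statement warned about above.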

Vitality
  • I understand that this is the scheduler's job, not the programmer's. But the programmer could allocate all 48 KB of shared memory to make sure that only one block runs on each SM, right? – yidiyidawu May 23 '14 at 14:03
  • @yidiyidawu: As it happens, yes: if you launch a kernel that uses 48 KB of shared memory, it will only be able to run one block per SM (see the sketch after these comments). But why do you think it's important to be able to do this? The hardware enforces alignment restrictions that effectively deconflict blocks from one another, such that if your threads obey the well-documented rules concerning shared memory access, they will run concurrently (at the well-documented warp granularity) with no issues. – ArchaeaSoftware May 24 '14 at 13:00
  • @ArchaeaSoftware The reason is that the project (research) I am doing has to keep parts of the data on chip at all times for security reasons. I also need as much on-chip storage as possible. So the only way I see now is to enforce one block per SM. Communication between SMs has to go through global memory, and that is not acceptable. – yidiyidawu May 26 '14 at 04:33
  • Sadly, the biggest on-chip storages on a GPU are the register files. There is little difference between the GPU memory model (inter-SM communication goes through global memory via the chip-wide L2 cache) and the CPU memory model (inter-core communication goes through system memory via the chip-wide L3 cache). Best of luck with your research. – ArchaeaSoftware May 27 '14 at 13:08
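To verify the one-block-per-SM point from the comments above, a minimal sketch (the kernel name and block size are hypothetical, error checking is omitted, and cudaOccupancyMaxActiveBlocksPerMultiprocessor requires CUDA 6.5 or later):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void bigSmemKernel()
{
    extern __shared__ char smem[];  // sized at launch time
    smem[threadIdx.x] = 0;          // touch it so the kernel is not empty
}

int main()
{
    const size_t smemBytes = 48 * 1024;  // one block claims the whole 48 KB
    int blocksPerSM = 0;

    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, bigSmemKernel, 256 /* threads per block */, smemBytes);

    printf("resident blocks per SM: %d\n", blocksPerSM);  // expect 1 here
    return 0;
}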