
I am trying to understand the basic architecture of a GPU. I have gone through a lot of material, including this very good SO answer, but I am still confused and not able to get a good picture of it.

My Understanding:

  • A GPU contains two or more Streaming Multiprocessors (SMs), depending upon the compute capability value.
  • Each SM consists of Streaming Processors (SPs), which are actually responsible for the execution of instructions.
  • Each block is processed by the SPs in the form of warps (32 threads each).
  • Each block has access to a shared memory; one block cannot access the data in another block's shared memory. (A minimal kernel sketch illustrating my mental model follows this list.)
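
For reference, this is the kind of minimal kernel I have in mind (just a sketch; the names are placeholders I made up): each block stages data in its own shared memory, which no other block can see, and the block's threads execute in warps of 32.

```cuda
// Sketch of my mental model: per-block shared memory plus warps of 32 threads.
__global__ void scaleTile(const float *in, float *out, int n)
{
    __shared__ float tile[256];              // visible only to this block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tile[threadIdx.x] = in[i];
    __syncthreads();                         // all warps of this block wait here
    if (i < n) out[i] = 2.0f * tile[threadIdx.x];
}

int main()
{
    const int n = 1024;
    float *d_in, *d_out;
    cudaMalloc((void **)&d_in,  n * sizeof(float));
    cudaMalloc((void **)&d_out, n * sizeof(float));
    // 256 threads per block = 8 warps per block; 4 blocks in total
    scaleTile<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaDeviceSynchronize();
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```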

Confusion:

In the following image, I am not able to understand which one is the Streaming Multiprocessor (SM) and which one is the SP. I think that Multiprocessor-1 represents a single SM and Processor-1 (up to M) represents a single SP. But I am not sure about this, because I can see that each Processor (in blue) has been provided a Register, whereas as far as I know, a register is provided to a thread unit.

It would be very helpful if you could provide a basic overview with respect to this image, or any other image.

[image: CUDA hardware model diagram showing a device with Multiprocessor-1 to Multiprocessor-N (orange), each containing Processor-1 to Processor-M (blue) with per-processor registers, plus a shared memory]

user2756695
  • Not really an "answer-grade" response, but a few answer-ish comments: 1. The number of SMs per GPU depends on the GPU model, not the compute capability. 2. A thread block is assigned to an SM, not an SP. 3. Orange boxes are SMs (as they are labeled). Each SM has a shared memory pool, divided between the thread blocks running on this SM. 4. Blue boxes are SPs. An SP is a scalar lane and runs one thread. Each thread is provided with a set of registers, as shown on the diagram. – void_ptr Aug 26 '15 at 16:21
  • @void_ptr if you want to write up an answer like that, I would upvote it. – Robert Crovella Aug 26 '15 at 20:26
  • I can't tell, but you might be confused about SP. SP is a stream processor, and processes a single thread. You have some number of these on a SM, and each SP runs one thread of a warp. The warp exists at the SM level, individual threads exist at the SP level. – Patrick87 Aug 26 '15 at 20:46

1 Answer


First, some comments on the "My understanding" portion of the question:

  • The number of SMs depends on the GPU model - there are low-end models with just one SM, and high-end ones with as many as 30! Compute capability defines what those SMs are capable of, but not how many SMs there are in a GPU.
  • Each thread block is assigned to an SM, not to an SP. There can be multiple thread blocks running on a given SM, subject to its resource limitations (the sketch after this list lets you observe which SM each block lands on).
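
As an aside, here is a small sketch of mine (the kernel name is made up) that records which SM each block actually landed on, using the %smid PTX special register. The assignment is made by the hardware scheduler and is not under program control, so this is purely for inspection.

```cuda
#include <cstdio>

// Record the SM id of each thread block via the %smid special register.
__global__ void whichSM(int *smOfBlock)
{
    if (threadIdx.x == 0) {
        unsigned int smid;
        asm("mov.u32 %0, %%smid;" : "=r"(smid));
        smOfBlock[blockIdx.x] = smid;
    }
}

int main()
{
    const int nBlocks = 8;
    int h_sm[nBlocks];
    int *d_sm;
    cudaMalloc((void **)&d_sm, nBlocks * sizeof(int));

    whichSM<<<nBlocks, 128>>>(d_sm);
    cudaMemcpy(h_sm, d_sm, nBlocks * sizeof(int), cudaMemcpyDeviceToHost);

    for (int b = 0; b < nBlocks; ++b)
        printf("block %d ran on SM %d\n", b, h_sm[b]);

    cudaFree(d_sm);
    return 0;
}
```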

On to the diagram:

  • Orange boxes are indeed SMs, just as they are labeled. Each SM has a shared memory pool, divided between all thread blocks running on this SM.
  • Blue boxes are SPs. Since an SP is a scalar lane, it runs one thread, and each thread is provided with its own set of registers, again, just like the diagram shows. (The query sketch after this list prints these per-SM and per-block quantities for your card.)
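
If you want to see these quantities for your own card, they can be queried at runtime with cudaGetDeviceProperties; a minimal sketch (my addition, assuming device 0):

```cuda
#include <cstdio>

// Print the per-device and per-block resources discussed above for device 0.
int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    printf("device                  : %s\n", prop.name);
    printf("SMs (multiprocessors)   : %d\n", prop.multiProcessorCount);
    printf("warp size               : %d\n", prop.warpSize);
    printf("registers per block     : %d\n", prop.regsPerBlock);
    printf("shared memory per block : %zu bytes\n", prop.sharedMemPerBlock);
    printf("max threads per SM      : %d\n", prop.maxThreadsPerMultiProcessor);
    return 0;
}
```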

Addressing the follow-up question:

  • Each SM can have multiple resident thread blocks. The maximum number of thread blocks resident on an SM is determined by the compute capability. The achieved number can be lower than that maximum when it is limited by the number of registers or the amount of shared memory consumed by each thread block.
  • The SM will then schedule instructions from all warps resident on it, picking among warps that have instructions ready for execution - and those warps may come from any thread block resident on this SM. You generally want to have many warps resident, so that at any given moment the SPs can be kept busy running instructions from whatever warps are ready.
  • The number of cores per SM is not a very useful metric, and you need not think too much about it at this point. (If you are curious, the occupancy sketch below shows how many blocks of a given kernel actually fit on one SM.)
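
For the curious, the runtime can report how many blocks of a particular kernel can be resident on one SM - the "achieved" number mentioned above. A minimal sketch, using a placeholder kernel with 512 threads per block (assumes CUDA 6.5 or newer for the occupancy API):

```cuda
#include <cstdio>

// Placeholder kernel: 512 threads per block, some static shared memory.
__global__ void dummy(float *data)
{
    __shared__ float tile[512];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = data[i];
    __syncthreads();
    data[i] = tile[threadIdx.x];
}

int main()
{
    int blocksPerSM = 0;
    // How many blocks of 'dummy' (512 threads each) fit on one SM,
    // given its register and shared-memory usage?
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, dummy, 512, 0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    printf("resident blocks per SM : %d\n", blocksPerSM);
    printf("resident warps per SM  : %d\n", blocksPerSM * 512 / prop.warpSize);
    return 0;
}
```
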
void_ptr
  • thank you for your answer. My confusion has been resolved. But I am still not sure about one thing. Let's say I have an image of size `512 x 512` and I use a single-dimensional grid to process this image. Let's say `nThread = 512` and hence `nBlocks = 512`, so `nWarpsPerBlock = 16`. If my GPU has 2 SMs with 192 cores each, how would the actual processing take place? Will `192/32 = 6` warps of a single block be assigned to each SM at the same time? Can the warps of different blocks get assigned to SMs? Is it possible that the same SM is processing warps of different blocks? – user2756695 Aug 27 '15 at 10:36
  • Answer updated to address those additional questions. – void_ptr Aug 27 '15 at 17:51
  • Also see this great answer with more details: http://stackoverflow.com/questions/23211781/stream-multiprocessor-core-per-streamprocessor-in-cuda?rq=1 – void_ptr Aug 27 '15 at 18:33