
I've read these two pages: Understanding Streaming Multiprocessors (SM) and Streaming Processors (SP), and How can concurrent blocks run on a single GPU streaming multiprocessor? But I am still confused about the hardware structure.

  1. Is an SM a SIMT (single instruction, multiple threads) structure?

Suppose there are 8 SPs in a given SM. If different blocks can be executed on the same SM, these SPs will be running different instructions. So my understanding is: the SM issues different instructions to different SPs.
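To make my question concrete, here is a minimal sketch (my own illustration, not taken from the linked pages) of a kernel where lanes of one warp take different branches. My understanding of SIMT is that the hardware executes both branch paths one after the other, masking off the inactive lanes:

```cuda
#include <cstdio>

__global__ void divergent(int *out) {
    int i = threadIdx.x;
    // Lanes of the same warp take different sides of this branch.
    // Under SIMT, the warp executes both paths serially, with the
    // lanes not on the current path masked off.
    if (i % 2 == 0)
        out[i] = i * 2;      // even lanes
    else
        out[i] = i + 100;    // odd lanes
}

int main() {
    int *d_out, h_out[32];
    cudaMalloc(&d_out, 32 * sizeof(int));
    divergent<<<1, 32>>>(d_out);  // one block of one warp
    cudaMemcpy(h_out, d_out, 32 * sizeof(int), cudaMemcpyDeviceToHost);
    printf("out[0]=%d out[1]=%d\n", h_out[0], h_out[1]);
    cudaFree(d_out);
    return 0;
}
```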

  2. Are the threads in the same warp executed simultaneously?

Suppose there are 8 SPs in a given SM, and a warp is resident on that SM. Since several warps may be resident at once, suppose only 4 SPs are serving this warp. There are 32 threads in the warp, but only 4 SPs to run them, so will it actually take 8 cycles to run this warp? I have also heard people say that all the threads in a warp run serially. I don't know which is true...
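As a side note, the warp size and SM count I am reasoning about can be queried through the CUDA runtime. This is a minimal sketch, assuming device 0 is the GPU in question:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // warpSize is 32 on all current NVIDIA architectures;
    // the SP (CUDA core) count per SM is architecture-dependent.
    printf("SMs:                %d\n", prop.multiProcessorCount);
    printf("warp size:          %d\n", prop.warpSize);
    printf("max threads per SM: %d\n", prop.maxThreadsPerMultiProcessor);
    return 0;
}
```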

飞 Soar
  • similar question [here](http://stackoverflow.com/questions/12212003/how-concurrent-blocks-can-run-a-single-gpu-streaming-multiprocessor) and [here](http://stackoverflow.com/questions/20771358/how-do-a-sm-in-cuda-run-multiple-blocks-simultaneously) – Robert Crovella Jul 22 '16 at 02:30

1 Answer


Several blocks can run on a single SM. According to this presentation (slide 19 - thanks @RobertCrovella), blocks from different kernels can be resident on the same SM. When the blocks come from the same kernel, the block index can be seen as a supplemental level of thread index, up to some limit (different for each architecture and kernel). However, I have never observed two different streams running at the same time on a given SM.

Depending on the architecture, a single warp instruction may be executed by the SPs in a single cycle, hence simultaneously. However, this can only hold for an SM with 32 SPs available for that instruction type, so not in double precision, for example. Also, there is no guarantee of this. Finally, we have seen configurations where some threads with a higher warp index were running before lower-indexed ones. Apart from synchronization functions and similar tools, there is no hard rule on how the instruction scheduler behaves.
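A rough way to observe this yourself is to record a per-thread timestamp with `clock64()` and compare values across warps. This is only a sketch; the output depends entirely on your hardware and the scheduler, and proves nothing about ordering guarantees:

```cuda
#include <cstdio>

// Each thread records its start timestamp. Threads of the same warp
// typically report identical (or near-identical) values, while
// different warps may start in any order.
__global__ void stamp(long long *t) {
    t[blockIdx.x * blockDim.x + threadIdx.x] = clock64();
}

int main() {
    const int n = 64;  // two warps in one block
    long long *d_t, h_t[n];
    cudaMalloc(&d_t, n * sizeof(long long));
    stamp<<<1, n>>>(d_t);
    cudaMemcpy(h_t, d_t, n * sizeof(long long), cudaMemcpyDeviceToHost);
    for (int w = 0; w < n / 32; ++w)
        printf("warp %d, lane 0 stamp: %lld\n", w, h_t[w * 32]);
    cudaFree(d_t);
    return 0;
}
```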

Florent DUGUET
    Take a look at slide 19 [here](http://on-demand.gputechconf.com/gtc/2013/presentations/S3466-Programming-Guidelines-GPU-Architecture.pdf) - blocks from separate kernels can be resident on the same SM. It's probably more or less as hard to demonstrate as concurrent kernels. – Robert Crovella Jul 22 '16 at 02:33
  • @RobertCrovella, thanks for this update, I will change the answer according to your comment about different kernels. However, can you point me to an example illustrating this? As said, I have never observed it. – Florent DUGUET Jul 22 '16 at 08:01
  • It's possible to demonstrate without much effort. If you want to ask a new question here on SO, requesting such a demonstration, I'll provide what I have. I'm not going to try and cover it in the space of a comment. No, I can't point you to a ready-made example somewhere on the web. – Robert Crovella Jul 23 '16 at 16:40