
In a single GPU such as the P100 there are 56 SMs (streaming multiprocessors), and different SMs may have little correlation with each other. I would like to know how application performance varies with the number of SMs used. So is there any way to disable some SMs for a given GPU? I know CPUs offer corresponding mechanisms, but I haven't found a good one for GPUs yet. Thanks!

foxspy
  • The simple idea is easy to understand: for example, if we implement a matrix multiplication on the GPU, the task mapping is handled by the kernel and the CUDA runtime, and we don't care about the number of SMs. Now I would like to know: if we run the same matrix multiplication on 5, 10, 15, 20, 25, 30... SMs of a given GPU, how long will the application take? Will application performance rise with the number of SMs (device compute power) used? – foxspy Dec 25 '17 at 13:48
  • You could approximate the effect by replacing the block scheduler with your own implementation based on atomic operations, and then not scheduling blocks on certain SMs based on the SM id from the [PTX %smid special register](http://docs.nvidia.com/cuda/parallel-thread-execution/index.html#special-registers-smid). Note however that this will not perform as well as the built-in scheduler and will use up extra registers. I used to run my own block scheduler on SM 1.x devices, where this was more efficient than the built-in scheduler. – tera Dec 25 '17 at 17:11
  • Thanks for your kind help. I assume this solution requires me to modify my application, but I would like to test the performance transparently: if I modify the application, it is not clear whether the performance change comes from the modification rather than entirely from the limit on compute power. – foxspy Dec 27 '17 at 02:58
  • Can you explain why you'd like to know what the performance is with different SMs? Normally, you use all the SMs. Why would this be interesting for you? – einpoklum Dec 27 '17 at 09:23
    You would have to run the original application, the modified application with all SMs enabled, and the modified application with fewer SMs. Then draw your conclusions from all three results. – tera Dec 27 '17 at 10:21
  • Thinking about it, instead of running your own block scheduler it is probably simpler to block out some SMs by running a separate kernel in another stream, with a few blocks in timed loops that run longer than the actual kernel and that each request the resources of a whole SM. That way you at least don't have to modify the kernel you want to benchmark. – tera Dec 27 '17 at 10:27
  • Thanks for your reply. It seems there is no way to disable SMs with an OS-level solution. I have now run a test with multiple concurrent streams, one of them filled with an endless loop, and the result is acceptable. Thanks! – foxspy Dec 30 '17 at 08:36
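The "occupier kernel in a concurrent stream" test described in the comments above might be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: the kernel name `blocker`, the spin length, and the launch configuration are illustrative, and how many blocks (and how many threads, registers, or how much shared memory per block) are actually needed to keep other work off an SM depends on the device's occupancy limits. Which SMs receive the blocker blocks is also up to the hardware scheduler, not the programmer.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Spin on this SM's clock until roughly spin_cycles have elapsed,
// keeping the block (and hence SM resources it holds) busy.
__global__ void blocker(long long spin_cycles)
{
    long long start = clock64();
    while (clock64() - start < spin_cycles) { }
}

int main()
{
    cudaStream_t block_stream;
    cudaStreamCreate(&block_stream);

    // Illustrative: try to tie up 28 of the P100's 56 SMs for on the order
    // of a second (cycle count assumes a ~1.3 GHz clock). With 1024 threads
    // per block, a block consumes half of a P100 SM's 2048-thread limit;
    // launch 2 blocks per SM, or add shared-memory/register pressure, if a
    // single block does not exclude other work in your measurements.
    int blocks_to_launch = 28;
    blocker<<<blocks_to_launch, 1024, 0, block_stream>>>(1300000000LL);

    // ... launch and time the kernel under test in another stream here ...

    cudaDeviceSynchronize();
    cudaStreamDestroy(block_stream);
    return 0;
}
```

In practice one would profile the kernel under test with and without the blocker running (and with the blocker present but idle) to separate the effect of reduced SM count from stream-concurrency overhead.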

1 Answer


There are no CUDA-provided methods to disable an SM (streaming multiprocessor). With varying degrees of difficulty and behavior, some possibilities exist to approximate this using indirect methods:

  1. Use CUDA MPS, and launch an application that fully "occupies" one or more SMs, by carefully controlling the number of blocks launched and the resource utilization of those blocks. With CUDA MPS, another application can run on the same GPU, and its kernels can run concurrently, provided sufficient care is taken. This may require no direct modification of the application code under test (but an additional application launch is needed, as well as MPS). The occupying kernel's duration will need to be "long", so that it ties up the SMs while the application under test is running.

  2. In your application code, effectively re-create the behavior listed in item 1 above by launching the "dummy" kernel from the same application as the code under test, and have the dummy kernel "occupy" one or more SMs. The application under test can then launch the desired kernel. This should allow for kernel concurrency without MPS.

  3. In your application code, for the kernel under test itself, modify the kernel block scheduling behavior, probably using the smid special register via inline PTX, to cause the application kernel itself to only utilize certain SMs, effectively reducing the total number in use.
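Item 3 can be sketched as follows, using inline PTX to read the `%smid` special register and retiring blocks that land on a disallowed SM. This is a hedged sketch under assumptions: `get_smid`, `kernel_on_subset`, and the "SM id below a threshold" policy are all hypothetical names and choices, and a retired block may simply be replaced by a newly scheduled block on the same SM, so a persistent-blocks or atomic work-distribution scheme (as described in the comments above) is needed for a true restriction.

```cuda
#include <cuda_runtime.h>

// Read the id of the SM this thread is running on, via the
// PTX %smid special register.
__device__ unsigned int get_smid()
{
    unsigned int smid;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
    return smid;
}

__global__ void kernel_on_subset(int max_smid /* , ...real arguments... */)
{
    // Hypothetical policy: only do work on SMs with id < max_smid.
    // Blocks elsewhere exit immediately; note the scheduler may place
    // another block on the same SM, so oversubscribe the grid or use a
    // persistent-blocks scheme to keep the restriction effective.
    if (get_smid() >= max_smid)
        return;

    // ... actual kernel work here ...
}
```

As noted in the comments, this modifies the kernel under test, so results should be compared against the unmodified kernel with all SMs enabled to separate the overhead of the check itself from the effect of fewer SMs.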

Robert Crovella
  • Can anyone share some example codes that follow the above guidelines? It would be very nice for me since I'm looking for a solution to disable/occupy some SMs for testing of a subset of SMs. – Minh Nguyen Mar 16 '21 at 05:51
    [here](https://stackoverflow.com/questions/30361459/what-is-the-behavior-of-thread-block-scheduling-to-specific-sms-after-cuda-kern/30379454#30379454) is an example that has most of the plumbing needed – Robert Crovella Mar 16 '21 at 13:20
  • I will check it out, thank you very much for your help! – Minh Nguyen Mar 17 '21 at 02:47