4

Better = faster.

I am asking in general, but consider a case when I have more "workers" than data -- is it better that the last threads of each block remain unused, or is it better that the last blocks of the grid remain unused?
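
For concreteness, here is a minimal sketch of what I mean (the kernel, names, and sizes are made up for illustration): each thread doubles one element, and the extra "workers" past the end of the array simply do nothing.

```cuda
#include <cuda_runtime.h>

// Each thread handles one element; threads past the end do nothing.
__global__ void doubleElements(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                  // guard for the unused "workers"
        out[i] = 2.0f * in[i];
}

int main()
{
    const int n = 1000;
    float *in, *out;
    cudaMalloc(&in,  n * sizeof(float));   // initialization omitted
    cudaMalloc(&out, n * sizeof(float));

    // Option A: one big block -- 24 threads idle, and only one SM busy.
    doubleElements<<<1, 1024>>>(in, out, n);

    // Option B: smaller blocks -- the same 24 idle threads in the last
    // block, but the 4 blocks can be spread over up to 4 SMs.
    doubleElements<<<4, 256>>>(in, out, n);

    cudaDeviceSynchronize();
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```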

greenoldman
  • You can't give a general answer for an optimal kernel launch configuration. It always depends on the use of registers, shared memory, etc. You can use the [CUDA occupancy calculator](http://developer.download.nvidia.com/compute/cuda/CUDA_Occupancy_calculator.xls) to see how a kernel configuration will use the capacity of a GPU. – hubs Feb 10 '13 at 18:03
  • Ok, I understand your point, but I would also like to understand what is involved, so that I finally know what to tweak. So for a beginner like me, let's say I have an input array and I have to produce an output array where each element is multiplied by 2 (as in the sketch above). – greenoldman Feb 10 '13 at 19:32

1 Answer

4

You should remember that at most 8 blocks can run concurrently on a single SM (streaming multiprocessor). You can think of SMs as something like CPU cores. Each block can currently run up to 1024 threads, which are comparable to logical cores, for example the ones that current Intel Core i-series CPUs have. Whether or not you use all of those threads, the rest of them are wasted, because you are not using them and no one else can. So, for example, if you have 8 SMs on your GPU, up to 64 blocks can be resident at once, but you can't assign 1024 threads to each of them, because there is also a limit on the total number of threads per SM, for example 2048. (Edited based on the information that hubs gave.)
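
You don't have to guess these limits; the runtime can report most of them. A small sketch for device 0 (the numbers vary by GPU; the blocks-per-SM limit itself comes from the programming guide and the occupancy calculator rather than this struct):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0

    printf("SMs (multiprocessors): %d\n", prop.multiProcessorCount);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Max threads per SM:    %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Warp size:             %d\n", prop.warpSize);
    return 0;
}
```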

Soroosh Bateni
  • Each SM can run up to 8 blocks in parallel, provided the use of registers and shared memory doesn't impose a lower limit. You would waste a lot of the GPU's compute power if you only ran one block when you could run 8 blocks with 512 threads each instead. – hubs Feb 10 '13 at 18:10
  • Yes, I see your point. Comparing 4 blocks with 1024 threads against 8 blocks with 512 threads, assuming they are running on a single SM: which one is faster, or are they the same? – Soroosh Bateni Feb 10 '13 at 18:19
  • Thank you (+1), but I would like to understand it fully. Does it mean that if I am all for raw power and nothing else matters, I should do the trick and launch all possible blocks/threads, but in such a way that I use only thread `0` in each `k mod 8 = 0` block, leaving the rest idle? If I understand correctly, I could achieve max work per SM and all SMs would be involved. – greenoldman Feb 10 '13 at 19:30
  • Not 0 threads, at least one thread. According to CUDA by Example, using threads has the advantage of shared memory and probably some other things, like the ability to synchronize between them, but technically yes, the product of threads and blocks is what matters on a single SM. Edit: also, nothing will be idle; the total number of threads you can run on a single SM in order to fully utilize it is 2048, so you can have 2 blocks with 1024 threads each, or 4 blocks with 512 each, and so on. – Soroosh Bateni Feb 10 '13 at 20:42
  • Ok, so I conclude (and I hope this is right) that I should divide the job among SMs, and not just among threads per se, because otherwise I could end up with one SM running all the threads. – greenoldman Feb 10 '13 at 21:27
  • Yes, by distributing your work among different SMs you will get better performance. In fact, the third-generation SM contains 32 CUDA cores, so technically it can run 32 CUDA instructions in a single cycle, so you are better off distributing your work among several SMs. To do that you have to take care of the numbers of threads and blocks; for example, the number of threads should be a multiple of 32 (see the configuration sketch after these comments). You can read further [here](http://stackoverflow.com/a/10467342/783868), where Greg Smith describes it a lot better than I can. Also read the references he provided to get an even better view. – Soroosh Bateni Feb 10 '13 at 21:58
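
To tie the comments together, here is one possible helper along the lines discussed. The heuristic (256 threads per block, at least one block per SM) is illustrative, not the single right answer:

```cuda
#include <cuda_runtime.h>

// Pick a simple launch configuration for n independent elements:
// threads per block a multiple of the warp size (32), and at least
// as many blocks as SMs so no multiprocessor sits idle. The kernel
// must still bounds-check, since this may launch extra threads.
void pickLaunchConfig(int n, int *blocks, int *threads)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    *threads = 256;                            // multiple of 32
    *blocks  = (n + *threads - 1) / *threads;  // enough to cover n
    if (*blocks < prop.multiProcessorCount)
        *blocks = prop.multiProcessorCount;    // spread over all SMs
}
```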