  • I have a very large array with N0 elements.
  • Each thread will loop over and operate on m elements.
  • I have fixed TBP threads per block.
  • CUDA constrains the number of blocks per grid: BPG <= 65535 =: BPG_max.

Now, let's downsize and consider an array of N0 = 90 elements with TBP = 32.

  • I could fire off 3 blocks of 32 threads, each looping once (m = 1), which covers 3 x 32 x 1 = 96 elements - i.e. a wastage of 6.
  • Or I could fire off 2 blocks of 32 with m = 2, which covers 2 x 32 x 2 = 128 elements - a wastage of 38.

With large arrays (100 MB+) and many loop iterations (10,000+), these factors grow, and the wastage can become very large. So how do I minimize it? That is, I'd like a procedure to optimize the following (where N denotes the work actually done):

    minimize    N - N0
    subject to  N = BPG x TBP x m,   N >= N0,   BPG <= BPG_max
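A closed-form choice follows from the constraints: take the smallest m for which the required block count fits under BPG_max, then the smallest BPG that covers N0. Below is a minimal host-side sketch of that procedure; `choose_launch` is a name made up for illustration, not a CUDA API, and this greedy choice keeps the overshoot below TBP x m without being a proven global minimizer.

```cuda
#include <cstdio>

// Hypothetical helper: pick m and BPG so that N = BPG * TBP * m
// covers N0 with small overshoot, under the constraint BPG <= BPG_max.
void choose_launch(long long N0, int TBP, int BPG_max,
                   int *m_out, int *BPG_out)
{
    // Smallest m that makes the block count feasible:
    // ceil(N0 / (TBP * m)) must not exceed BPG_max.
    long long cap = (long long)TBP * BPG_max;
    long long m = (N0 + cap - 1) / cap;                // ceil(N0 / cap)
    // With m fixed, the smallest covering block count minimizes N - N0.
    long long perBlock = (long long)TBP * m;
    long long BPG = (N0 + perBlock - 1) / perBlock;    // ceil(N0 / perBlock)
    *m_out = (int)m;
    *BPG_out = (int)BPG;
}

int main()
{
    int m, BPG;
    choose_launch(90, 32, 65535, &m, &BPG);
    // Prints m = 1, BPG = 3: N = 3 x 32 x 1 = 96, wastage 6,
    // matching the first option in the example above.
    printf("m = %d, BPG = %d\n", m, BPG);
    return 0;
}
```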

mchen
  • If this is just a parallel map, just set `m` to `N / (block_size * max_grid_size)`. Otherwise, it's very difficult to know the best way to decompose the problem without full blown autotuning. – Jared Hoberock May 09 '13 at 00:35
  • You might be interested in [this answer](http://stackoverflow.com/a/5810780/681865) to a rather similar question. – talonmies May 09 '13 at 10:11

3 Answers


I would not be worried about "wasted" threads - GPU threads are lightweight.

You might actually want to increase the block size, as this could increase the occupancy of your GPU. Note that an SMX (in the GeForce 600 series) can only execute 16 concurrent blocks. Making blocks larger would allow you to schedule more threads to hide memory access latency.
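Newer CUDA releases (6.5 and later, so after this answer was written) expose an occupancy API that makes the block-size comparison concrete. A minimal sketch, with `myKernel` standing in for the real kernel:

```cuda
#include <cstdio>

// Stand-in for the real kernel; only its resource usage matters here.
__global__ void myKernel(float *data, int n) { }

int main()
{
    // Resident blocks per multiprocessor for two candidate block sizes.
    const int sizes[] = {128, 256};
    for (int i = 0; i < 2; ++i) {
        int blocks = 0;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocks, myKernel, sizes[i], 0 /* dynamic shared memory */);
        printf("TBP = %3d -> %d resident blocks (%d threads) per SM\n",
               sizes[i], blocks, blocks * sizes[i]);
    }
    return 0;
}
```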

Eugene

Depending on what you are doing in your kernels, the answer may or may not be as simple as the optimization problem you cite. E.g. if you are going to have issues with latency, threads waiting for one another to complete, etc., then there are more issues to consider.

This site has some great heuristics. Some general highlights:

Choosing Blocks Per Grid

  • Blocks per grid should be >= the number of multiprocessors (queryable at run time; see the sketch after this list).
  • The more you use __syncthreads() in your kernels, the more blocks you want, so that one block can run while another waits at the sync.
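Rather than hard-coding limits such as 65535, these figures can be queried at run time. A minimal sketch, assuming device 0:

```cuda
#include <cstdio>

int main()
{
    // Query the limits the heuristics above refer to.
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // device 0, assumed for illustration
    printf("multiprocessors:       %d\n", prop.multiProcessorCount);
    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("max grid size (x):     %d\n", prop.maxGridSize[0]);
    printf("warp size:             %d\n", prop.warpSize);
    return 0;
}
```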

Choosing Threads Per Block

  • Choose thread counts in multiples of the warp size (generally 32).

  • It is generally good to choose a thread count that evenly divides the hardware's maximum number of threads per multiprocessor. E.g. with a maximum of 768 threads, 256 threads per block tends to be better than 512, because three blocks of 256 can be resident on a multiprocessor at once (3 x 256 = 768), whereas only one block of 512 can (512 does not divide 768); this is worked through in the sketch after this list.

  • Think about whether your threads will share data (e.g. via shared memory) and, if so, how many threads you'll want sharing it.
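The 768-thread example in the second bullet works out as follows; this host-side sketch counts only the thread limit (registers and shared memory also cap residency):

```cuda
#include <cstdio>

int main()
{
    const int maxThreadsPerSM = 768;          // figure from the bullet above
    const int candidates[] = {512, 256};
    for (int i = 0; i < 2; ++i) {
        int tbp = candidates[i];
        int blocks = maxThreadsPerSM / tbp;   // resident blocks (integer division)
        printf("TBP = %d -> %d resident block(s), %d of %d threads occupied\n",
               tbp, blocks, blocks * tbp, maxThreadsPerSM);
    }
    // 512 -> 1 block, 512 of 768 threads; 256 -> 3 blocks, 768 of 768.
    return 0;
}
```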

Michael Ohlrogge

That is in fact quite a complicated problem, and I doubt there is an O(1) solution to it. But I'm guessing you can afford some linear time on the CPU to compute that minimum.

This is Wolfram Alpha's opinion.
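A linear scan of the kind suggested here is straightforward. A sketch reusing the question's names (`m_max` is an arbitrary cap introduced for illustration; note that whenever ceil(N0 / TBP) <= BPG_max, m = 1 already achieves the minimum, since every achievable N is a multiple of TBP):

```cuda
#include <cstdio>

// Exhaustive scan over m: for each m the best BPG is the smallest one
// covering N0; keep the pair with the least overshoot.
void search_min_waste(long long N0, int TBP, int BPG_max, int m_max)
{
    long long bestWaste = -1;
    int bestM = 0, bestBPG = 0;
    for (int m = 1; m <= m_max; ++m) {
        long long perBlock = (long long)TBP * m;          // elements per block
        long long BPG = (N0 + perBlock - 1) / perBlock;   // ceil(N0 / perBlock)
        if (BPG > BPG_max) continue;                      // grid too large
        long long waste = BPG * perBlock - N0;
        if (bestWaste < 0 || waste < bestWaste) {
            bestWaste = waste; bestM = m; bestBPG = (int)BPG;
        }
    }
    printf("m = %d, BPG = %d, waste = %lld\n", bestM, bestBPG, bestWaste);
}

int main()
{
    search_min_waste(90, 32, 65535, 64);   // prints m = 1, BPG = 3, waste = 6
    return 0;
}
```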

  • "What you should be minimizing is the work that each thread is doing"? That is about the single worst piece of CUDA optimisation advice I have *ever* seen.... – talonmies May 09 '13 at 09:25
  • Interesting; the last time I checked, the idea behind parallel computing was to have many processors doing little work individually, not the other way around - that would be a CPU. You make them do a lot of work when you don't have a choice, but when you do have a choice it is not good practice. If you make bald claims as you did now, give some arguments. –  May 09 '13 at 09:33
  • OK - try this [slide deck](http://www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf) for starters. The CUDA block scheduling process is lightweight, but it isn't free. Further, there is a very large difference between memory throughput and arithmetic throughput on GPUs. On a current GK104 card there is about 3000 Gflops single precision peak throughput (assuming a pure fmad instruction stream) and about 50 Gfloat/s of peak memory bandwidth. That means a given thread has to do *at least 30 fmad* per 32 bit word it reads or writes to be compute bound. – talonmies May 09 '13 at 09:44
  • It has been empirically demonstrated over and over again that "resident" threads which process many parallel units of work during their lifetime and instruction level parallelism are much more effective at achieving high throughput on CUDA devices than the alternative. All of the highest performing codes for common things like dense linear algebra, FFTs, etc. have evolved towards this model of programming the GPU. – talonmies May 09 '13 at 09:49
  • Hmm, this is very interesting, and I must admit pretty counter-intuitive. I've used CUDA only with OpenGL for physics and ray-tracing, and this lower-occupancy, more-work-per-thread philosophy could help me with optimisation. So, my mistake; I edited my answer. Cheers –  May 09 '13 at 10:20
  • @Aloalo: you should also read the other slides of Vasily Volkov, like the one on [parallel loops unrolling](http://www.cs.berkeley.edu/~volkov/volkov11-unrolling.pdf). – BenC May 09 '13 at 10:20
  • @talonmies - so you are saying launch more asynchronous kernels with independent jobs rather than throwing lots of threads at one kernel at a time? – mchen May 16 '13 at 00:37
  • @MiloChen: No, I am saying nothing like that at all. – talonmies May 16 '13 at 04:55
  • Then clearly I am misunderstanding "'resident' threads which process many parallel units of work during their lifetime and instruction level parallelism are much more effective at achieving high throughput on CUDA devices than the alternative". Could you clarify? – mchen May 16 '13 at 08:04
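For readers following this exchange: the pattern talonmies describes is commonly written as a grid-stride loop (see the Volkov slides linked above). A minimal sketch of that interpretation, not talonmies' own code:

```cuda
#include <cstdio>

// Grid-stride loop: a fixed pool of "resident" threads, each processing
// many elements over its lifetime, instead of one thread per element.
__global__ void scale(float *x, int n, float a)
{
    int stride = blockDim.x * gridDim.x;   // total threads in the grid
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        x[i] = a * x[i];
}

int main()
{
    const int n = 1 << 20;
    float *x = 0;
    cudaMalloc(&x, n * sizeof(float));
    // A fixed, modest launch covers any n, so the original wastage
    // calculation largely disappears.
    scale<<<128, 256>>>(x, n, 2.0f);
    cudaDeviceSynchronize();
    cudaFree(x);
    return 0;
}
```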