- I have a very large array with `N0` elements.
- Each thread will loop over and operate on `m` elements.
- I have fixed `TBP` threads per block.
- CUDA constrains blocks per grid: `BPG < 65535 =: BPG_max`.

Now, let's downsize and consider an array of `N0 = 90` elements with `TBP = 32`.

- I could fire off `3 blocks of 32 threads each looping once (m = 1)`, which means `3 x 32 x 1 = 96` elements could have been operated on, i.e. wastage of 6.
- Or I could fire off `2 blocks of 32 with m = 2`, which means `2 x 32 x 2 = 128` elements could have been operated upon, which is a wastage of 38.

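To make the arithmetic above concrete, here is a minimal host-side sketch of how I am counting wastage for a given `m`. The names `LaunchPlan` and `plan_launch` are hypothetical helpers I made up for illustration (they are not part of CUDA), and the sketch assumes the block count is always rounded up so that the whole array is covered.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper type, for illustration only (not a CUDA type).
struct LaunchPlan {
    std::int64_t blocks;    // blocks per grid (BPG) needed to cover the array
    std::int64_t capacity;  // elements the launch could touch: BPG * TBP * m
    std::int64_t wastage;   // capacity minus the real element count N0
};

// Smallest block count that covers n0 elements when each of the tbp threads
// in a block handles m elements, together with the resulting wastage.
inline LaunchPlan plan_launch(std::int64_t n0, std::int64_t tbp, std::int64_t m) {
    assert(n0 > 0 && tbp > 0 && m > 0);
    const std::int64_t per_block = tbp * m;                        // elements per block
    const std::int64_t blocks = (n0 + per_block - 1) / per_block;  // ceil(n0 / per_block)
    const std::int64_t capacity = blocks * per_block;
    return LaunchPlan{blocks, capacity, capacity - n0};
}
```

With these definitions, `plan_launch(90, 32, 1)` gives 3 blocks and a wastage of 6, and `plan_launch(90, 32, 2)` gives 2 blocks and a wastage of 38, matching the two cases above. The sketch does not check the blocks-per-grid limit; that would have to be compared against `blocks` separately.
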
With large arrays (100MB+) and many loop iterations (10,000+), the factors get bigger and the wastage can become very large. How do I minimize it? That is, I'd like a procedure to optimize the following (where `N` denotes actual work done):