GPGPU: Block size's effect on program performance, why does my program run faster at very specific sizes?

Question

My Cuda program gains a significant performance boost (on average) depending on the size of the blocks & # of blocks; where the total number of "threads" remains the same. (I'm not sure if thread is the right terminology... but I'm going to use it here; where for each kernel the total number of threads is (# of blocks)*(block size)). I made some graphs to illustrate my point.

But first allow me to explain what my algorithm is first, however I'm not sure how relevant it is, because I would imagine this is something that applies to all GPGPU programs.But maybe I am wrong about that.

Basically I go across large arrays that are logically treated as 2D arrays, where each thread adds an element from the array as well as adds the square of that value to another variable and then at the end writes the value to another array, where during each read all the threads are shifted a certain way. Here is my kernel code:

__global__ void MoveoutAndStackCuda(const float* __restrict__ prestackTraces, float* __restrict__ stackTracesOut,
  float* __restrict__ powerTracesOut, const int* __restrict__ sampleShift,
  const unsigned int samplesPerT, const unsigned int readIns,
  const unsigned int readWidth, const unsigned int defaultOffset) {

  unsigned int globalId = ((blockIdx.x * blockDim.x) + threadIdx.x); // Global ID of this thread, starting from 0 to total # of threads

  unsigned int jobNum = (globalId / readWidth); // Which array within the overall program this thread works on
  unsigned int readIndex = (globalId % readWidth) + defaultOffset; // Which sample within the array this thread works on

  globalId = (jobNum * samplesPerT) + readIndex;  // Incorperate default offset (since default offset will also be the offset of
                                                  // index we will be writing to), actual globalID only needed for above two variables.

  float stackF = 0.0;
  float powerF = 0.0;

  for (unsigned int x = 0; x < readIns; x++) {

    unsigned int indexRead = x + (jobNum * readIns);

    float value = prestackTraces[readIndex + (x * samplesPerT) + sampleShift[indexRead]];

    stackF += value;
    powerF += (value * value);
  }

  stackTracesOut[globalId] = stackF;
  powerTracesOut[globalId] = powerF;
}

Now for the meat of this post, when calling this code

  MoveoutAndStackCuda<<<threadGroups, threadsPerGroup>>>(*prestackTracesCudaPtr,
    *stackTracesOutCudaPtr, *powerTracesOutCudaPtr,
    *sampleShiftCudaPtr, samplesPerT, readIns,
    readWidth, defaultOffset);

All I did was differ threadGroups and threadsPerGroup inside the <<<>>>, where threadGroups.x * threadsPerGroup.x remains the same. (As stated before this is a 1 dimensional problem).

I incremented the block size by 64 until I reached 1024. I expected no change, because I figured as long as block size is greater than 32, which I believe is the # of ALUs in a core, it would run as fast as possible. Take a look at this graph I made:

For this specific size the total number of threads is 5000 * 5120, so for example if the block size is 64 then there are ((5000 * 5120) / 64) blocks. For some reason there is a significant performance boost at a block size of 896, 768, and 512. Why?

I know this looks random, but each point in this graph is 50 test averaged together!

Here is another graph, this time for when the total # of threads will be (8000 * 8192). This time the boost is at 768 and 960.

Yet another example, this time for a job that is smaller than the other two problems (total threads is 2000 * 2048):

In fact here is an album I made of these graphs, with each graph representing a different size of the problem: graph album.

I am running this one a Quadro M5000, which has 2048 Cuda Cores. I believe each Cuda Core has 32 ALUs, so I presume that total # of computations that could be happening at any given time is (2048 * 32)?

So what explains these magic numbers? I figured it might be the total # of threads divided by the # of cuda cores, or divided by (2048 * 32), but so far I have found no correlation with anything that stretches across all of the graphs in my album. Is there another test I could do to help narrow things down? I want to find out what block size to run this program at for the best results.

Also I didn't include it, but I also did a test where block size decreased by 1 from 32 and things got exponentially slower. This makes sense to me since then we have less local threads per group than ALUs in a given multiprocessor.

I haven't analyzed your question in detail, but I would encourage you to use CUDA's profiling tools (like NVIDIA Visual Profiler), as they are quite good. They can tell you exactly what part of your program is slower/faster for the various cases you've considered. — Jason R, Dec 31 '16 at 03:09

score 9 · Answer 1 · edited May 23 '17 at 12:16

Based on this statement:

I incremented the block size by 64 until I reached 1024. I expected no change, because I figured as long as block size is greater than 32, which I believe is the # of ALUs in a core, it would run as fast as possible.

I would say there is an important concept about GPUs that you probably are not aware of: GPUs are a "latency hiding" machine. They hide latency principally by having lots of available (parallel) work exposed to them. This can be roughly summarized as "lots of threads". It is a completely wrong idea with GPUs that once you have enough threads to cover the number of "cores" or execution units, that that is sufficient. It is not.

As a (beginner) GPU programmer you should mostly ignore the number of cores in your GPU. You want lots of threads. Both at the kernel level, and per GPU SM.

In general, as you provide more threads to each SM, then the ability of the GPU to hide latency while doing other useful work increases. This explains the general trend in all of your graphs that the slope is generally downward from left to right (i.e. average performance increases, generally, as you provide more exposed work to each SM).

This doesn't address the peaks and valleys, however. GPUs have a large number of architectural issues which may affect performance. I won't offer a complete treatment here. But let's take one case:

Why does performance in the first graph increase up to 512 threads, then suddenly decrease at 576 threads?

This is most likely an occupancy effect. An SM in your GPU has a maximum complement of 2048 threads. Based on the previous discussion, the SM will have the maximum ability to hide latency (and therefore generally deliver the maximum average performance) when we maximize the thread complement, up to 2048.

For a block size of 512 threads, we can fit exactly 4 of these threadblocks on an SM, and it will then have a complement of 2048 threads from which to choose for work and latency hiding.

But when you change the threadblock size to 576, 4*576 > 2048, so we can no longer fit 4 threadblocks on each SM. This means, for that kernel configuration, that each SM will run with 3 threadblocks, i.e. 1728 threads out of the 2048 possible. This is actually worse, from the SM's point of view, than the previous case which allowed 2048 threads, and so it may be an indicator of why performance decreases going from 512 to 576 threads (just as it increased from 448 to 512, which involves a similar change in instantaneous occupancy).

As a result of the above, it's not uncommon to see performance charts like the one you have shown, when we vary the threads per block.

Other occupancy limiters that have a granular (quantized) effect can result in similar peaky behavior in the performance graph. For example, there's not quite enough information in your question to surmise about registers-per-thread usage, but a limiter to occupancy may be registers used per thread. As you vary the thread complement, you will find that you may similarly have a changing complement of blocks resident per SM, which can result in varying occupancy (both up and down) and thus varying performance.

To delve into this further, I would suggest you spend some time learning about occupancy, registers per thread, and performance analysis capabilities of the various profilers. There is lots and lots of information on these topics available already; google is your friend, and note the question/answers linked in the comments above, as a reasonable starting point. To fully study occupancy and its effect on performance requires more information than what you've given here. It requires basically a MCVE and also the exact compile command line, as well as the platform you are running on, and the CUDA version. The compiler's registers-per-thread usage is affected by all of these things, most of which you haven't provided.

As a rule of thumb, the "optimal" total number of threads running on a modern GPU is *tens of thousands*. For memory-bound workloads it is ideally as high as #SMs * (max threads/SM) * 20. The factor of 20 is an empirically determined multiplier, it basically ensures that enough work is queued up at various places around the GPU that performance impact of stalls is minimized as there is always other work ready to be processed. — njuffa, Dec 31 '16 at 19:48
Thank you for your post. A lot of it was very useful. However your explanation on why, for example, 512 is fast because the cuda core count is divisible by it, doesn't explain the cases were 576 is significantly faster than 512. I will up vote your post but delay giving it the answer just in case others have a useful answer. — danglingPointer, Jan 02 '17 at 18:00
Indeed, my explanation does not explain every up and down in each graph. There is likely not one single explanation that covers them all, however **occupancy** is likely a factor in many or most. Fully understanding the occupancy effects at each data point requires considerably more information than what you have provided, as I stated already. — Robert Crovella, Jan 02 '17 at 18:05

GPGPU: Block size's effect on program performance, why does my program run faster at very specific sizes?

1 Answers1