My CUDA program's average performance changes significantly depending on the block size and the number of blocks, even when the total number of "threads" stays the same. (I'm not sure if thread is the right terminology, but I'm going to use it here; by total number of threads I mean (# of blocks) * (block size) for each kernel launch.) I made some graphs to illustrate my point.
But first, allow me to explain my algorithm, though I'm not sure how relevant it is, because I would imagine this is something that applies to all GPGPU programs. But maybe I am wrong about that.
Basically I go across large arrays that are logically treated as 2D arrays. Each thread sums elements from the array into one accumulator and the squares of those values into another, then at the end writes both results to output arrays; on each read, every thread's index is shifted a certain way. Here is my kernel code:
__global__ void MoveoutAndStackCuda(const float* __restrict__ prestackTraces, float* __restrict__ stackTracesOut,
    float* __restrict__ powerTracesOut, const int* __restrict__ sampleShift,
    const unsigned int samplesPerT, const unsigned int readIns,
    const unsigned int readWidth, const unsigned int defaultOffset) {

  // Global ID of this thread, from 0 to (total # of threads - 1)
  unsigned int globalId = ((blockIdx.x * blockDim.x) + threadIdx.x);
  unsigned int jobNum = (globalId / readWidth);                    // Which array within the overall program this thread works on
  unsigned int readIndex = (globalId % readWidth) + defaultOffset; // Which sample within that array this thread works on

  // Incorporate the default offset (it is also the offset of the index we write to);
  // the actual global ID was only needed for the two variables above.
  globalId = (jobNum * samplesPerT) + readIndex;

  float stackF = 0.0f;
  float powerF = 0.0f;

  for (unsigned int x = 0; x < readIns; x++) {
    unsigned int indexRead = x + (jobNum * readIns);
    float value = prestackTraces[readIndex + (x * samplesPerT) + sampleShift[indexRead]];
    stackF += value;
    powerF += (value * value);
  }

  stackTracesOut[globalId] = stackF;
  powerTracesOut[globalId] = powerF;
}
Now for the meat of this post. Here is how I call the kernel:
MoveoutAndStackCuda<<<threadGroups, threadsPerGroup>>>(*prestackTracesCudaPtr,
*stackTracesOutCudaPtr, *powerTracesOutCudaPtr,
*sampleShiftCudaPtr, samplesPerT, readIns,
readWidth, defaultOffset);
All I did was vary threadGroups and threadsPerGroup inside the <<<>>>, keeping threadGroups.x * threadsPerGroup.x constant. (As stated before, this is a one-dimensional problem.)
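For reference, here is roughly how I timed each configuration. This is a simplified sketch rather than my exact harness: the loop bounds match my sweep, the timing uses cudaEvent calls, and it assumes the block size divides the total thread count evenly (my real sizes don't always, so take it as illustrative):
const unsigned int totalThreads = 5000 * 5120; // one of the problem sizes below

for (unsigned int blockSize = 64; blockSize <= 1024; blockSize += 64) {
  dim3 threadsPerGroup(blockSize);
  dim3 threadGroups(totalThreads / blockSize); // assumes an even division

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  cudaEventRecord(start);
  MoveoutAndStackCuda<<<threadGroups, threadsPerGroup>>>(*prestackTracesCudaPtr,
      *stackTracesOutCudaPtr, *powerTracesOutCudaPtr,
      *sampleShiftCudaPtr, samplesPerT, readIns,
      readWidth, defaultOffset);
  cudaEventRecord(stop);

  cudaEventSynchronize(stop);
  float milliseconds = 0.0f;
  cudaEventElapsedTime(&milliseconds, start, stop);
  printf("block size %4u: %f ms\n", blockSize, milliseconds);

  cudaEventDestroy(start);
  cudaEventDestroy(stop);
}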
I incremented the block size in steps of 64 until I reached 1024. I expected no change, because I figured that as long as the block size is greater than 32, which I believe is the # of ALUs in a core, it would run as fast as possible. Take a look at this graph I made:
For this specific size the total number of threads is 5000 * 5120, so for example with a block size of 64 there are ((5000 * 5120) / 64) blocks. For some reason there is a significant performance boost at block sizes of 896, 768, and 512. Why?
I know this looks random, but each point in this graph is 50 tests averaged together!
Here is another graph, this time with a total of (8000 * 8192) threads. This time the boosts are at 768 and 960.
Yet another example, this time for a job that is smaller than the other two problems (total threads is 2000 * 2048):
In fact here is an album I made of these graphs, with each graph representing a different size of the problem: graph album.
I am running this on a Quadro M5000, which has 2048 CUDA cores. I believe each CUDA core has 32 ALUs, so I presume the total # of computations that could be happening at any given time is (2048 * 32)?
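To double-check those numbers, I can query the device properties directly. A minimal sketch using the standard cudaGetDeviceProperties fields, nothing specific to my program:
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0); // device 0

  printf("Device: %s (compute %d.%d)\n", prop.name, prop.major, prop.minor);
  printf("Multiprocessors (SMs):  %d\n", prop.multiProcessorCount);
  printf("Warp size:              %d\n", prop.warpSize);
  printf("Max threads per block:  %d\n", prop.maxThreadsPerBlock);
  printf("Max threads per SM:     %d\n", prop.maxThreadsPerMultiProcessor);
  return 0;
}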
So what explains these magic numbers? I figured it might be the total # of threads divided by the # of CUDA cores, or divided by (2048 * 32), but so far I have found no correlation that stretches across all of the graphs in my album. Is there another test I could do to help narrow things down? I want to find out what block size to run this program at for the best results.
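One test I have been considering is asking the occupancy API what it thinks of each block size, in case the magic numbers line up with theoretical occupancy. A sketch, assuming the cudaOccupancy* runtime functions behave as documented (I haven't checked this against my measurements yet):
// Ask the runtime for its suggested block size for this specific kernel...
int minGridSize = 0, suggestedBlock = 0;
cudaOccupancyMaxPotentialBlockSize(&minGridSize, &suggestedBlock, MoveoutAndStackCuda, 0, 0);
printf("Suggested block size: %d\n", suggestedBlock);

// ...then compute theoretical occupancy for each block size I measured.
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
for (int blockSize = 64; blockSize <= 1024; blockSize += 64) {
  int blocksPerSM = 0;
  cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, MoveoutAndStackCuda, blockSize, 0);
  float occupancy = (float)(blocksPerSM * blockSize) / prop.maxThreadsPerMultiProcessor;
  printf("block size %4d: %d blocks/SM, occupancy %.2f\n", blockSize, blocksPerSM, occupancy);
}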
Also, I didn't include it here, but I did one more test where the block size decreased by 1 starting from 32, and things got exponentially slower. That makes sense to me, since then each group has fewer local threads than a given multiprocessor has ALUs.