I am writing a CUDA program for school that uses shared memory. The kernel works correctly, but only with a single block per grid. With more than one block per grid it no longer works.
For example, if I pass in an array of size 10 and launch with 1 block of 10 threads, the function works correctly. If I pass in the same array and launch with 2 blocks of 5 threads each, it no longer works. By "no longer works" I mean that I get back exactly the array I sent in, as if nothing had happened.
__global__ void rankSortShared(int* a, int n)
{
    int threadId = threadIdx.x + blockDim.x * blockIdx.x;
    int x = 0;
    // Make dynamic sized shared array
    // size will be determined from the 3rd parameter in the
    // kernel call 'rankSortShared<<<blocksPerGrid, threadsPerBlock, *(size of shared mem)*>>>()'
    extern __shared__ int b[];
    // copy data from incoming array to shared array
    // then sync the threads so all threads have put their
    // values into the shared array
    b[threadId] = a[threadId];
    __syncthreads();
    // now use shared array for faster lookups
    for (int j = 0; j < n; j++)
    {
        // handle duplicate values
        if (b[j] < b[threadId] || (b[threadId] == b[j] && j < threadId))
        {
            x++;
        }
    }
    // put data back into array to be transferred back to CPU
    a[x] = b[threadId];
}
For the life of me I cannot figure out why. Is there a special way to use shared memory across multiple blocks?
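For context on the question above: shared memory in CUDA is private to each block, so every block gets its own copy of `b` and cannot see what another block stored. In the kernel above, with 2 blocks of 5 threads, each block fills only 5 of the 10 slots that the loop later reads. Whether the second block's indices 5 through 9 are even in bounds depends on the shared-memory size passed at launch, which is not shown. The following is a minimal sketch of one possible restructuring, not necessarily the intended solution: each block loads all `n` elements into its own shared copy, and results go to a separate output buffer so that no block overwrites input another block is still reading. The names `rankSortSharedPerBlock`, `rankSortOnDevice`, and `out` are hypothetical, and the sketch assumes `n` ints fit in one block's shared memory.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Sketch: every block keeps its own full copy of the input in shared memory.
// 'out' is a hypothetical second buffer, so the input is never overwritten.
__global__ void rankSortSharedPerBlock(const int* in, int* out, int n)
{
    extern __shared__ int b[];  // n ints per block, sized at launch

    // Cooperative load: the threads of this block stride over all n elements.
    for (int i = threadIdx.x; i < n; i += blockDim.x)
    {
        b[i] = in[i];
    }
    __syncthreads();  // only synchronizes threads within this block

    int threadId = threadIdx.x + blockDim.x * blockIdx.x;
    if (threadId >= n)
    {
        return;  // safe: the barrier has already been passed
    }

    // Rank = number of elements that must come before this one.
    int x = 0;
    for (int j = 0; j < n; j++)
    {
        if (b[j] < b[threadId] || (b[j] == b[threadId] && j < threadId))
        {
            x++;
        }
    }
    out[x] = b[threadId];
}

// Hypothetical host helper: copies in, launches, copies the result back.
cudaError_t rankSortOnDevice(const int* h_in, int* h_out, int n,
                             int blocksPerGrid, int threadsPerBlock)
{
    int* d_in = nullptr;
    int* d_out = nullptr;
    size_t bytes = n * sizeof(int);

    cudaError_t err = cudaMalloc(&d_in, bytes);
    if (err != cudaSuccess) return err;
    err = cudaMalloc(&d_out, bytes);
    if (err != cudaSuccess) { cudaFree(d_in); return err; }

    err = cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess)
    {
        // Third launch parameter: shared memory bytes PER BLOCK (all n ints).
        rankSortSharedPerBlock<<<blocksPerGrid, threadsPerBlock, bytes>>>(d_in, d_out, n);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err == cudaSuccess) err = cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_out);
    return err;
}

int main()
{
    const int n = 10;
    int h_in[n] = {7, 3, 9, 3, 1, 8, 2, 6, 5, 4};
    int h_out[n];

    cudaError_t err = rankSortOnDevice(h_in, h_out, n, 2, 5);  // 2 blocks of 5 threads
    if (err != cudaSuccess)
    {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    for (int i = 0; i < n; i++) printf("%d ", h_out[i]);
    printf("\n");  // prints: 1 2 3 3 4 5 6 7 8 9
    return 0;
}
```

The trade-off in this sketch is that every block redundantly loads the whole array, which is only reasonable while the array is small enough to fit in a single block's shared memory.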
Just to try and debug this, I replaced all the code with a[threadId] = threadId and the array was seemingly correct (printing the numbers 0 through 9). So I do not understand why this wouldn't work if the threadId values seem to be correct and it works with one block.
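The debugging check described above can be sketched as a complete program. The host-side code is not part of the original post, so the names `writeThreadIds` and `runIdCheck`, the bounds guard, and the error handling are assumptions added for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Debug kernel: each thread writes its own global index into the array.
__global__ void writeThreadIds(int* a, int n)
{
    int threadId = threadIdx.x + blockDim.x * blockIdx.x;
    if (threadId < n)  // guard added for this sketch; not in the original kernel
    {
        a[threadId] = threadId;
    }
}

// Hypothetical host helper: launches the debug kernel and copies the result back.
cudaError_t runIdCheck(int* h_out, int n, int blocksPerGrid, int threadsPerBlock)
{
    int* d_a = nullptr;
    size_t bytes = n * sizeof(int);

    cudaError_t err = cudaMalloc(&d_a, bytes);
    if (err != cudaSuccess) return err;

    // Fill with 0xFF bytes (every int becomes -1) so untouched slots are visible.
    err = cudaMemset(d_a, 0xFF, bytes);
    if (err == cudaSuccess)
    {
        writeThreadIds<<<blocksPerGrid, threadsPerBlock>>>(d_a, n);
        err = cudaGetLastError();  // catches launch configuration errors
    }
    if (err == cudaSuccess) err = cudaDeviceSynchronize();  // catches execution errors
    if (err == cudaSuccess) err = cudaMemcpy(h_out, d_a, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_a);
    return err;
}

int main()
{
    const int n = 10;
    int h_out[n];

    cudaError_t err = runIdCheck(h_out, n, 2, 5);  // 2 blocks of 5 threads
    if (err != cudaSuccess)
    {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    for (int i = 0; i < n; i++) printf("%d ", h_out[i]);
    printf("\n");  // prints: 0 1 2 3 4 5 6 7 8 9
    return 0;
}
```

This check only touches global memory, so it confirms the global index arithmetic but does not exercise the shared array at all. Checking the error codes after the launch is also a cheap way to tell a failed launch apart from a kernel that ran but left the data unchanged.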