I have a kernel that, for each thread in a given block, computes a for loop with a different number of iterations. I use a buffer of size N_BLOCKS to store the number of iterations required for each block. Hence, each thread in a given block must know the number of iterations specific to its block.
However, I'm not sure which way is the best (performance speaking) to read the value and distribute it to all the other threads. I see only one good way (please tell me if there is something better): store the value in shared memory and have each thread read it. For example:
__global__ void foo( int* nIterBuf )
{
__shared__ int nIter;
if( threadIdx.x == 0 )
nIter = nIterBuf[blockIdx.x];
__syncthreads();
for( int i=0; i < nIter; i++ )
...
}
Any other better solutions? My app will use a lot of data, so I want the best performance.
Thanks!