3

I have a kernel in which each thread of a given block runs a for loop with a different number of iterations. I use a buffer of size N_BLOCKS to store the number of iterations required for each block, so every thread in a given block must learn the iteration count specific to its block.

However, I'm not sure which way is best, performance-wise, to read the value and distribute it to all the other threads. I see only one good way (please tell me if there is something better): store the value in shared memory and have each thread read it. For example:

__global__ void foo( int* nIterBuf )
{
   __shared__ int nIter;

   if( threadIdx.x == 0 )
      nIter = nIterBuf[blockIdx.x];

   __syncthreads();

   for( int i=0; i < nIter; i++ )
      ...
} 

Any other better solutions? My app will use a lot of data, so I want the best performance.

Thanks!

Michael Eilers Smith
  • What if I copied the shared variable to a local variable? Yet, I have a feeling that if I only use it once in the for loop, it's not worth it. There is still the bank conflict in shared memory required for copying the value to local thread memory. – Michael Eilers Smith Dec 31 '11 at 17:38
  • Yes you could store in a register. Not sure if it's going to be really faster, plus it uses a register. And no there won't be any bank conflict there since all threads read from the same address. – user703016 Dec 31 '11 at 17:39
  • But the conflict will then only occur once per block, not once per iteration per block. – Kos Dec 31 '11 at 17:40
  • @Cicada's right, it won't, see http://stackoverflow.com/questions/4396191/gpu-shared-memory-bank-conflict – Kos Dec 31 '11 at 17:42
  • @Kos: I don't think so, because the compiler will optimize and store the value only once. – Michael Eilers Smith Dec 31 '11 at 17:42

2 Answers

5

Read-only values that are uniform across all threads in a block are probably best stored in __constant__ arrays. On some CUDA architectures such as Fermi (SM 2.x), if you declare the array or pointer argument using the C++ const keyword AND you access it uniformly within the block (i.e. the index only depends on blockIdx, not threadIdx), then the compiler may automatically promote the reference to constant memory.

The advantage of constant memory is that it goes through a dedicated cache, so it doesn't pollute the L1. If the amount of data you access per block is relatively small, then after the initial compulsory miss in each thread block, every subsequent access should hit in the constant cache.

You also won't need to use any shared memory or transfer from global to shared memory.
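As a sketch of what both variants might look like (the `MAX_BLOCKS` bound and the kernel names here are illustrative, not from the question):

```cuda
#define MAX_BLOCKS 1024  // illustrative upper bound on the grid size

// Variant 1: an explicit __constant__ array, filled from the host with
// cudaMemcpyToSymbol() before the launch. All reads go through the
// dedicated constant cache.
__constant__ int nIterConst[MAX_BLOCKS];

__global__ void fooConstantArray()
{
    int nIter = nIterConst[blockIdx.x];  // uniform across the block
    for (int i = 0; i < nIter; i++)
        ; // ... loop body elided, as in the question
}

// Variant 2: a const-qualified pointer indexed only by blockIdx; on
// SM 2.x the compiler may promote this load to the constant/uniform path.
__global__ void fooConstPointer(const int* nIterBuf)
{
    int nIter = nIterBuf[blockIdx.x];
    for (int i = 0; i < nIter; i++)
        ; // ...
}
```

For variant 1, the host would copy the per-block counts before launch with something like `cudaMemcpyToSymbol(nIterConst, hostCounts, nBlocks * sizeof(int))`.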

harrism
2

If my info is up to date, shared memory is the second-fastest memory, after registers.

If reading this data from shared memory every iteration slows you down and you still have registers available (refer to your GPU's compute capability and specs), you could perhaps try to store a copy of this value in every thread's register (using a local variable).
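Sketched against the code from the question (the loop body is elided as in the original):

```cuda
__global__ void foo(int* nIterBuf)
{
    __shared__ int nIterShared;

    // One thread loads the per-block count into shared memory.
    if (threadIdx.x == 0)
        nIterShared = nIterBuf[blockIdx.x];

    __syncthreads();

    // Copy to a per-thread local variable, which the compiler can keep
    // in a register, so the loop bound isn't re-read from shared memory
    // on every iteration. (As noted in the comments, the compiler will
    // often perform this hoisting on its own.)
    int nIter = nIterShared;

    for (int i = 0; i < nIter; i++)
        ; // ...
}
```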

Kos
  • So, the best would be to read from global memory, store in shared memory, and then copy to a register. Or should I skip the shared memory? After all, there is no coalesced reading I believe. – Michael Eilers Smith Dec 31 '11 at 17:47
  • Either a) read from global to register or b) use the solution you posted. There *is* coalesced reading, check the CUDA manual. – user703016 Dec 31 '11 at 17:48
  • Actually, I think in this case __constant__ would be preferable to __shared__ -- simpler code, and no wastage of shared memory. See my answer. – harrism Jan 05 '12 at 06:25