
First of all, here is the link to the algorithm:

GPU Gems 3, Chapter 39: Parallel Prefix Sum (Scan) with CUDA.

In order to avoid bank conflicts, padding is added to the shared memory array once every NUM_BANKS (i.e., 32 for devices of compute capability 2.x) elements. This is done as follows (as in Figure 39-5):

int ai = offset*(2*thid+1)-1;
int bi = offset*(2*thid+2)-1;
ai += ai/NUM_BANKS;
bi += bi/NUM_BANKS;
temp[bi] += temp[ai];

I don't understand how ai/NUM_BANKS is equivalent to the macro:

   #define NUM_BANKS 16  
   #define LOG_NUM_BANKS 4  
   #define CONFLICT_FREE_OFFSET(n) \  
          ((n) >> NUM_BANKS + (n) >> (2 * LOG_NUM_BANKS))  

Isn't it equal to

n >> LOG_NUM_BANKS
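
To make my confusion concrete, here is my own arithmetic, taking the quoted macro at face value as (n >> NUM_BANKS) + (n >> (2 * LOG_NUM_BANKS)), with NUM_BANKS = 16 and an arbitrary index n = 33:

    // ai/NUM_BANKS, i.e. n >> LOG_NUM_BANKS:
    //     33 / 16 = 33 >> 4 = 2
    // The macro as quoted:
    //     (33 >> 16) + (33 >> 8) = 0 + 0 = 0

The two give different results, so I don't see how they can be equivalent.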

Any help is appreciated. Thanks.


1 Answer


I wrote that code and co-wrote the article, and I request that you use the article only for learning about scan algorithms, and do not use the code in it. It was written when CUDA was new, and I was new to CUDA. If you use a modern implementation of scan in CUDA you don't need any bank conflict avoidance.

If you want to do scans the easy way, use thrust::inclusive_scan or thrust::exclusive_scan.
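
For example, here is a minimal sketch of the Thrust route (the vector size and contents are just for illustration):

    #include <thrust/device_vector.h>
    #include <thrust/scan.h>

    int main()
    {
        // Eight elements on the device, all set to 1.
        thrust::device_vector<int> data(8, 1);

        // In-place inclusive scan: data becomes {1, 2, 3, 4, 5, 6, 7, 8}.
        thrust::inclusive_scan(data.begin(), data.end(), data.begin());

        // For an exclusive scan ({0, 1, 2, 3, 4, 5, 6, 7}), use instead:
        // thrust::exclusive_scan(data.begin(), data.end(), data.begin());

        return 0;
    }

At this level there is no shared-memory indexing or padding to worry about; Thrust chooses the implementation for you.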

If you really want to implement a scan, refer to more recent articles such as this one [1]. Or, for a real opus with faster code that will require a bit more study, see this one [2]. Or read Sean Baxter's tutorial (though the latter doesn't include citations of the seminal work on the scan algorithm).

[1] Shubhabrata Sengupta, Mark Harris, Michael Garland, and John D. Owens. "Efficient Parallel Scan Algorithms for many-core GPUs". In Jakub Kurzak, David A. Bader, and Jack Dongarra, editors, Scientific Computing with Multicore and Accelerators, Chapman & Hall/CRC Computational Science, chapter 19, pages 413–442. Taylor & Francis, January 2011.

[2] Duane Merrill and Andrew Grimshaw. "Parallel Scan for Stream Architectures". Technical Report CS2009-14, Department of Computer Science, University of Virginia, December 2009.

– harrism
  • I've been doing a bit of research on this as well; if anyone is interested why, see section G.5.3 in the CUDA C Programming Guide: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#shared-memory-5-x According to that section, as long as your device has compute capability 5.x or above, two threads accessing the same bank will not generate a bank conflict (in a simple case such as the scan example mentioned in the question), although multiple threads writing will, as expected, produce undefined behaviour. – Liang May 09 '16 at 04:24