The algorithm I am asking about comes from
GPU Gems 3, Chapter 39: Parallel Prefix Sum (Scan) with CUDA.
To avoid bank conflicts, one element of padding is added to the shared memory array after every NUM_BANKS elements (NUM_BANKS is 32 on devices of compute capability 2.x). As in Figure 39-5, this is done as follows:
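int ai = offset*(2*thid+1)-1;   // unpadded index of the left operand
int bi = offset*(2*thid+2)-1;   // unpadded index of the right operand
ai += ai/NUM_BANKS;             // skip one padding slot per NUM_BANKS elements
bi += bi/NUM_BANKS;
temp[bi] += temp[ai];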
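To make the effect of the padding concrete, here is a minimal host-side sketch in plain C. The helper names (left_index, padded_index, bank_of) are hypothetical and not from GPU Gems, and it assumes 4-byte elements with banks one 4-byte word wide, so that the bank of an element is its index modulo the number of banks.

```c
#include <assert.h>

/* Hypothetical helpers for illustration; the names are not from GPU Gems. */

/* Unpadded index of the left operand for thread `thid` at stride `offset`. */
int left_index(int offset, int thid) { return offset * (2 * thid + 1) - 1; }

/* Add one padding slot for every full group of `num_banks` elements below i. */
int padded_index(int i, int num_banks) { return i + i / num_banks; }

/* Bank of an element, ASSUMING 4-byte elements and 4-byte-wide banks. */
int bank_of(int index, int num_banks) { return index % num_banks; }

/* With 32 banks and offset = 32, threads 0..2 collide without padding
   and land in different banks with it. */
void check_padding_example(void) {
    const int banks = 32, offset = 32;
    int a0 = left_index(offset, 0);  /* 31  */
    int a1 = left_index(offset, 1);  /* 95  */
    int a2 = left_index(offset, 2);  /* 159 */

    /* Unpadded: all three indices map to bank 31. */
    assert(bank_of(a0, banks) == 31);
    assert(bank_of(a1, banks) == 31);
    assert(bank_of(a2, banks) == 31);

    /* Padded: indices 31, 97, 163 map to banks 31, 1, 3. */
    assert(bank_of(padded_index(a0, banks), banks) == 31);
    assert(bank_of(padded_index(a1, banks), banks) == 1);
    assert(bank_of(padded_index(a2, banks), banks) == 3);
}
```

Calling check_padding_example() from a main function runs the checks. The same arithmetic applies on the device, where temp would be the padded shared memory array.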
I don't understand how ai/NUM_BANKS is equivalent to the macro:
#define NUM_BANKS 16
#define LOG_NUM_BANKS 4
#define CONFLICT_FREE_OFFSET(n) \
((n) >> NUM_BANKS + (n) >> (2 * LOG_NUM_BANKS))
Isn't ai/NUM_BANKS (with n in place of ai) simply equal to:
n >> LOG_NUM_BANKS
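As a sanity check of that expectation, here is a minimal host-side sketch in plain C comparing the division with the shift for non-negative indices. The function names are hypothetical, and it assumes NUM_BANKS is a power of two equal to 1 << LOG_NUM_BANKS, as in the defines above.

```c
#include <assert.h>

#define NUM_BANKS 16
#define LOG_NUM_BANKS 4   /* log2(NUM_BANKS) */

/* Padding written as a division, as in ai += ai/NUM_BANKS. */
int offset_by_division(int n) { return n / NUM_BANKS; }

/* Padding written as a shift; requires NUM_BANKS == 1 << LOG_NUM_BANKS. */
int offset_by_shift(int n) { return n >> LOG_NUM_BANKS; }

/* Check that both forms agree for a range of non-negative indices. */
void check_shift_matches_division(void) {
    for (int n = 0; n < 4096; ++n) {
        assert(offset_by_division(n) == offset_by_shift(n));
    }
}
```

Both forms agree over this range. Separately, C gives + higher precedence than >>, so the macro as quoted groups as ((n) >> (NUM_BANKS + (n))) >> (2 * LOG_NUM_BANKS), which is a different expression from either form in the sketch.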
Any help is appreciated. Thanks.