Much part of your question has been already answered in the comments above. I just want to provide some rules that can be useful to you and in general to next users concerning coalesced memory accesses, some examples on shared memory bank conflicts and some rules on avoiding shared memory bank conflicts.
COALESCED MEMORY ACCESSES
1D array - 1D thread grid
gmem[blockDim.x * blockIdx.x + threadIdx.x]
2D array - 2D thread grid
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
int elementPitch = blockDim.x * gridDim.x;
gmem[y][x] or gmem[y * elementPitch + x]
SHARED MEMORY BANK CONFLICTS
To achieve high bandwidth, shared memory is divided into independent banks. In this way, shared memory can serve simultaneous accesses by the threads. Each Streaming Multiprocessor (SM) has shared memory organized in 32
memory banks. Each bank has a bandwidth of 32
bits per two clock cycles and hosts words of four bytes (32
bits): successive 32
-bit word addresses are assigned to successive banks.
A bank conflict occurs when two different threads access different words in the same bank. Bank conflicts adversely impact performance since they enforce the hardware to serialize the access to shared memory. Note that there is no conflict if different threads access any bytes within the same word. Note also that there is no bank conflicts between
threads belonging to different warps.
Fast accesses
- If all threads of a warp access different banks, there is no bank conflict;
- If all threads of a warp access an identical address for a fetch operation, there is no
bank conflict (broadcast).
Slow accesses
32
threads access 32
different words in the same bank, so that all the accesses are serialized;
- Generally speaking, the cost of accessing shared memory is proportional to the maximum number of simultaneous accesses to a single bank.
Example 1
smem[4]: accesses bank #4 (physically, the fifth one – first row)
smem[31]: accesses bank #31 (physically, the last one – first row)
smem[50]: accesses bank #18 (physically, the 19th one – second row)
smem[128]: accesses bank #0 (physically, the first one – fifth row)
smem[178]: accesses bank #18 (physically, the 19th one – sixth row)
If the third thread in a warp accesses myShMem[50]
and the eight thread in the warp access myShMem[178]
, then you have a two-way bank conflict and the two transactions get serialized.
Example 2
Consider the following type of accesses
__shared__ float smem[256];
smem[b + s * threadIdx.x]
To have a bank conflict between two threads t1
and t2
of the same warp, the following conditions must hold
b + s * t2 = b + s * t1 + 32 * k, with k positive integer
0 <= t2 - t1 < 32
The above mean
32 * k = s * (t2 - t1)
0 <= t2 - t1 < 32
These two conditions do not hold true, namely no bank conflict, if s
is odd.
Example 3
From Example 2, the following access
smem[b + threadIdx.x]
leads to no conflicts if smem
is of a 32
-bits data type. But also
extern __shared__ char smem[];
foo = smem[baseIndex + threadIdx.x];
and
extern __shared__ short smem[];
foo = smem[baseIndex + threadIdx.x];
lead to no bank conflicts, since one byte/thread is accessed and so different bytes of the same word are accessed.