
I have some questions about the most convenient global and shared memory access layouts in CUDA.

GLOBAL MEMORY

  1. How are the memory addresses (0,0), (0,1), (1,0) and (1,1) arranged in CPU memory and GPU memory? In other words, in what order are they stored?

  2. Which is the row index and which is the column index in (m, n)?

  3. Is global memory coalescing achieved by accessing elements in column-major order or row-major order?

SHARED MEMORY

  1. How do bank conflicts arise or not arise? Please explain with some examples/cases.

  2. What is the command to configure the shared memory / L1 split out of the total 64 KB, and where should that command be placed?

paleonix
robot
    One question per question please: http://stackoverflow.com/faq – Paul R Oct 23 '12 at 05:46
  • Ok, but actually I was narrowing down my description of problem for being specific, and all those are sub-questions of concept; so thought to write in that way.. – robot Oct 23 '12 at 05:49
  • 1
    For two-dimensional arrays in C: The first index is the row index, the second index is the column index. Elements in adjacent columns are stored in adjacent locations in memory. For a 2-dimensional array of threads in CUDA, adjacent threads in x will be grouped into warps. Therefore, to achieve coalesced access, we may want to access a 2-dimensional array as C[threadIdx.y][threadIdx.x] or similar. For a multidimensional example refer to the first example I posted in [this SO question](http://stackoverflow.com/questions/12924155/sending-3d-array-to-cuda-kernel). It should coalesce. – Robert Crovella Oct 23 '12 at 06:28
  • for the shared memory, you may be interested in [this webinar](http://developer.download.nvidia.com/CUDA/training/sharedmemoryusage_july2011.mp4) or slides 35-44 from [here](http://developer.download.nvidia.com/CUDA/training/NVIDIA_GPU_Computing_Webinars_CUDA_Memory_Optimization.pdf). For the shared memory configure command, refer to the compute architecture sections of the C programming guide (example: [CC2.0](http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capability-2-x)). The actual cuda runtime function is cudaFuncSetCacheConfig, you can just google that also. – Robert Crovella Oct 23 '12 at 15:32
  • The documentation for shared memory is in the CUDA C Programming Guide. Links to section for compute capability [2.x](http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#shared-memory-2-x) and [3.x](http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#shared-memory-3-0) devices. The [CUDA Shared Memory](http://developer.download.nvidia.com/CUDA/training/sharedmemoryusage_july2011.mp4) webinar may also be helpful. – Greg Smith Oct 23 '12 at 15:32
  • -thanks.i have partially understood. My doubts for points 2,3 & 4 in A are getting cleared.But still I am not able to understand concepts about point 1. a)How is the memory layout of data? is it (0,0),(0,1),(1,0),(1,1) or it is(0,0),(1,0),(0,1)and (1,1)? b)What is the index of element stored in matrix at 2nd row and 3rd column?(considering we are starting from row 0 and element(0,0) B)I am reading the links sent by you and trying to understand. – robot Oct 24 '12 at 05:22
  • (0,0) = row zero, column zero. (0,1) = row zero, column one. Adjacent columns (in the same row) are stored in adjacent memory locations. Columns zero and one are adjacent. Therefore these 2 elements will be adjacent. If we extend this, we see that the order in memory is (0,0),(0,1),(1,0),(1,1). This is a characteristic of C or C++, not CUDA. CUDA C or C++ behaves the same way as C or C++ in this respect. – Robert Crovella Oct 24 '12 at 16:32
  • @RobertCrovella: thanks for the information. A)So I understood that for a given matrix |1 2 3| |4 5 6| |7 8 9| element 1 is at location-1 of memory(0,0), element 2 at location-2 of memory (0,1) and they are adjacent in memory. Is my understanding right? B)Going further elements in memory are located in this way 1,2,3,4,5,6,7,8,9. Am I right? – robot Oct 26 '12 at 01:23
  • I don't understand what you mean by memory(0,0) or memory(0,1). I thought you were using the parenthesis to refer to matrix subscript notation. I don't follow you. Perhaps it's best if you read up on [arrays in C](http://www.mycplus.com/tutorials/c-programming-tutorials/arrays/). You can google this also, there are many resources available. – Robert Crovella Oct 26 '12 at 02:47
  • @RobertCrovella: yah, actually it is the element of matrix. Actually, I was clarifying my thought about layout of elements of matrix in memory, but I haven't presented properly. But I understood it. I will surely go through the link. – robot Oct 26 '12 at 20:04

1 Answer


Much of your question has already been answered in the comments above. Here I just want to provide some rules that can be useful to you, and to future readers, concerning coalesced memory accesses, along with some examples of shared memory bank conflicts and some rules for avoiding them.

COALESCED MEMORY ACCESSES

1D array - 1D thread grid

gmem[blockDim.x * blockIdx.x + threadIdx.x]

2D array - 2D thread grid

int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
int elementPitch = blockDim.x * gridDim.x;
gmem[y][x] or gmem[y * elementPitch + x]

SHARED MEMORY BANK CONFLICTS

To achieve high bandwidth, shared memory is divided into independent banks. In this way, shared memory can serve simultaneous accesses by the threads. Each Streaming Multiprocessor (SM) has shared memory organized in 32 memory banks. Each bank has a bandwidth of 32 bits per two clock cycles and hosts words of four bytes (32 bits): successive 32-bit word addresses are assigned to successive banks.

A bank conflict occurs when two different threads access different words in the same bank. Bank conflicts adversely impact performance since they force the hardware to serialize the accesses to shared memory. Note that there is no conflict if different threads access any bytes within the same word. Note also that there are no bank conflicts between threads belonging to different warps.

Fast accesses

  • If all threads of a warp access different banks, there is no bank conflict;
  • If all threads of a warp access an identical address for a fetch operation, there is no bank conflict (broadcast).

Slow accesses

  • If 32 threads access 32 different words in the same bank, all the accesses are serialized;
  • Generally speaking, the cost of accessing shared memory is proportional to the maximum number of simultaneous accesses to a single bank.

Example 1

smem[4]:   accesses bank #4  (physically, the fifth one – first row)

smem[31]:  accesses bank #31 (physically, the last one  – first row)

smem[50]:  accesses bank #18 (physically, the 19th one  – second row)

smem[128]: accesses bank #0  (physically, the first one – fifth row)

smem[178]: accesses bank #18 (physically, the 19th one  – sixth row)

If the third thread in a warp accesses smem[50] and the eighth thread in the warp accesses smem[178], then you have a two-way bank conflict and the two transactions are serialized.

Example 2

Consider the following type of access

__shared__ float smem[256];
smem[b + s * threadIdx.x]

To have a bank conflict between two threads t1 and t2 of the same warp, the following conditions must hold

b + s * t2 = b + s * t1 + 32 * k, with k positive integer
0 <= t2 - t1 < 32

The above conditions are equivalent to

32 * k = s * (t2 - t1)
0 <= t2 - t1 < 32

These two conditions cannot both hold if s is odd: 32 * k = s * (t2 - t1) would then require t2 - t1 to be a multiple of 32, which is impossible within a warp. Hence, an odd stride s guarantees no bank conflicts.

Example 3

From Example 2, the following access

smem[b + threadIdx.x]

leads to no conflicts if smem is of a 32-bit data type. But also

extern __shared__ char smem[];
foo = smem[baseIndex + threadIdx.x];

and

extern __shared__ short smem[];
foo = smem[baseIndex + threadIdx.x];

lead to no bank conflicts either, since each thread accesses one byte (or one two-byte short), so different threads access different bytes (or halves) of the same 32-bit word, and same-word accesses do not conflict.

Vitality