I am completely new to CUDA programming and I want to make sure I understand some basic memory-related principles, as I was a little confused by some of my own thoughts.

I am working on a simulation that uses billions of one-time-use random numbers in the range 0 to 50.

After cuRAND fills a huge array with random 0.0 to 1.0 floats, I run a kernel that converts all this float data to the desired integer range. From what I have learned, I had the feeling that storing 5 of these values in one unsigned int, using just 6 bits each, would be better because of the very low bandwidth of global memory. So I did it.
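
Roughly what I am doing (a simplified sketch; the kernel and array names are invented for this post, and 5 values at 6 bits each use 30 of the 32 bits):

```
__global__ void convertAndPack(const float *rnd, unsigned int *packed, int nPacked)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nPacked) {
        unsigned int word = 0;
        for (int j = 0; j < 5; ++j) {
            // scale a 0.0-1.0 float to an integer in 0..50
            unsigned int v = (unsigned int)(rnd[5 * i + j] * 50.0f);
            word |= (v & 0x3Fu) << (6 * j);   // 6 bits per value, 30 bits used
        }
        packed[i] = word;
    }
}
```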

Now I have to store somewhere around 20000 read-only yes/no values, which will be accessed randomly, with (let's say) uniform probability, driven by the random values going into the simulation.

First I thought about shared memory. Using one bit per value looked great until I realized that the more information packed into one bank, the more collisions there will be. So the solution seems to be to use an unsigned short (2 bytes each, 40KB total) to represent each yes/no value, using the maximum available memory and thus minimizing the probability of different threads reading the same bank.
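
Something like this is what I have in mind (simplified; 20000 × 2 bytes is about 40KB of shared memory, names invented):

```
#define NUM_FLAGS 20000

__global__ void simulate(const unsigned short *flagsGlobal,
                         const unsigned int *indices, int n, int *out)
{
    // one unsigned short per yes/no value: ~40KB, near the shared memory limit
    __shared__ unsigned short flags[NUM_FLAGS];

    // cooperative load: each thread copies a strided slice of the table
    for (int i = threadIdx.x; i < NUM_FLAGS; i += blockDim.x)
        flags[i] = flagsGlobal[i];
    __syncthreads();

    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < n)
        out[t] = flags[indices[t]];   // random-index read; bank conflicts depend on the indices
}
```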

Another thought came from using constant memory and the L1 cache. Here, from what I have learned, the approach would be exactly the opposite of shared memory: reading the same location from multiple threads is now desirable, so putting 32 values in one 4-byte word is now optimal.
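
The bit-packed constant memory variant would then be something like this (again just a sketch; 20000 bits fit in 625 unsigned ints, about 2.5KB of the 64KB constant space):

```
// 20000 one-bit flags packed 32 per unsigned int
__constant__ unsigned int flagBits[(20000 + 31) / 32];

__device__ bool readFlag(unsigned int idx)
{
    // fast only if all threads of a warp read the same word in a given cycle
    return (flagBits[idx / 32] >> (idx % 32)) & 1u;
}
```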

So based on overall probabilities I should decide between shared memory and the cache, with shared memory probably being better, since with so many yes/no values and just tens or hundreds of threads per block there will not be many bank collisions.

But is my general understanding of the problem right? Are the processors really so fast compared to memory that optimizing memory accesses is crucial, and that the extra instructions spent extracting data with operations like <<, >>, |= and &= are not important?

Sorry for the wall of text. There are a million ways I can make the simulation work, but only a few ways of making it work right. I just don't want to keep repeating some stupid mistake only because I understood something badly.


1 Answer

The first two optimization priorities for any CUDA programmer are:

  1. Launch enough threads (i.e. expose enough parallelism) so that the machine has sufficient work to be able to hide latency.

  2. Make efficient use of memory.

The second item above will have various guidelines based on the types of GPU memory you are using, such as:

  1. Strive for coalesced access to global memory (see the sketch after this list)
  2. Take advantage of caches and shared memory to optimize the reuse of data
  3. For shared memory, strive for non-bank-conflicted access
  4. For constant memory, strive for uniform access (all threads within a warp read from the same location, in a given access cycle)
  5. Consider use of texture caching and other special GPU features.
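
To illustrate the first guideline, here is a minimal made-up pair of kernels; the first allows a warp's 32 loads to combine into a few transactions, the second does not:

```
// coalesced: adjacent threads read adjacent 4-byte words
__global__ void coalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}

// strided: adjacent threads read words 32 elements apart, so a warp
// touches many separate memory segments and wastes most of the bandwidth
__global__ void strided(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * 32 < n) out[i * 32] = in[i * 32] * 2.0f;
}
```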

Your question is focused on memory:

Are the processors really so fast compared to memory that optimizing memory accesses is crucial, and that the extra instructions spent extracting data with operations like <<, >>, |= and &= are not important?

Optimizing memory access is usually quite important for code that runs fast on a GPU. Most of the recommendations/guidelines I gave above are not focused on the idea that you should pack multiple data items per int (or some other word quantity); however, it's not out of the question to do so for a memory-bound code. Many programs on the GPU end up being memory-bound (i.e. their usage of memory is ultimately the performance limiter, not their usage of the compute resources on the GPU).

An example of a domain where developers are looking at packing multiple data items per word quantity is the Deep Learning space, specifically with convolutional neural networks on GPUs. Developers have found, in some cases, that they can make their training codes run faster by following this cycle:

  1. Store the data set as packed half (16-bit float) quantities in GPU memory
  2. Load the data set in packed form
  3. Unpack (and convert) each packed set of 2 half quantities to 2 float quantities
  4. Perform computations on the float quantities
  5. Convert the float results to half and pack them
  6. Store the packed quantities
  7. Repeat the process for the next training cycle

(See this answer for some additional background on the half datatype.)
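
A minimal sketch of that load/unpack/compute/repack cycle, using the half2 intrinsics from cuda_fp16.h (the kernel itself is illustrative, not taken from any particular training code):

```
#include <cuda_fp16.h>

__global__ void scaleHalfData(__half2 *data, float scale, int nPairs)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nPairs) {
        float2 f = __half22float2(data[i]);   // unpack 2 half -> 2 float
        f.x *= scale;                         // compute in float precision
        f.y *= scale;
        data[i] = __float22half2_rn(f);       // repack 2 float -> 1 half2
    }
}
```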

The point is that efficient use of memory is a high priority for making codes run fast on GPUs, and this may include using packed quantities in some cases.

Whether or not it's sensible for your case is probably something you would have to benchmark and compare to determine the effect on your code.

Some other comments based on reading your question:

Here, from what I have learned, the approach would be exactly the opposite of shared memory.

  1. In some ways, the use of shared memory and constant memory is "opposite" from an access-pattern standpoint. Efficient use of constant memory involves uniform access, as already stated. In the case of uniform access, shared memory should also perform quite well, because it has a broadcast feature: on newer GPUs, when the accesses from a warp come from the same actual location (as opposed to just the same bank), the value is broadcast and the bank-conflict issue is negated. So you are right to be concerned about bank conflicts in shared memory, but broadcast covers the uniform-access case. These nuances are covered in many other questions about shared memory on SO, so I won't go into them further here.

  2. Since you are packing 5 integer quantities (0..50) into a single int, you might also consider (maybe you are doing this already) having each thread perform the computations for 5 quantities. This would eliminate the need to share that data across multiple threads, or to have multiple threads read the same location. If you're not doing this already, there may be some efficiency to be gained by having each thread perform the computation for multiple data points (see the sketch after this list).

  3. If you wanted to consider switching to storing 4 packed quantities (per int) instead of 5, there might be some opportunities to use the SIMD intrinsics, depending on what exactly you are doing with those packed quantities.
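
As a sketch of point 2 (a made-up kernel, just to show the shape of the idea): each thread loads one packed word with a single coalesced read and then processes all 5 of the 6-bit values it contains, so no data needs to be shared across threads:

```
__global__ void processPacked(const unsigned int *packed, int nPacked,
                              unsigned long long *result)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nPacked) {
        unsigned int word = packed[i];              // one coalesced 4-byte load
        unsigned int sum = 0;
        for (int j = 0; j < 5; ++j)
            sum += (word >> (6 * j)) & 0x3Fu;       // extract value j
        atomicAdd(result, (unsigned long long)sum); // placeholder for real work
    }
}
```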

– Robert Crovella
  • Just to add, I would like to point out that one of the most performance-degrading things you can do on a GPU is branching. Avoid it! – user3427419 Jan 04 '16 at 20:33