In which specific scenario would read only data cache outperform global memory access?

Question

Alright, my question may be general, since I don't have a specific problem right now.

However, in my experience, I have never seen CUDA's read-only data cache outperform other kinds of memory access, such as global memory or constant memory. At best, the read-only data cache is only as fast as direct non-coalesced global memory access, which makes me feel I might have done something wrong.

So my question is: in what scenario would the read-only data cache be faster than other types of memory access?

How did you measure the performance of read only cache accesses versus direct non-coalesced global memory accesses? Which cache are you talking about? Texture cache, constant cache, L1 / L2 cache since Fermi? — pQB, Mar 02 '15 at 10:26
Hi pQB, the read-only data cache is a new cache introduced in NVIDIA's GK110 GPUs; you may refer to this link and see section 1.4.4.3: http://docs.nvidia.com/cuda/kepler-tuning-guide/index.html#read-only-data-cache. I measure the performance simply by taking into account the whole execution time of the kernel. — Xiangyu.Guo, Mar 02 '15 at 11:19

score 4 · Accepted Answer · edited May 23 '17 at 10:09

The GK110 devices have, by default, the L1 cache disabled for ordinary global accesses. This means that global reads may be cached in L2 but not in L1. The L2 cache has a longer access latency than L1.

If your data is read-only, and the compiler is able to discover it or you assist the compiler with appropriate decoration of global pointer kernel parameters with const ... __restrict__ ..., then the read-only cache may be used. If it is used, the access latency will be closer to L1 type latency for items which hit in the read-only cache, as opposed to L2 type latency for items that only hit in the L2 cache.
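As a minimal sketch of the decoration described above (not the code from the linked answer): the kernel names, the scaling operation, and the host helper `run_scale_check` are hypothetical and chosen only for illustration. The first kernel relies on `const` plus `__restrict__` so the compiler may choose the read-only path; the second requests it explicitly with the `__ldg()` intrinsic, which requires compute capability 3.5 or newer (for example, compile with `nvcc -arch=sm_35` or later).

```cuda
#include <cuda_runtime.h>
#include <vector>

// const + __restrict__ tells the compiler that `in` is read-only and does not
// alias `out`, so it MAY (not must) load `in` through the read-only data cache.
__global__ void scale_restrict(const float* __restrict__ in,
                               float* __restrict__ out,
                               float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = factor * in[i];
}

// Alternative: ask for the read-only path explicitly with __ldg().
__global__ void scale_ldg(const float* in, float* out, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = factor * __ldg(&in[i]);
}

// Hypothetical host helper: runs both kernels and checks the results.
bool run_scale_check(int n)
{
    std::vector<float> h_in(n, 2.0f), h_out(n, 0.0f);
    size_t bytes = n * sizeof(float);
    float *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return false; }
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;

    scale_restrict<<<blocks, threads>>>(d_in, d_out, 3.0f, n);
    cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    bool ok = (h_out[0] == 6.0f) && (h_out[n - 1] == 6.0f);

    scale_ldg<<<blocks, threads>>>(d_in, d_out, 0.5f, n);
    cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    ok = ok && (h_out[0] == 1.0f) && (h_out[n - 1] == 1.0f);

    cudaFree(d_in);
    cudaFree(d_out);
    return ok;
}
```

Each thread here reads a different element exactly once, so this sketch only shows how to opt in to the read-only path; it is not expected to show a speedup by itself.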

Caches generally only have an impact on code performance in the situation where there is data re-use. If your device code only reads from a particular global variable once, there is unlikely to be any cache benefit.
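To make the point about data re-use concrete, here is a rough sketch of an access pattern where many threads read the same input elements. The kernel name, the `RADIUS` value, and the host helper `run_stencil_check` are assumptions made up for this illustration. Each input element is read by up to `2 * RADIUS + 1` neighbouring threads, so repeated loads have a chance to hit in the read-only cache. Whether that yields a measurable gain depends on the device and the compiler, and no timing is claimed here.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define RADIUS 8  // hypothetical stencil radius, chosen only for illustration

// Each output is the sum of up to 2*RADIUS+1 neighbouring inputs, so the same
// input values are loaded repeatedly by nearby threads (data re-use).
__global__ void stencil_sum(const float* __restrict__ in,
                            float* __restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float acc = 0.0f;
    for (int d = -RADIUS; d <= RADIUS; ++d) {
        int j = i + d;
        if (j >= 0 && j < n) acc += in[j];  // skip out-of-range neighbours
    }
    out[i] = acc;
}

// Hypothetical host helper: fills the input with ones, so every output equals
// the number of in-range neighbours, then checks an edge and an interior value.
bool run_stencil_check(int n)
{
    if (n <= 4 * RADIUS) return false;  // keep the interior check meaningful
    std::vector<float> h_in(n, 1.0f), h_out(n, 0.0f);
    size_t bytes = n * sizeof(float);
    float *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return false; }
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    stencil_sum<<<blocks, threads>>>(d_in, d_out, n);
    cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);

    bool ok = (h_out[0] == float(RADIUS + 1)) &&
              (h_out[n / 2] == float(2 * RADIUS + 1));
    cudaFree(d_in);
    cudaFree(d_out);
    return ok;
}
```

One way to experiment is to time `stencil_sum` with and without the `const ... __restrict__` qualifiers on a GK110-class device and compare; a plain copy kernel, which reads each element once, would be the no-re-use baseline.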

If you want to see a specific code example, take a look at the answer I provided here. When I remove the const __restrict__ qualifiers from the kernel parameters, I see a performance difference on K40c (and I documented the difference in my answer there).

Hi Robert, your answer is awesome. Just one day before you answered this, I happened to run into the same situation you described and saw a significant performance improvement, which proves the correctness of your answer as well. Thanks! — Xiangyu.Guo, Mar 04 '15 at 11:34
