On GK110 devices, L1 caching is disabled by default for ordinary global accesses. This means that global reads may be cached in L2 but not in L1, and the L2 cache has a longer access latency than L1.
If your data is read-only, and either the compiler is able to discover this or you assist the compiler by decorating the global pointer kernel parameters with const ... __restrict__ ..., then the read-only cache may be used. If it is used, the access latency will be closer to L1-type latency for items that hit in the read-only cache, as opposed to L2-type latency for items that only hit in the L2 cache.
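To make the decoration concrete, here is a minimal sketch, not taken from the linked answer. The kernel names, the `run_scale_demo` helper, and the problem size are all hypothetical. The first kernel uses the `const ... __restrict__` decoration described above, which allows (but does not force) the compiler to use the read-only cache. The second uses the `__ldg()` intrinsic, available on compute capability 3.5 and higher, to request the read-only data path explicitly.

```cuda
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: `in` is only read and is promised not to alias `out`.
// With this decoration the compiler MAY route loads of `in` through the
// read-only cache; it is a permission, not a guarantee.
__global__ void scale_restrict(const float* __restrict__ in,
                               float* __restrict__ out,
                               float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = factor * in[i];
}

// Same computation, but the read-only load is requested explicitly via
// __ldg(), which needs compute capability 3.5+ (for example -arch=sm_35).
__global__ void scale_ldg(const float* in, float* out, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = factor * __ldg(&in[i]);
}

// Hypothetical host helper: runs both kernels and checks the results.
bool run_scale_demo()
{
    const int n = 1 << 20;
    const float factor = 2.0f;
    const size_t bytes = n * sizeof(float);
    std::vector<float> h_in(n), h_out(n);
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i);

    float *d_in = NULL, *d_out = NULL;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return false; }
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    bool ok = true;

    for (int variant = 0; variant < 2; ++variant) {
        cudaMemset(d_out, 0, bytes);
        if (variant == 0)
            scale_restrict<<<blocks, threads>>>(d_in, d_out, factor, n);
        else
            scale_ldg<<<blocks, threads>>>(d_in, d_out, factor, n);
        if (cudaDeviceSynchronize() != cudaSuccess) ok = false;
        cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
        // Both variants must produce the same values; only the load path differs.
        for (int i = 0; i < n; ++i)
            if (h_out[i] != factor * h_in[i]) ok = false;
    }

    cudaFree(d_in);
    cudaFree(d_out);
    return ok;
}
```

Calling `run_scale_demo()` from `main` and building with something like `nvcc -arch=sm_35` (an assumption that fits a K40c-era toolkit) should be enough to try it. Whether the first kernel actually uses the read-only path can be checked in the generated SASS, for example with `cuobjdump -sass`.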
Caches generally only have an impact on code performance in the situation where there is data re-use. If your device code only reads from a particular global variable once, there is unlikely to be any cache benefit.
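As a rough illustration of the re-use point, the sketch below contrasts two hypothetical kernels (the names and the `run_reuse_demo` helper are invented for this example). In the first, each input element is read exactly once across the whole grid. In the second, a 3-point stencil, each interior input element is read by three different threads, so there is re-use for a cache to exploit. Whether that re-use actually turns into cache hits depends on the access pattern and the cache capacity, so treat this as a sketch of the idea rather than a benchmark.

```cuda
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// Each element of `in` is read exactly once: no re-use for a cache to serve.
__global__ void copy_once(const float* __restrict__ in,
                          float* __restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Each interior element of `in` is read by three neighbouring threads,
// so the same data is requested more than once.
__global__ void stencil3(const float* __restrict__ in,
                         float* __restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < n - 1)
        out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0f;
}

// Hypothetical host helper: runs both kernels on an all-ones input and
// checks correctness only (it does not measure performance).
bool run_reuse_demo()
{
    const int n = 1 << 16;
    const size_t bytes = n * sizeof(float);
    std::vector<float> h_in(n, 1.0f), h_out(n);

    float *d_in = NULL, *d_out = NULL;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return false; }
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    bool ok = true;

    copy_once<<<blocks, threads>>>(d_in, d_out, n);
    if (cudaDeviceSynchronize() != cudaSuccess) ok = false;
    cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i)
        if (h_out[i] != 1.0f) ok = false;

    cudaMemset(d_out, 0, bytes);  // boundary outputs stay zero in the stencil
    stencil3<<<blocks, threads>>>(d_in, d_out, n);
    if (cudaDeviceSynchronize() != cudaSuccess) ok = false;
    cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    for (int i = 1; i < n - 1; ++i)
        if (h_out[i] != 1.0f) ok = false;   // (1 + 1 + 1) / 3 is exactly 1
    if (h_out[0] != 0.0f || h_out[n - 1] != 0.0f) ok = false;

    cudaFree(d_in);
    cudaFree(d_out);
    return ok;
}
```

Timing the two kernels with and without the `const __restrict__` qualifiers (for example with `nvprof` or CUDA events) is one way to see whether the read-only cache makes a difference for a given access pattern.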
If you want to see a specific code example, take a look at the answer I provided here. When I remove the const __restrict__ qualifiers from the kernel parameters, I see a performance difference on K40c (and I documented the difference in my answer there).