
If a thread is accessing global memory, why does it access a large chunk? Where is this large chunk stored?

If you're reading from global memory in a coalesced manner, would it be beneficial to copy a common chunk of the global memory into shared memory, or would there be no improvement?

I.e., if each thread is reading the next 5, 10 or 100 memory locations and averaging them, and you could fit a chunk of X points from global memory into shared memory, could you not write an if statement saying: if you're looking for one of these memory values, read from shared memory rather than global? I'm assuming the warp-divergence penalty would be less than the cost of reading from global memory each time.
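
Below is a minimal sketch of the idea described above, assuming a hypothetical movingAverage kernel in which each thread averages the next N values after its own index. The block stages a tile of global memory (plus a small halo) into shared memory with coalesced loads, and each thread then reads its N values from shared memory; the kernel name, N and BLOCK_SIZE are illustrative, not taken from the question.

```cuda
// Hypothetical moving-average kernel: each thread averages the N input
// values starting at its own global index. The block first stages a
// BLOCK_SIZE + N tile of global memory into shared memory, so the N reads
// per thread hit on-chip memory instead of going back to global memory.
// Assumes the kernel is launched with blockDim.x == BLOCK_SIZE.
#define N 5
#define BLOCK_SIZE 256

__global__ void movingAverage(const float* in, float* out, int len)
{
    __shared__ float tile[BLOCK_SIZE + N];   // tile plus halo for the last threads

    int gid = blockIdx.x * blockDim.x + threadIdx.x;

    // Coalesced load of the block's tile.
    if (gid < len)
        tile[threadIdx.x] = in[gid];

    // The first N threads also load the halo just past the end of the tile.
    if (threadIdx.x < N && gid + BLOCK_SIZE < len)
        tile[BLOCK_SIZE + threadIdx.x] = in[gid + BLOCK_SIZE];

    __syncthreads();   // make the staged data visible to the whole block

    if (gid + N <= len) {
        float sum = 0.0f;
        for (int i = 0; i < N; ++i)
            sum += tile[threadIdx.x + i];    // reuse data already on chip
        out[gid] = sum / N;
    }
}
```

Note that no if statement choosing between shared and global memory is needed: once the tile is staged, every thread reads only from shared memory, so there is no extra warp divergence beyond the usual boundary checks.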


Hans Rudel

1 Answer


When you read from global memory, the data are searched first in the L1 cache (high bandwidth, about 1,600 GB/s on Fermi, but limited in size, 48 KB on Fermi); then, if not present in L1, they are searched in L2 (lower bandwidth, but larger than L1: 768 KB on Fermi); and finally, if not present in L2, they are loaded from global memory*.

When a global memory load occurs, the data are moved into L2 and then into L1, so that they can be accessed more quickly the next time a global memory read is required.

Such data may or may not be evicted by a subsequent global memory load. So, in principle, if you are reading "small" chunks of data, you do not necessarily need to force the data into shared memory to access them quickly the next time.

Take into account that, on Fermi and Kepler, shared memory is built from the same circuitry as the L1 cache. You can therefore regard shared memory as a controlled L1 cache.

You should force the data to reside in shared memory, so as to be sure that they stay in, say, the "fastest available cache", whenever you need to access the same data multiple times.
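
Since shared memory and the L1 cache occupy the same on-chip storage on Fermi and Kepler, the split between them can also be requested per kernel through the CUDA runtime. A minimal sketch, in which the kernel name myKernel is just a placeholder:

```cuda
// On Fermi/Kepler the 64 KB of on-chip memory per SM is partitioned between
// L1 and shared memory; the preferred partition can be requested per kernel.
#include <cuda_runtime.h>

__global__ void myKernel(float* data)
{
    // ... kernel body that stages data explicitly in shared memory ...
}

int main()
{
    // Favor shared memory (e.g. 48 KB shared / 16 KB L1) for kernels that
    // stage data explicitly in shared memory,
    cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);

    // or favor L1 (e.g. 16 KB shared / 48 KB L1) for kernels that rely on
    // implicit caching of global loads:
    // cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);

    return 0;
}
```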

Note that the above is the general philosophy behind global memory transfers. Implementation details can differ depending on the underlying architecture.

*It should be noticed that L1 caching of global memory loads can be disabled by a compiler option. This is useful in terms of performance for random access patterns.
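
For reference, this option is passed to ptxas through nvcc; a hedged example of the two caching modes for global loads (the file names are illustrative, and the exact behavior depends on the compute capability, as discussed in the comments below):

```sh
# Default on compute capability 2.x: cache global loads in both L1 and L2.
nvcc -Xptxas -dlcm=ca -o kernel kernel.cu

# Cache global loads in L2 only (bypass L1), which can help for
# scattered / random access patterns.
nvcc -Xptxas -dlcm=cg -o kernel kernel.cu
```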

Vitality
  • I believe a read from global is read from L1 cache __only__ if L1 is __enabled__ with a compilation flag. So what you say in the first paragraph is not entirely true. – KiaMorot Jun 14 '13 at 13:00
  • 1
    Quoting Rob Farber, _CUDA Application Design and Development_, page 113, "All data loads and stores go through the L2 cache". You typically disable L1 cache in the case of completely random memory accesses, see [CUDA programming - L1 and L2 caches](http://stackoverflow.com/questions/10180949/cuda-programming-l1-and-l2-caches). – Vitality Jun 14 '13 at 13:12
  • Of course. I said that your first paragraph is not entirely/necessarily true. – KiaMorot Jun 14 '13 at 13:19
  • I edited my answer to account for the possibility of disabling the L1 cache line. – Vitality Jun 14 '13 at 13:36
  • Thanks very much for clearing that up, +1 and selected answer. It's pretty cool how GPUs work. – Hans Rudel Jun 14 '13 at 15:55
  • 1
    For compute capability 2.* devices global memory is cached by default. The flag -dlcm=cg can be used to only cache in L2. For compute capability 3.* devices global memory is only cached in L2. On compute capability 3.5 devices global memory can be accessed through texture using LDG instruction. If you are doing a moving average for all 32 threads per warp it may be faster to use shared memory as shared memory will require only 1 issue per access and L1 will require 2 issues as the data will span 2 cache lines for 4 out of 5 accesses. – Greg Smith Jun 14 '13 at 22:20
  • @GregSmith Following your comment, I have introduced a disclaimer that the answer refers to the general philosophy behind global memory transfers and implementation details can differ depending on the underlying architecture. – Vitality Jun 15 '13 at 18:34
  • Are coalesced reads necessary to maximize L1 or L2 cache bandwidth and/or latency? Also, does disabling the L1 cache line on CUDA Compute Devices of 3.* or greater result in any performance increase? – tantrev Jan 09 '15 at 21:08