These are my assumptions:

  1. There are two types of loads, cached and uncached. In the first one, the traffic goes through L1 and L2, while in the second one, the traffic goes only through L2.
  2. The default behaviour in Compute Capability 6.x and 7.x is cached accesses.
  3. An L1 cache line is 128 bytes and an L2 cache line is 32 bytes, so for every L1 transaction generated, there should be four L2 transactions (one per sector).
  4. In Nsight, an SM->TEX Request means a warp-level instruction merged from 32 threads. L2->TEX Returns and TEX->SM Returns are measures of how many sectors are transferred between each memory unit.
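
For context, the kind of kernel I have in mind is essentially a plain coalesced copy; the sketch below is only illustrative (the kernel and names are made up):

```cuda
// Illustrative only: with fully coalesced 32-bit loads, each warp reads
// 32 consecutive floats (128 bytes), i.e. one L1 cache line = four 32-byte
// sectors, so I would expect one SM->TEX request and four L2->TEX sector
// returns per warp on a cold cache.
__global__ void coalesced_copy(const float* __restrict__ in,
                               float* __restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}
```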

Assuming Compute Capability 7.5, these are my questions:

  1. The third assumption seems to imply that L2->TEX Returns should always be a multiple of four for global cached loads, but that's not always the case. What is happening here?
  2. Is there still a point in marking pointers with const and __restrict__ qualifiers? That used to be a hint to the compiler that the data is read-only and therefore can be cached in L1/texture cache, but now all data is cached there, both read-only and not read-only.
  3. From my fourth assumption, I would think that whenever TEX->SM Returns is greater than L2->TEX Returns, the difference comes from cache hits. That's because when there's a cache hit, you get some sectors read from L1, but none from L2. Is this true?
rm95

1 Answer


CC 6.x/7.x

  • L1 cache line size is 128 bytes, divided into four 32-byte sectors. On a miss, only the addressed sectors will be fetched from L2.
  • L2 cache line size is 128 bytes, divided into four 32-byte sectors.
    • CC 7.0 (HBM): 64B promotion is enabled. A miss to the lower 64 bytes of a cache line fetches the lower 64 bytes from DRAM; a miss to the upper 64 bytes fetches the upper 64 bytes.
    • CC 6.x/7.5: only the accessed 32B sectors will be fetched from DRAM.
  • In terms of L1 cache policy:
    • CC 6.0 has load caching enabled by default.
    • CC 6.1/6.2 has load caching disabled by default - see the programming guide.
    • CC 7.x has load caching enabled by default - see PTX for details on cache control.
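
The default policy can also be overridden. At whole-program granularity this is done with ptxas's `-dlcm` flag; per access, recent CUDA toolkits expose cache-hint load intrinsics. The kernel below is just a sketch (names are mine):

```cuda
// Compile-time control of the global-load cache policy (ptxas flags):
//   nvcc -Xptxas -dlcm=ca ...   // cache global loads in L1 and L2
//   nvcc -Xptxas -dlcm=cg ...   // cache in L2 only (bypass L1)
// Per-access control, assuming the __ldcg() cache-hint intrinsic available
// in recent CUDA toolkits (maps to ld.global.cg):
__global__ void l2_only_copy(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __ldcg(&in[i]);  // load that bypasses L1
}
```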

In Nsight Compute, the meaning of the term requests differs between 6.x and 7.x.

  • For 5.x-6.x, the number of requests per instruction varied by the type of operation and the width of the data. For example, a 32-bit load is 8 threads/request, a 64-bit load is 4 threads/request, and a 128-bit load is 2 threads/request, so a full warp generates 4, 8, or 16 requests respectively.
  • For 7.x requests should be equivalent to instructions unless access pattern has address divergence that causes serialization.
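
As a concrete example (my sketch, not from the original answer): a warp-wide 128-bit load such as the `float4` copy below would be counted as 16 requests on 6.x but as a single request on 7.x:

```cuda
// Each thread loads one float4 (128 bits). Per warp this is 512 bytes,
// i.e. four 128-byte cache lines / sixteen 32-byte sectors.
// CC 6.x: counted as 16 requests (2 threads/request for 128-bit loads).
// CC 7.x: counted as 1 request (per-instruction accounting).
__global__ void vec4_copy(const float4* in, float4* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}
```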

Answering your CC 7.5 Questions

  1. The third assumption seems to imply that L2->TEX Returns should always be a multiple of four for global cached loads, but that's not always the case. What is happening here?

The L1TEX unit will only fetch the missed 32B sectors in a cache line.
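
For example (a sketch of mine, not part of the original answer), a warp whose threads read single floats 128 bytes apart misses in 32 different lines but touches only one 32-byte sector of each, so L2->TEX Returns comes out near 1 sector per request rather than 4:

```cuda
// Stride of 32 floats = 128 bytes: each thread's load lands in its own
// 128-byte line but touches only one 32-byte sector of it, so on a miss
// L1TEX fetches 1 sector per line from L2, not 4.
// (out is assumed to hold n/32 elements.)
__global__ void strided_read(const float* in, float* out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int i = tid * 32;          // 128-byte stride between threads
    if (i < n)
        out[tid] = in[i];
}
```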

  2. Is there still a point in marking pointers with const and __restrict__ qualifiers? That used to be a hint to the compiler that the data is read-only and therefore can be cached in L1/texture cache, but now all data is cached there, both read-only and not read-only.

The compiler can perform additional optimizations if the data is known to be read-only.
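
For instance (a sketch; the kernel and names are mine), qualifying the inputs lets the compiler prove they do not alias the output, so it can hoist or reorder loads and route them through the read-only data path:

```cuda
// const + __restrict__ tell the compiler that x and y are read-only and
// do not alias out, enabling load reordering/hoisting and LDG-style
// read-only loads even though the hardware caches all loads in L1 anyway.
__global__ void saxpy(float a,
                      const float* __restrict__ x,
                      const float* __restrict__ y,
                      float* __restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = a * x[i] + y[i];
}
```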

  3. From my fourth assumption, I would think that whenever TEX->SM Returns is greater than L2->TEX Returns, the difference comes from cache hits. That's because when there's a cache hit, you get some sectors read from L1, but none from L2. Is this true?

L1TEX to SM return B/W is 128B/cycle. L2 to L1TEX return B/W is counted in 32B sectors.

The Nsight Compute Memory Workload Analysis | L1/TEX Cache table shows

  • Sector Misses to L2 (32B sectors)
  • Returns to SM (in cycles; each return carries 1-128 bytes)
Greg Smith
  • May I ask where you found the first information you mentioned? The last reference to that I found in the programming guide is for Compute Capability 3.x, and it says data moves in 128-byte memory transactions between caches. – rm95 Aug 20 '20 at 18:35
  • I'm still a little confused, though. I've always assumed the unit of transfer is cache lines for caches in general. A cache hit, as far as I understand, means that the line you are trying to read is cached. But if transfers between L1 and L2 happen in sectors, it's conceivable that a line might have missing sectors in it. So either a cache hit in L1 can still require fetching data from L2 (missing sector in the line), or a cache hit is defined in terms of sectors and not cache lines. Which is it? – rm95 Aug 20 '20 at 18:36
  • Regarding the last part, in my report TEX->SM Returns is greater than the amount of Elapsed Cycles. Does that make sense? I'd expect it to be equal or lower, since, if I understood you correctly, it's measuring the amount of cycles in which a 1-128 byte chunk of data was transferred. – rm95 Aug 20 '20 at 18:37
  • The GPU has a specialized cache. Organizing the cache as four 32B sectors per cache line reduces the cost of tag lookup. Not forcing promotion (grabbing all 128B) is much more efficient in most cases. In CC 3.x you could do cached or uncached accesses (which the profiler clearly showed). Cached loads would promote to 128B. Uncached loads would only read the sectors accessed by the instruction. – Greg Smith Aug 20 '20 at 18:39
  • The maximum throughput of TEX to SM returns is 1 per sm__cycle_active. If it is greater than that, there is some type of inconsistency between replays that may be causing an issue. I would recommend you post the report on the devtalk Nsight Compute forum. – Greg Smith Aug 20 '20 at 18:41