What is the difference between dram_read_transactions and gld_transactions in CUDA profiler?

Question

In the CUDA profiler, there are two metrics called dram_read_transactions and gld_transactions. The CUDA profiler user guide says that "gld_transactions" is the number of global memory load transactions, while "dram_read_transactions" is the number of device memory read transactions. I cannot tell these descriptions apart, because reading data means loading data and global memory is dram. Yet the profiling results for the two metrics differ. I tested with one kernel. For the same kernel with different thread settings, gld_transactions is always the same value, 33554432, and this value is stable. For dram_read_transactions, however, two different thread settings lead to different values, roughly 4486636 and 4197096. By "roughly" I mean that these values are not stable: they change slightly from one execution to another. We can also see that dram_read_transactions is much smaller than gld_transactions. So my questions can be summarized as follows:

1. What is the real difference between gld_transactions and dram_read_transactions?
2. Why is dram_read_transactions much smaller than gld_transactions?
3. For different thread settings, why is the gld_transactions value stable while dram_read_transactions is unstable?

I think once we know the answer for question (1), then questions (2) and (3) can be easily explained. So can anyone explain this? Thanks in advance.

gld_transactions is the number of 128-byte L1 transactions (this includes uncached loads). dram_read_transactions is the number of 32-byte transactions to the actual device memory. If you have compute capability (CC) 2.0 or above, there is a two-level cache hierarchy between the SM and DRAM. gld_transactions is more stable because it can be collected across the entire chip. dram_read_transactions can only be collected from 1/2 of the interfaces in a pass, and there is variability due to the contents of L1 and L2. — Greg Smith, Dec 10 '14 at 03:59
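Because the two counters use different units, it can help to convert both to bytes before comparing them. The following is a minimal sketch that takes the transaction sizes from the comment above at face value (128 bytes per gld transaction, 32 bytes per DRAM read transaction) and applies them to the counts reported in the question. The helper name `transactions_to_bytes` is made up for illustration and is not part of any profiler API.

```python
# Transaction sizes as stated in the comment above (assumed, not measured here).
GLD_TRANSACTION_BYTES = 128
DRAM_TRANSACTION_BYTES = 32


def transactions_to_bytes(count, bytes_per_transaction):
    """Convert a transaction count into a byte count (hypothetical helper)."""
    return count * bytes_per_transaction


# Counts reported in the question.
gld_bytes = transactions_to_bytes(33554432, GLD_TRANSACTION_BYTES)
dram_bytes_a = transactions_to_bytes(4486636, DRAM_TRANSACTION_BYTES)
dram_bytes_b = transactions_to_bytes(4197096, DRAM_TRANSACTION_BYTES)

print("bytes requested through gld transactions:", gld_bytes)
print("bytes read from DRAM (setting A):", dram_bytes_a)
print("bytes read from DRAM (setting B):", dram_bytes_b)
print("ratio gld bytes / DRAM bytes (setting A):", gld_bytes / dram_bytes_a)
```

Under these assumptions the global loads account for 4 GiB of requested traffic, while DRAM supplied only about 137 MiB in the first configuration, so the gap in bytes is even larger than the gap in raw transaction counts.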

score 11 · Accepted Answer · edited Jun 20 '20 at 09:12

A global load refers to a logical memory space. A dram read refers to a transaction on a physical resource. This statement of yours:

reading data means loading data and global memory is dram.

is either incorrect or glossing over important details.

Fundamentally, global loads are issued by instructions executed by a warp. The initial target of these loads will be L1 or L2 cache (usually). A global load, if satisfied by cache contents, will never become a dram read transaction. On the other hand, if the target of the global load is not in a cache, then it will become a dram read transaction (typically/usually).
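The effect of a cache sitting between the load instructions and DRAM can be illustrated with a toy model. This is only a sketch under strong simplifying assumptions: a single LRU cache, one counter incremented per load request, and one counter incremented per miss. The class `ToyCache` and its fields are hypothetical, and the model ignores the real L1/L2 organization, the differing transaction sizes, and everything else specific to actual GPU hardware.

```python
from collections import OrderedDict


class ToyCache:
    """Toy LRU cache that counts load requests and the misses that reach DRAM."""

    def __init__(self, capacity_lines):
        self.capacity = capacity_lines
        self.lines = OrderedDict()
        self.load_transactions = 0  # loosely analogous to gld_transactions
        self.dram_reads = 0         # loosely analogous to dram_read_transactions

    def load(self, line_address):
        # Every load is counted, whether or not the cache can satisfy it.
        self.load_transactions += 1
        if line_address in self.lines:
            # Hit: served from the cache, no DRAM read.
            self.lines.move_to_end(line_address)
            return
        # Miss: the line has to be fetched from DRAM.
        self.dram_reads += 1
        self.lines[line_address] = True
        if len(self.lines) > self.capacity:
            # Evict the least recently used line.
            self.lines.popitem(last=False)


def run(capacity_lines):
    """Issue the same 32 loads (8 sweeps over 4 lines) against a fresh cache."""
    cache = ToyCache(capacity_lines)
    for _ in range(8):
        for line in range(4):
            cache.load(line)
    return cache


large = run(capacity_lines=4)  # working set fits in the cache
small = run(capacity_lines=2)  # working set thrashes the cache

print(large.load_transactions, large.dram_reads)
print(small.load_transactions, small.dram_reads)
```

In both runs the load counter is identical because the same loads are issued, whereas the DRAM counter depends entirely on what the cache happens to hold. That matches the pattern in the question: a load count fixed by the code, and a DRAM count that shifts with cache behavior.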

Furthermore, the global memory space is not the only memory space. There are other memory spaces, such as local. Transactions to "local" memory can also ultimately be serviced in a variety of ways, one of which would be actually triggering a dram read. Such a transaction would not show up in any "global" metric but would show up in the dram read transaction metric.

I find this diagram/chart from the Nsight VSE documentation (and tool help), showing the logical and physical arrangement of memory on a GPU, helpful in understanding this. I have excerpted the chart here, and highlighted in red the "links" that correspond to the metrics you identified:

[Figure: GPU logical/physical memory diagram showing two different transaction types]

This answer gives a more detailed decoding of the above diagram, for relevant metrics.
