
In GPUs, transactions to the L2 cache can be 32B, 64B or 128B in size (both reads and writes), and the total number of such transactions can be measured using nvprof metrics like gst_transactions and gld_transactions. However, I am unable to find any material that details how these transactions are mapped to DRAM accesses, i.e. how they are handled by the DRAM, which usually has a different bus width. For example, the TitanXp GPU has a 384-bit global memory bus and the P100 has a 3072-bit memory bus. So how are the 32B, 64B or 128B transactions mapped onto these memory buses, and how can I measure the number of transactions generated by the DRAM controller?

PS: The dram_read_transactions metric does not seem to do this. I say that because I get the same value for dram_read_transactions on both the TitanXp and the P100 (even during sequential access), despite the two having widely different bus widths.

Johns Paul

1 Answer


Although GPU DRAM may have different (hardware) bus widths across different GPU types, the bus is always composed of a set of partitions, each of which has an effective width of 32 bytes. A DRAM transaction, from the profiler's perspective, actually consists of one of these 32-byte transactions, not a transaction at the full bus width.

Therefore a (single) 32-byte transaction to the L2, if it misses in the L2, will convert to a single 32-byte DRAM transaction. Transactions of higher granularity, such as 64-byte or 128-byte, will convert into the requisite number of 32-byte DRAM transactions. This is discoverable using any of the CUDA profilers.
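For illustration, here is a minimal sketch of the kind of experiment that makes this visible (the kernel, sizes and file name are arbitrary examples, not code from the question):

```cuda
#include <cuda_runtime.h>

// Illustrative only: a streaming copy where each warp reads and writes
// 128 contiguous bytes (32 floats), so each warp-level access should be
// served as four 32-byte sectors at the L2 / DRAM level.
__global__ void copy(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];   // fully coalesced 4-byte access per thread
}

int main()
{
    const int n = 1 << 24;       // 16M floats = 64 MiB per buffer (arbitrary)
    float *in, *out;
    cudaMalloc(&in,  n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    copy<<<(n + 255) / 256, 256>>>(in, out, n);
    cudaDeviceSynchronize();
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

Profiling this with something like `nvprof --metrics gld_transactions,dram_read_transactions ./copy_test` on both cards should report roughly n*4/32 32-byte DRAM read transactions for the streaming read (modulo whatever the L2 absorbs), regardless of the physical bus width of the device.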

These related questions here and here may be of interest as well.

Note that an "effective width" of 32 bytes, as used above, does not necessarily mean that a transaction requires a 32 bytes × 8 bits/byte = 256-bit-wide interface. DRAM buses can be "double-pumped" or "quad-pumped", which means a transaction may consist of multiple bits transferred per "wire" of the interface. Therefore you will find GPUs that have only a 128-bit-wide (or even 64-bit-wide) interface to GPU DRAM, but a "transaction" on these buses will still consist of 32 bytes, which requires multiple bits to be transferred (probably over multiple DRAM bus clock cycles) per "wire" of the interface.
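As a back-of-the-envelope illustration of that arithmetic (the interface widths below are arbitrary examples, not values reported by any API or profiler):

```cuda
#include <cstdio>

int main()
{
    // One 32-byte DRAM transaction expressed as bits on the wire.
    const int transaction_bits = 32 * 8;   // = 256 bits

    // Example interface/partition widths in bits (assumed for illustration).
    const int widths[] = { 32, 64, 128, 256 };

    for (int w : widths) {
        // Number of "beats" (transfers per wire) needed to move one
        // 32-byte transaction across an interface of this width.
        int beats = transaction_bits / w;
        printf("%3d-bit interface: %d beat(s) per 32-byte transaction\n", w, beats);
    }
    return 0;
}
```

The point is simply that a narrower interface trades width for more beats (double/quad pumping and burst transfers), so the 32-byte transaction size is independent of the physical wire count.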

Robert Crovella
  • Thanks a lot for your quick and detailed reply. I just have one more doubt. When you say it's partitions of 32 bytes, does that mean these partition addresses have to be sequential? E.g. when accessing bytes 0-32 and bytes 2048 to 2080, can this be done in the same cycle (assuming the DRAM has at least a 64-byte-wide memory bus)? The reason I ask is that the P100 has a much wider bus at a lower clock than the TitanXp (with the same max global memory bandwidth), and the more random the global memory accesses are, the better the TitanXp's performance is relative to the P100. – Johns Paul May 23 '18 at 10:47
  • That will depend on the specific GPU. It can't be answered in the general case, and as far as I know, such detailed information about the mapping of sequential bytes in the global logical address space to DRAM segments or transactions is unpublished. Your observation probably is insightful in the comparison between Titan Xp and P100 in this regard, but I could not provide a categorical answer. – Robert Crovella May 23 '18 at 10:51
  • Oh. I was trying to collect some data to support my observation and to make sure I wasn't missing any other factors. I guess I will have to figure out something else. Anyway, thanks again for your help. :) – Johns Paul May 23 '18 at 11:05
  • @JohnsPaul: I don't think the wide-bus/lower-clock should have an effect: If it's the same number of transactions per second (i.e. same bandwidth), the details shouldn't be significant. Of course it's not actually the same bandwidth between these two cards. – einpoklum May 23 '18 at 11:08
  • The wide bus on HBM2 systems will certainly have an effect for certain "scattered" access patterns, and OP's observation is in line with that. However the exact pattern excitation which will produce better or worse efficiency is unpublished AFAIK. It's a safe assumption that contiguous access will not be adversely impacted by the wide bus, and will in fact improve in light of the overall improved bandwidth available on HBM2 systems as compared to GDDR5/5x/6 systems. – Robert Crovella May 23 '18 at 11:14
  • @einpoklum: I am talking about the TitanXp and the P100 PCIe version with 12GB memory. They both have close to 540 GB/s max global memory bandwidth (according to the specs page). – Johns Paul May 23 '18 at 11:15
  • To be clear, CUDA HBM2 systems at 4096 bit width have improved bandwidth over any currently available CUDA GDDR5/5x/6 systems, AFAIK. The particular case of the P100 PCIE with 12GB of memory only has a 3072 bit width, and as a result is about the same bandwidth as the fastest GDDR systems, e.g. the Titan Xp mentioned. – Robert Crovella May 23 '18 at 11:17
  • @Robert Crovella: That's what I'm observing in my tests. But since these two devices have small differences in other hardware components as well (like the number of CUDA cores and their operating frequency), I wanted some metric to make sure that the impact is in fact due to the memory hardware. I hope this information will be made available in the future; it would be useful for optimizing applications. For example, I am repeatedly partitioning an array, and knowing how the memory hardware handles random access could help me tune the fan-out at each stage. – Johns Paul May 23 '18 at 11:25
  • Well, yes, conceded, for bus widths exceeding 32 bytes (256 bits), the abstraction of the 32-byte transaction sort of breaks if you can't perform multiple transactions with non-contiguous addresses in the same physical transfer of bus-width bits. – einpoklum May 23 '18 at 11:29
  • AFAIK typical matrix subdivision should not result in reduced efficiency on a HBM2 interface. Typical matrix subdivision will still produce a set of adjacent bytes, and if you can make sure that your subdivision does not drop below 128 consecutive bytes (i.e. one warp of `int` or `float` quantities), you should not run into any reduced efficiency. – Robert Crovella May 23 '18 at 11:33
  • I am doing hash-based partitioning of a 1D array (repeatedly), so the higher the number of partitions, the lower the chance of coalesced writes (a sketch of this access pattern appears after these comments). And even if the access size drops below the 128-byte limit (which does not happen in most test cases), the rate of performance drop should be similar on both hardware, right? Basically, for me the performance drop on the P100 is much greater than on the TitanXp. – Johns Paul May 23 '18 at 11:52
  • As I've mentioned already, if you subdivide access granularity enough, I would expect a difference between the two hardware articles you mention. I would expect that with 128-byte-aligned granularity or above, there wouldn't be a difference in performance based on the subdivision. – Robert Crovella May 23 '18 at 14:43
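To make the access-pattern discussion in the comments concrete, here is a minimal sketch of the two write patterns being compared (the partition count and the modulo "hash" are placeholders, not the asker's actual code):

```cuda
#include <cuda_runtime.h>

// Coalesced: each warp writes 32 consecutive floats = 128 contiguous bytes.
__global__ void write_coalesced(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 1.0f;
}

// Scattered: a toy "hash partitioning" write. Neighbouring threads land in
// different partitions, so the writes of one warp span many 32-byte sectors.
__global__ void write_scattered(float *out, int n, int num_partitions)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int part = i % num_partitions;               // toy hash function
        int slot = i / num_partitions;
        out[part * (n / num_partitions) + slot] = 1.0f;
    }
}

int main()
{
    const int n = 1 << 24;                           // divisible by num_partitions
    float *out;
    cudaMalloc(&out, n * sizeof(float));
    write_coalesced<<<(n + 255) / 256, 256>>>(out, n);
    write_scattered<<<(n + 255) / 256, 256>>>(out, n, 1024);
    cudaDeviceSynchronize();
    cudaFree(out);
    return 0;
}
```

Collecting `gst_transactions` and `dram_write_transactions` for each kernel on both the TitanXp and the P100 (e.g. `nvprof --metrics gst_transactions,dram_write_transactions ./partition_test`) is one way to check whether the extra slowdown on the wider-bus part shows up as additional DRAM traffic or only as lower achieved bandwidth for the same number of 32-byte transactions.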