On GPUs, transactions to the L2 cache can be 32B, 64B, or 128B in size (for both reads and writes), and the total number of such transactions can be measured with nvprof metrics such as gst_transactions and gld_transactions. However, I am unable to find any material that details how these transactions are mapped to DRAM accesses, i.e., how they are handled by the DRAM, which usually has a different bus width. For example, the Titan Xp has a 384-bit global memory bus and the P100 has a 3072-bit memory bus. So how are the 32B, 64B, or 128B transactions mapped onto these memory buses? And how can I measure the number of transactions generated by the DRAM controller?
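To make the mismatch concrete, here is the arithmetic I have in mind as a small Python sketch. The bus widths and transaction sizes are the ones quoted above. Treating one full bus width as a single unit of transfer is my own simplifying assumption; it ignores burst length and how the bus is split into channels, which is part of what I am unsure about.

```python
# Bus widths in bits, as quoted in the question.
BUS_WIDTH_BITS = {"Titan Xp": 384, "P100": 3072}

# L2 transaction sizes in bytes, as quoted in the question.
L2_TRANSACTION_BYTES = (32, 64, 128)


def bus_width_bytes(bits):
    """Convert a bus width in bits to bytes."""
    return bits // 8


def bus_widths_per_transaction(transaction_bytes, bits):
    """How many full bus widths one L2 transaction spans.

    Assumption: one bus width is one unit of transfer. Real DRAM
    bursts and channel partitioning are not modelled here.
    """
    return transaction_bytes / bus_width_bytes(bits)


if __name__ == "__main__":
    for gpu, bits in BUS_WIDTH_BITS.items():
        width = bus_width_bytes(bits)
        for size in L2_TRANSACTION_BYTES:
            ratio = bus_widths_per_transaction(size, bits)
            print(f"{gpu}: bus = {width}B, {size}B transaction = {ratio:.2f} bus widths")
```

Under that assumption a 32B transaction is smaller than one bus width on both GPUs (48B on the Titan Xp, 384B on the P100), so a one-to-one mapping between L2 transactions and full-width bus transfers seems unlikely on either.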
PS: The dram_read_transactions metric does not seem to do this. I say that because I get the same value for dram_read_transactions on the Titan Xp and the P100 (even for sequential access), in spite of the two having widely different bus widths.