How is it possible to have a Global Memory Cache Replay over 100%?

Question

I was benchmarking a CUDA kernel, and the profiler reported a Global Memory Cache Replay of 216.9%.

This doesn't quite make sense to me. The only way I can see cache misses exceeding 100% is if accesses are missing on multiple cache levels, but that doesn't seem to be the case here.

Any insight as to why this is the case?

I experienced something similar. See [this post](http://stackoverflow.com/q/19650777/2386951). They probably have the same origin. — Farzad, Jan 12 '14 at 20:54
It looks like this is the case. Could you post this as an answer so that I can accept it? — PseudoPsyche, Jan 13 '14 at 18:15

score 2 · Accepted Answer · edited May 23 '17 at 11:57

A similar issue happened to me: I was getting a Global Load Efficiency over 100%. [Here is the link to it](http://stackoverflow.com/q/19650777/2386951). Since I think both of these phenomena have the same origin, I quote the answer I got:

Global Load Efficiency and Global Store Efficiency describe how well the coalescing of DRAM accesses and (L2?) cache accesses works. If they are 100 percent, you have perfect coalescing. Since efficiencies above 100 percent don't make any sense (you cannot be better than optimal), this has to be an error.

The error is caused by the Visual Profiler, which counts hardware events to calculate some abstract metrics. The GPU doesn't have the "correct" events to calculate all of those metrics exactly, so the Visual Profiler has to estimate them using some complex formula and "wrong" events. Some metrics are just rough estimations, and Global Load Efficiency and Global Store Efficiency are two of them. So if such an efficiency is bigger than 100 percent, it is an estimation error.

As far as I observed, Global Load Efficiency and Global Store Efficiency both rose above 100 percent in some of my register-spilling kernels. That's why I assume that the Visual Profiler calculates those two efficiencies from events that may also be triggered by local memory accesses. Furthermore, GPUs use only 32-bit counters, so long-running kernels tend to overflow them, which also causes the Visual Profiler to display wrong metrics.
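To make the counter-overflow point concrete, here is a minimal C++ sketch of how a ratio derived from a wrapped 32-bit counter can exceed 100 percent. It is only an illustration under assumptions: the function names `as_32bit_counter` and `efficiency_percent` are hypothetical, and the simple requested/transferred ratio stands in for whatever formula the Visual Profiler really uses, which the quoted answer does not specify.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of a 32-bit hardware event counter: only the low
// 32 bits of the true event count survive, so the value wraps at 2^32.
std::uint32_t as_32bit_counter(std::uint64_t true_count) {
    return static_cast<std::uint32_t>(true_count);
}

// Hypothetical derived metric (NOT the profiler's real formula):
// units requested by the kernel divided by units actually transferred,
// expressed as a percentage.
double efficiency_percent(std::uint64_t requested, std::uint64_t transferred) {
    assert(transferred != 0 && "transferred count must be non-zero");
    return 100.0 * static_cast<double>(requested)
                 / static_cast<double>(transferred);
}
```

For example, with a true requested count of 3,000,000,000 and a true transferred count of 4,500,000,000, the real ratio is about 67 percent. If only the transferred count passes through a 32-bit counter, it wraps to 205,032,704, and `efficiency_percent(3000000000, as_32bit_counter(4500000000))` comes out far above 100 percent even though nothing about the kernel changed.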
