
The following global barrier works on a Kepler K10 but not on a Fermi GTX580:

__global__ void cudaKernel (float* ref1, float* ref2, int* lock, int time, int dim) {
  int gid  = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
  int lid  = threadIdx.x;                            // thread index within the block
  int numT = blockDim.x * gridDim.x;                 // total number of threads
  int numP = dim / numT;                             // elements handled per thread
  int numB = gridDim.x;                              // number of blocks

  for (int t = 0; t < time; ++t) {
    // compute @ time t
    for (int i = 0; i < numP; ++i) {
      int idx = gid + i * numT;
      if (idx > 0 && idx < dim - 1)
        ref2[idx] = 0.333f * (ref1[idx - 1] + ref1[idx] + ref1[idx + 1]);
    }

    // global sync: thread 0 of each block decrements the counter,
    // then spins until every block has arrived (counter reaches 0)
    if (lid == 0) {
      atomicSub(lock, 1);
      while (atomicCAS(lock, 0, 0) != 0);
    }
    __syncthreads();

    // copy-back @ time t
    for (int i = 0; i < numP; ++i) {
      int idx = gid + i * numT;
      if (idx > 0 && idx < dim - 1)
        ref1[idx] = ref2[idx];
    }

    // global sync: thread 0 of each block increments the counter,
    // then spins until it is back to the number of blocks (numB)
    if (lid == 0) {
      atomicAdd(lock, 1);
      while (atomicCAS(lock, numB, numB) != numB);
    }
    __syncthreads();
  }
}

By looking at the output copied back to the CPU, I noticed that one thread (either the first or the last thread) escapes the barrier and resumes execution earlier than the others. I'm using CUDA 5.0, and in my set of runs the number of blocks is always smaller than the number of SMs.
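
For completeness, here is roughly how I set things up on the host side (the sizes below are placeholders rather than my exact configuration); the one detail that matters for the barrier is that lock starts out equal to the number of blocks, so the decrement-to-zero and increment-back-to-numB phases balance:

#include <cuda_runtime.h>
#include <vector>

// Host-side launch sketch (illustrative only; sizes are placeholders).
int main() {
  const int dim       = 1 << 20;   // total number of data elements
  const int timeSteps = 100;       // number of time steps
  const int threads   = 256;       // threads per block
  const int blocks    = 4;         // kept smaller than the number of SMs

  std::vector<float> h_data(dim, 1.0f);

  float *d_ref1, *d_ref2;
  int   *d_lock;
  cudaMalloc(&d_ref1, dim * sizeof(float));
  cudaMalloc(&d_ref2, dim * sizeof(float));
  cudaMalloc(&d_lock, sizeof(int));

  cudaMemcpy(d_ref1, h_data.data(), dim * sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(d_ref2, h_data.data(), dim * sizeof(float), cudaMemcpyHostToDevice);

  int lockInit = blocks;           // barrier counter starts at the number of blocks
  cudaMemcpy(d_lock, &lockInit, sizeof(int), cudaMemcpyHostToDevice);

  cudaKernel<<<blocks, threads>>>(d_ref1, d_ref2, d_lock, timeSteps, dim);
  cudaDeviceSynchronize();

  cudaMemcpy(h_data.data(), d_ref1, dim * sizeof(float), cudaMemcpyDeviceToHost);

  cudaFree(d_ref1);
  cudaFree(d_ref2);
  cudaFree(d_lock);
  return 0;
}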

Any idea why the same code works on one of these architectures but not the other? What's new in Kepler that helps this global synchronization?

Naseri00n
  • It would be good if you wrapped more substance around your assertion that this doesn't work. Providing a specific complete example, along with actual and expected output on the two devices you tested, would be helpful. Based on what I see, since the accesses generated by the atomicCAS instructions will be serialized amongst threadblocks, I fully expect the threadblocks to serially exit the "barrier". So I would expect one threadblock to resume execution earlier than the others, in any case. Thus my interest in what you are defining as the difference and "pass" or "fail". – Robert Crovella Jan 09 '13 at 19:15
  • Kepler GK110 global atomic operations are [significantly faster than Fermi](http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf). In the case of atomics issued from separate SMs, they may be effectively back-to-back, and complete at the core clock rate of Kepler GK110 (K20). My point is that this would be much faster than the Fermi serialization/completion. However this doesn't affect the fact that exiting from the barrier will still be a serial operation amongst threadblocks, according to your code. And K10 is not GK110. – Robert Crovella Jan 09 '13 at 19:24
  • So the code I've written gives every CUDA thread an index (e.g. index i, which is mapped to a 1D array A[i]) and requires each thread to grab 3 neighboring data elements (A[i-1], A[i], A[i+1]) from global memory and compute the average. Later on, after each thread stops at the "global" barrier, it goes ahead and updates the original data array. This whole thing repeats for a number of time steps, so each thread does the following loop: for (t=0 to T) {compute; sync; copy; sync} – Naseri00n Jan 09 '13 at 19:49
  • @Naseri00n Since the kernel is performing global memory accesses, you should use a global memory fence in addition to the global barrier. You also may need to qualify pointers as volatile or compile with L1 caches disabled. Please provide a complete example as Robert suggested. – Heatsink Jan 09 '13 at 21:10
  • I updated the original post with the CUDA kernel. It's a 1-D 3-point Jacobi-style code. The input array is "ref1", "ref2" is the temporary array used for copy-back, "lock" is the global mutex, "time" is the number of time steps and "dim" is the total number of data elements. I had already tested the memory fence __threadfence(), but it didn't help. However, the "volatile" trick was helpful. The problem seems to be how some of the data elements are being cached in L1; by using the "volatile" keyword, I make sure those elements (the ref1 and ref2 arrays) are not going to be cached in L1. – Naseri00n Jan 09 '13 at 22:31
  • So now it works on Fermi too. But I'm still not clear on why it works on Kepler anyway. Does Kepler support some type of "cache coherence"? – Naseri00n Jan 09 '13 at 22:31

1 Answer

So I suspect the barrier code itself is probably working the same way on both architectures. What seems to be at issue is what happens to the other data structures, the ones not associated with the barrier functionality itself.

Neither Kepler nor Fermi has L1 caches that are coherent across SMs. What you have discovered (although it's not associated with your barrier code itself) is that the L1 cache behavior differs between Kepler and Fermi.

In particular, the Kepler L1 cache is not in play on global loads, so the caching of global data is handled at the L2 level, which is device-wide and therefore coherent. When a Kepler SMX reads its global data, it is getting coherent values from L2.

On the other hand, Fermi's L1 caches do participate in global loads by default (although this behavior can be turned off at compile time), and each L1 cache is unique to its Fermi SM and non-coherent with the L1 caches in other SMs. When a Fermi SM reads its global data, it may be getting values from its own L1 that are non-coherent with the L1 caches in other SMs.

This is the difference in "coherency" that you are seeing in the data you are manipulating before and after the barrier.

As I mentioned, I believe the barrier code itself is probably working the same way on both devices.
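
As an aside, here is a sketch of the kind of change discussed in the comments: qualifying the data pointers as volatile so that global loads and stores are not serviced from the non-coherent per-SM L1 on Fermi. I haven't run this exact code, so treat it as illustrative; the barrier logic itself is unchanged. (A compile-time alternative on Fermi is to disable L1 caching of global loads with -Xptxas -dlcm=cg.)

// Sketch only: same kernel as in the question, with ref1/ref2 qualified as
// volatile so global loads/stores bypass the per-SM L1 on Fermi.
// The lock pointer needs no change, since the atomics already operate on
// a device-wide view of memory.
__global__ void cudaKernelVolatile (volatile float* ref1, volatile float* ref2,
                                    int* lock, int time, int dim) {
  int gid  = blockIdx.x * blockDim.x + threadIdx.x;
  int lid  = threadIdx.x;
  int numT = blockDim.x * gridDim.x;
  int numP = dim / numT;
  int numB = gridDim.x;

  for (int t = 0; t < time; ++t) {
    // compute @ time t (volatile reads of ref1, volatile write of ref2)
    for (int i = 0; i < numP; ++i) {
      int idx = gid + i * numT;
      if (idx > 0 && idx < dim - 1)
        ref2[idx] = 0.333f * (ref1[idx - 1] + ref1[idx] + ref1[idx + 1]);
    }

    // global sync (unchanged)
    if (lid == 0) {
      atomicSub(lock, 1);
      while (atomicCAS(lock, 0, 0) != 0);
    }
    __syncthreads();

    // copy-back @ time t
    for (int i = 0; i < numP; ++i) {
      int idx = gid + i * numT;
      if (idx > 0 && idx < dim - 1)
        ref1[idx] = ref2[idx];
    }

    // global sync (unchanged)
    if (lid == 0) {
      atomicAdd(lock, 1);
      while (atomicCAS(lock, numB, numB) != numB);
    }
    __syncthreads();
  }
}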

Robert Crovella
  • I am still not quite convinced about the barrier functionality. One strange thing about this code is that when I wasn't using "volatile", and for t > 1 (meaning, re-using already-cached data), only "one" thread in each thread block (either the first or the last thread) seemed not to be getting the "most-updated" value, and thus messed up the results. Even though it's already resolved with the "volatile" solution, it's still a mystery why only one thread. This was the reason I first started investigating the global barrier. – Naseri00n Jan 10 '13 at 02:04
  • You said that `volatile`, when added to ref1 and ref2, fixes the code. Is that correct? If so, I see no logical connection between that and the barrier code. If you're convinced the barrier code is broken, perhaps you can create a simple proof case which doesn't depend on the other data structures that are confusing things in this example. – Robert Crovella Jan 10 '13 at 02:20