
It is my understanding (see e.g. How can I enforce CUDA global memory coherence without declaring pointer as volatile?, CUDA block synchronization differences between GTS 250 and Fermi devices and this post in the NVIDIA Developer Zone) that __threadfence() guarantees that global writes will be visible to other threads before the thread continues. However, another thread could still read a stale value from its L1 cache even after the __threadfence() has returned.

That is:

Thread A writes some data to global memory, then calls __threadfence(). Then, at some time after __threadfence() has returned and the writes are visible to all other threads, Thread B is asked to read from this memory location. It finds the data in its L1 cache, so loads that. Unfortunately for the developer, the data in Thread B's L1 is stale (i.e. it is as it was before Thread A updated it).

First of all: is this correct?

Supposing it is, then it seems to me that __threadfence() is only useful if one can either be certain that the data will not be in L1 (somewhat unlikely?) or arrange for the read to always bypass L1 (e.g. via volatile or atomics). Is this correct?


I ask because I have a relatively simple use-case - propagating data up a binary tree - using atomically-set flags and __threadfence(): the first thread to reach a node exits, and the second writes data to it based on its two children (e.g. the minimum of their data). This works for most nodes, but usually fails for at least one. Declaring the data volatile gives consistently correct results, but induces a performance hit for the 99%+ of cases where no stale value is grabbed from L1. I want to be sure this is the only solution for this algorithm. A simplified example is given below. Note that the node array is ordered breadth-first, with the leaves beginning at index start and already populated with data.

__global__ void propagate_data(volatile Node *nodes,
                               const unsigned int n_nodes,
                               const unsigned int start,
                               unsigned int* flags)
{
    int tid, index, left, right;
    float data;
    bool first_arrival;

    tid = start + threadIdx.x + blockIdx.x*blockDim.x;

    while (tid < n_nodes)
    {
        // We start at a node with a full data section; modify its flag
        // accordingly.
        flags[tid] = 2;

        // Immediately move up the tree.
        index = nodes[tid].parent;
        first_arrival = (atomicAdd(&flags[index], 1) == 0);

        // If we are the second thread to reach this node then process it.
        while (!first_arrival)
        {
            left = nodes[index].left;
            right = nodes[index].right;

            // If Node* nodes is not declared volatile, this occasionally
            // reads a stale value from L1.
            data = min(nodes[left].data, nodes[right].data);

            nodes[index].data = data;

            if (index == 0) {
                // Root node processed, so all nodes processed.
                return;
            }

            // Ensure above global write is visible to all device threads
            // before setting flag for the parent.
            __threadfence();

            index = nodes[index].parent;
            first_arrival = (atomicAdd(&flags[index], 1) == 0);
        }
        tid += blockDim.x*gridDim.x;
    }
    return;
}
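
For reference, here is a minimal host-side launch sketch. It is an illustration under stated assumptions, not part of the original code: the Node layout shown is hypothetical (the question does not give the struct), and the helper name launch_propagate is invented. The essential point is that flags must be zeroed before launch so that the first atomicAdd() on each internal node returns 0.

// Hypothetical host-side setup. Assumes Node has the members the kernel
// uses and that the leaves' .data fields are already populated on the
// device, as stated above.
struct Node
{
    int parent, left, right;
    float data;
};

void launch_propagate(Node *d_nodes, unsigned int n_nodes,
                      unsigned int start, unsigned int *d_flags)
{
    // Flags must start at zero so the first atomicAdd() per node returns 0.
    cudaMemset(d_flags, 0, n_nodes * sizeof(unsigned int));

    // One thread per leaf; the grid-stride loop in the kernel handles any
    // leaves beyond gridDim.x * blockDim.x.
    const unsigned int n_leaves = n_nodes - start;
    const unsigned int block_size = 256;
    const unsigned int grid_size = (n_leaves + block_size - 1) / block_size;

    propagate_data<<<grid_size, block_size>>>(d_nodes, n_nodes, start, d_flags);
    cudaDeviceSynchronize();
}
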
  • In your reasonings, the `volatile` keyword acts basically to disable the use of the cache. Now, your problem is that L1 cache is not coherent. But L2 is coherent. What about disabling the L2 cache instead of using `volatile`? – Vitality Oct 25 '13 at 21:15
  • disabling the L2 cache ?? How do you do that? – Robert Crovella Oct 26 '13 at 01:16
  • @RobertCrovella Sorry Robert, it was a misprint. I actually meant disabling L1 (not L2), which is done by `-Xptxas -dlcm=cg`. My understanding, which also seems to be implied by your answer, is that the use of `volatile`, in conjunction with `__threadfence()`, bypasses L1. So I was wondering whether disabling the L1 cache would have the same effect as using `volatile` in conjunction with `__threadfence()`. A drawback would be that `-Xptxas -dlcm=cg` disables the L1 cache for the whole execution, whereas `volatile` would be "selective". – Vitality Oct 26 '13 at 12:43
  • @JackOLantern - I will look into this. I'm assuming, based on your suggestion, that `volatile` writes will bypass L2 as well? And so performance may actually be increased by allowing reads/writes to L2, which is coherent, in spite of disabling *all* use of L1? – Sam Oct 26 '13 at 14:32
  • No, volatile doesn't bypass L2. – Robert Crovella Oct 26 '13 at 14:38
  • Ah. So disabling L1 completely would presumably be worse than selective, occasional use of `volatile`, then? – Sam Oct 26 '13 at 16:48
  • @Sam I was not suggesting that `volatile` bypasses L2. Following Robert, I think we can conclude that `volatile` _alone_ (not in conjunction with `__threadfence()` as I was writing) bypasses the L1 cache. The same effect could be _in principle_ obtained by disabling the L1 cache, but this could give rise to drawbacks in the efficiency of memory reads/writes. Alternatively, it seems that you could disable the L1 cache for only one variable, see [CUDA disable L1 cache only for one variable](http://stackoverflow.com/questions/12553086/cuda-disable-l1-cache-only-for-one-variable). – Vitality Oct 26 '13 at 20:44
  • @JackOLantern Ah, sorry, I misunderstood. Thanks for the link, though - it looks like it would have the same effect as volatile, whilst being even more fine-grained (i.e. I could make sure that *only* the memory read bypasses L1). – Sam Oct 26 '13 at 22:47
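
To make the per-load approach in the last two comments concrete, below is a sketch of bypassing L1 for a single load using inline PTX with the .cg (cache-at-L2) qualifier, along the lines of the question linked above. The helper name load_bypass_l1 is invented, and a 64-bit build is assumed (hence the "l" pointer constraint); this illustrates the technique and is not code from the question.

// Load a float from global memory, caching it in L2 only (bypassing L1).
__device__ __forceinline__ float load_bypass_l1(const float *ptr)
{
    float value;
    asm volatile("ld.global.cg.f32 %0, [%1];" : "=f"(value) : "l"(ptr));
    return value;
}

// Possible use in the kernel above, with nodes declared as a plain
// (non-volatile) Node *, so that only these reads skip L1 while all other
// accesses remain cached:
//
//     data = min(load_bypass_l1(&nodes[left].data),
//                load_bypass_l1(&nodes[right].data));
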

1 Answer


First of all: is this correct?

Yes, __threadfence() pushes data into L2 and out to global memory. It has no effect on the L1 caches in other SMs.

Is this correct?

Yes, if you combine __threadfence() with volatile for global memory accesses, you should have confidence that values will eventually be visible to other threadblocks. Note, however, that synchronization between threadblocks is not a well-defined concept in CUDA. There are no explicit mechanisms to do so and no guarantee of the order of threadblock execution, so the mere fact that your code has a __threadfence() somewhere operating on a volatile item still does not guarantee what data another threadblock will pick up. That also depends on the order of execution.

If you use volatile, the L1 (if enabled -- current Kepler devices don't really have L1 enabled for general global access) should be bypassed. If you don't use volatile, then the L1 for the SM that is currently executing the __threadfence() operation should be consistent/coherent with L2 (and global) at the completion of the __threadfence() operation.

Note that the L2 cache is unified across the device and is therefore always "coherent". For your use case, at least from the device code perspective, there is no difference between L2 and global memory, regardless of which SM you are on.

And, as you indicate, (global) atomics always operate on L2/global memory.
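
To illustrate the "selective" use of volatile discussed in the comments under the question, the volatile qualification can be applied to just the one read that must observe another block's fenced write, rather than to the whole nodes pointer. The helper below is a sketch of that idea (the name volatile_load_f32 is invented); it forces a single load to be performed with volatile semantics so it cannot be serviced from a stale L1 line, while the existing write and __threadfence() in the question's kernel stay as they are.

// Read a float through a volatile lvalue so this particular load bypasses
// L1; every other access through the original pointer stays cacheable.
__device__ __forceinline__ float volatile_load_f32(const float *addr)
{
    return *(const volatile float *)addr;
}

// In the kernel, with nodes declared as a plain Node *:
//
//     data = min(volatile_load_f32(&nodes[left].data),
//                volatile_load_f32(&nodes[right].data));
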

  • Aha! That also explains why I was seeing the issue on Fermi hardware, but not on Kepler. – Sam Oct 26 '13 at 14:21
  • I am still slightly confused by *when* to use `__threadfence()`, though. Take the example in the CUDA Programming Guide on [memory fence functions](http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#memory-fence-functions), for example. Is it not possible that threads in the last block, when doing the total sum, could read stale values of `result` from L1? E.g. they may read `result[blockIdx.x-1]` as 0? – Sam Oct 26 '13 at 14:26
  • `__threadfence()` *doesn't* bypass L1 (`volatile` *does*). The previous writes (and therefore cacheline loads) into L1 are all suffixed by a `__threadfence()` operation in that example. This therefore guarantees that the L1 cachelines that got loaded to service the previous writes to `result[blockIdx.x]` are all either consistent with L2/global or invalid. Either way, there is no stale data. I'm just restating here what I already said in my answer (and what is referred to in the comments in that example), so I may not be understanding your confusion. – Robert Crovella Oct 26 '13 at 14:57
  • Yeah, I don't think I explained it clearly. For simplicity, say the kernel runs using two blocks, with `blockIdx.x`s of 0 and 1. Suppose also that Block 0 finishes first - it writes its partial sum to `result[0]`, waits for the write to be flushed, and sets the flag. Then Block 1, finishing last, also does the above, then attempts to do `totalSum = result[0] + result[1]`. Its version of `result[1]` will be consistent with L2/global because of `__threadfence()`, but could a stale value for `result[0]` not be in L1 (and if not, why)? – Sam Oct 26 '13 at 16:59
  • I should have noted above that I am assuming Block 0 and Block 1 run on *different* SMs. Then, my understanding is that, since no threads in Block 1 will have touched `result[0]`, `__threadfence()` does not guarantee that any copy of `result[0]` in Block 1's L1 is consistent with L2/global. – Sam Oct 26 '13 at 17:03
  • L1 operations occur on a cacheline basis. The first question that needs to be answered to follow your inquiry is how did result[0] end up in the L1? Is it due to being in the same cacheline as result[1]? – Robert Crovella Oct 26 '13 at 19:37
  • Okay, so supposing that it is in the same cache line: if result[1] is consistent with global memory, does this mean result[0] must be, too? And what about if it was not in the same cache line, which seems quite likely in a real-world case with many more blocks? – Sam Oct 26 '13 at 22:54
  • If they are in the same cache line, the dirty value in result[0] will get updated when the dirty value in result[1] gets flushed -- the whole cacheline is serviced. If they are in different cachelines, then I have to ask again, how did it get there and be dirty? The code you referenced has every write to result[...] bracketed by a `__threadfence()`. If you're suggesting some other piece of code did it, then what code is that? It's not from this code, so I guess that means you're talking about some other kernel. If that other kernel is also accessing result, then no, there's no guarantees. – Robert Crovella Oct 26 '13 at 22:55
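
For readers following the comment thread above, here is a condensed sketch of the programming-guide pattern under discussion: each block publishes a partial sum, issues __threadfence() before signalling completion, and the last block to finish reads the partial sums through a volatile pointer. This paraphrases the guide's example rather than quoting it; the fixed shared-memory size and the assumption that blockDim.x is a power of two no larger than 256 are simplifications for illustration.

// Completion counter shared by all blocks.
__device__ unsigned int count = 0;

__global__ void sum(const float *array, unsigned int N, volatile float *result)
{
    __shared__ float sdata[256];       // assumes blockDim.x <= 256
    __shared__ bool is_last_block_done;

    // Grid-stride accumulation followed by a simple block reduction.
    float x = 0.0f;
    for (unsigned int i = blockIdx.x * blockDim.x + threadIdx.x; i < N;
         i += blockDim.x * gridDim.x)
        x += array[i];
    sdata[threadIdx.x] = x;
    __syncthreads();
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1)
    {
        if (threadIdx.x < s)
            sdata[threadIdx.x] += sdata[threadIdx.x + s];
        __syncthreads();
    }

    if (threadIdx.x == 0)
    {
        // Publish this block's partial sum.
        result[blockIdx.x] = sdata[0];

        // Make the write visible device-wide before announcing completion.
        __threadfence();

        // The block that sees value == gridDim.x - 1 finished last.
        unsigned int value = atomicInc(&count, gridDim.x);
        is_last_block_done = (value == (gridDim.x - 1));
    }
    __syncthreads();

    if (is_last_block_done && threadIdx.x == 0)
    {
        // result is volatile, so these reads bypass L1 and cannot return
        // stale partial sums (the concern raised in the comments above).
        float total = 0.0f;
        for (unsigned int i = 0; i < gridDim.x; ++i)
            total += result[i];
        result[0] = total;
        count = 0;                     // reset for subsequent launches
    }
}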