It is my understanding (see e.g. How can I enforce CUDA global memory coherence without declaring pointer as volatile?, CUDA block synchronization differences between GTS 250 and Fermi devices and this post in the NVIDIA Developer Zone) that __threadfence() guarantees that global writes will be visible to other threads before the thread continues. However, another thread could still read a stale value from its L1 cache even after the __threadfence() has returned.
That is:
Thread A writes some data to global memory, then calls __threadfence(). Then, at some time after __threadfence() has returned, and the writes are visible to all other threads, Thread B is asked to read from this memory location. It finds it has the data in L1, so loads that. Unfortunately for the developer, the data in Thread B's L1 is stale (i.e. it is as it was before Thread A updated it).
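To make the scenario concrete, here is a minimal sketch of what I mean (thread_A and thread_B are hypothetical device functions, and the atomic spin-wait on the flag is just for illustration, not from my actual code):

__device__ float data;
__device__ int flag = 0;

__device__ void thread_A()
{
    data = 42.0f;          // Global write...
    __threadfence();       // ...guaranteed visible before the flag is set.
    atomicExch(&flag, 1);  // Publish.
}

__device__ float thread_B()
{
    while (atomicAdd(&flag, 0) == 0) ;  // Atomics bypass L1, so this sees the flag.
    return data;  // Plain load: served from a possibly stale line in B's L1?
}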
First of all: is this correct?
Supposing it is, then it seems to me that __threadfence() is only useful if one can either be certain that the data will not be in L1 (somewhat unlikely?) or be certain that the read always bypasses L1 (e.g. via volatile or atomics). Is this correct?
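If so, one alternative I can imagine (an untested assumption on my part, using the names from the kernel below and assuming it takes a plain Node *nodes) is to force only the suspect loads past L1 by casting through a volatile-qualified pointer, rather than declaring the whole array volatile:

// Hypothetical workaround: bypass L1 only for the reads that might be
// stale, instead of making every access through nodes volatile.
float left_data  = *(volatile float *)&nodes[left].data;
float right_data = *(volatile float *)&nodes[right].data;
data = min(left_data, right_data);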
I ask because I have a relatively simple use case - propagating data up a binary tree - using atomically-set flags and __threadfence(): the first thread to reach a node exits, and the second writes data to it based on its two children (e.g. the minimum of their data). This works for most nodes, but usually fails for at least one. Declaring the data volatile gives consistently correct results, but induces a performance hit for the 99%+ of cases where no stale value is grabbed from L1. I want to be sure this is the only solution for this algorithm. A simplified example is given below. Note that the node array is ordered breadth-first, with the leaves beginning at index start and already populated with data.
__global__ void propagate_data(volatile Node *nodes,
                               const unsigned int n_nodes,
                               const unsigned int start,
                               unsigned int* flags)
{
    int tid, index, left, right;
    float data;
    bool first_arrival;

    tid = start + threadIdx.x + blockIdx.x*blockDim.x;

    while (tid < n_nodes)
    {
        // We start at a node with a full data section; modify its flag
        // accordingly.
        flags[tid] = 2;

        // Immediately move up the tree.
        index = nodes[tid].parent;
        first_arrival = (atomicAdd(&flags[index], 1) == 0);

        // If we are the second thread to reach this node then process it.
        while (!first_arrival)
        {
            left = nodes[index].left;
            right = nodes[index].right;

            // If Node* nodes is not declared volatile, this occasionally
            // reads a stale value from L1.
            data = min(nodes[left].data, nodes[right].data);

            nodes[index].data = data;

            if (index == 0) {
                // Root node processed, so all nodes processed.
                return;
            }

            // Ensure above global write is visible to all device threads
            // before setting flag for the parent.
            __threadfence();

            index = nodes[index].parent;
            first_arrival = (atomicAdd(&flags[index], 1) == 0);
        }

        tid += blockDim.x*gridDim.x;
    }
    return;
}
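For completeness, the kernel is launched with roughly one thread per leaf, along these lines (the d_ names and block size are illustrative, not my exact host code; the grid-stride loop in the kernel handles any leaves beyond the grid):

// Illustrative host-side launch covering the leaf level.
const unsigned int n_leaves = n_nodes - start;
const int block_size = 256;
const int n_blocks = (n_leaves + block_size - 1) / block_size;

// Flags must start at zero so that the first arrival at each internal
// node sees atomicAdd return 0.
cudaMemset(d_flags, 0, n_nodes * sizeof(unsigned int));
propagate_data<<<n_blocks, block_size>>>(d_nodes, n_nodes, start, d_flags);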