2

I've written a very simple code ask thread 0 to update a global variable while other threads keep reading that variable.But I found other threads don't really get the value.

Code is here, it is quite simple. Could anyone give me any suggestion how to fix it? Thanks a lot

__global__ void addKernel(int *c)
{
int i = threadIdx.x;
int j = 0;
if (i == 0)
{
    while(*c < 2000){
        int temp = *c;
        printf("*c = %d\n",*c);
        atomicCAS(c,temp, temp+1);
    }       
}else{
    while(*c < 1000)
    {
        j++;
    }
}

}

CinCout
  • 9,486
  • 12
  • 49
  • 67
neilyo
  • 41
  • 2
  • 1
    `volatile` and `atomic` are orthogonal concepts and `volatile` isn't needed when using atomic operations. What are you trying to do? – user703016 Oct 21 '14 at 14:41
  • yes, I think so. But by this code, we can prove atomic operations don't really make other threads have a real-time update for that value. – neilyo Oct 21 '14 at 14:55
  • Yes, that's right. The other threads don't know `c` can be modified from elsewhere, so they're still reading from registry/cache. – user703016 Oct 21 '14 at 15:21
  • 1
    @Cicada It's not completely clear to me what do you mean by _reading from registry/cache_. Atomic operations bypass L1 and use L2, whose visibility is device-wide. – Vitality Oct 21 '14 at 15:32
  • @JackOLantern Absolutely, but the other reads `while(*c < ...)` are not atomic loads: they are still reading through the (L1) cache (or registers, depending on compiler optimisations). In this case `volatile` is required to force a load bypassing the caches. – user703016 Oct 21 '14 at 15:37
  • @Cicada But then this appears somewhat contradictory with your original comment :) Would a __syncthreads be enough to enable visibility to threads with non vanishing index? – Vitality Oct 21 '14 at 15:59
  • @JackOLantern Ok, my initial comment might seem a bit confusing. I'll try to reword: "`volatile` isn't needed when you're consistently using atomic loads **and** stores". Here, only the store is atomic, the loads aren't. However, I don't know of any way to manually specify in CUDA that you want an atomic load (think `std::atomic<>::load()`). So the `volatile` qualifier is a possible solution here, another would be to "fake" atomic loads (for example, `while(atomicAdd(c, 0) < 2000)`). – user703016 Oct 21 '14 at 16:12
  • But if I use atomic function for a viriable, then I could not be about to use volatile for it....CUDA doesn't allow them together – neilyo Oct 21 '14 at 16:26
  • Are you sure this code is not likely to deadlock? – Vitality Oct 21 '14 at 17:01
  • Yes, as thread 0 keeps updating viriable *c. And other threads keep reading *c. After *c is greater than 1000, other threads should terminate. And after *c is greater than 2000, thread 0 terminate. – neilyo Oct 21 '14 at 17:40
  • @JackOLantern By the way, if I add a printf in the while loop inside else, then it will be fine. – neilyo Oct 21 '14 at 17:42
  • @Cicada But if I use atomic function for a viriable, then I could not be about to use volatile for it....CUDA doesn't allow them together – neilyo Oct 21 '14 at 17:46
  • I think Cicada is right. The atomic functions only pertain (visibility-wise) to the threads they are executed in. Other threads derive no benefit from the fact that an atomic function is ocurring elsewhere. And the statement that "CUDA doesn't allow them together" is not correct either. CUDA provides no overloaded prototypes for atomic functions on `volatile` pointers, but you can certainly cast the `volatile` pointer to a non-volatile one for use in the atomic. `volatile` has no effect on behavior there anyway, but it will impact behavior in other threads. – Robert Crovella Oct 21 '14 at 18:23
  • Hi, I got it fixed. The thing is read is not atomic. So we need to declare it as volatile, and convert it to (int*) when use atomic functions. Thanks a lot. @RobertCrovella – neilyo Oct 21 '14 at 19:16

1 Answers1

3

I'd like to make an analogy: imagine for a second that atomic operations are mutexes: for a program to be well-defined, two threads accessing a shared resource must both agree to use the mutex to access the resource exclusively. If one of the threads accesses the resource without first holding the mutex, the result is undefined.

The same thing is true for atomics: if you decide to treat a particular location in memory as an atomic variable, then all threads accessing that location should agree and treat it as such for your program to have meaning. You should only be manipulating it through atomic loads and stores, not a combination of non-atomic and atomic operations.

In other words, this:

atomicCAS(c,temp, temp+1);

Contains an atomic load-compare-store. The resulting instruction will go all the way down to global memory to load c, do the comparison, and go all the way down to global memory to store the new value.

But this:

while(*c < 2000)

Is not atomic by any means. The compiler (and the hardware) has no idea that c may have been modified by another thread. So instead of going all the way down to global memory, it will simply read from the fastest available cache. Possibly the compiler will even put the variable in a register, because it doesn't see anyone else modifying it in the current thread.

What you would want is something like (imaginary):

while (atomicLoad(c) < 2000)

But to the best of my knowledge there is no such construct in CUDA at the time of writing.

In this regard, the volatile qualifier may help: it tells the compiler to not optimize the variable, and consider it as "modifiable from external sources". This will trigger a load for every read of the variable, although I am not sure this load bypasses all the caches. In practice, it may work, but in theory I don't think you should rely on it. Besides, this will also disable any optimizations on that variable (such as constant propagation or promoting the variable to a register for better performance).

You may want to try the following hack (I haven't tried it):

while(atomicAdd(c, 0) < 2000)

This will emit an atomic instruction that does load from global memory, and therefore should see the most recent value of c. However, it also introduces an (useless in this case) atomic store.

user703016
  • 37,307
  • 8
  • 87
  • 112
  • Hi, I got it fixed. The thing is read is not atomic. So we need to declare it as volatile, and convert it to (int*) when use atomic functions. Thanks a lot. @Robert Crovellat – neilyo Oct 21 '14 at 19:16