1

I have two pointers in memory and I want to swap them atomically, but atomic operations in CUDA support only int types. Is there a way to do the following swap?

classA* a1 = malloc(...);
classA* a2 = malloc(...);
atomicSwap(a1,a2);
Fabio T.
  • 109
  • 1
  • 6
  • 1
    What is this `atomicSwap` you are referring to? There is no such thing as far as I am aware – talonmies Dec 07 '17 at 12:14
  • 3
    Furthermore there is no CUDA atomic operation that holds two separate locations open for read-modify-write activity at the same time, as part of the same operation. Any and all CUDA atomic operations operate **atomically** on one location (address) only. It is also not correct to say that atomic operations in CUDA support only int types: there are various atomics that support operations on non-integer types. And, as already mentioned, there is no `atomicSwap` in CUDA. – Robert Crovella Dec 07 '17 at 15:09
  • 1
    Finally, for atomic operations limited to integer types, it's possible to convert pointer types to (long) integer types for purposes of copying. You might want to read the programming guide section on atomics, [here](http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#atomic-functions). It covers all of these concepts that I just went through. – Robert Crovella Dec 07 '17 at 15:10

2 Answers

1

When writing device-side code...

While CUDA provides atomics, they can't cover multiple (possibly remote) memory locations at once.

To perform this swap, you will need to "protect" access to both of these values with something like a mutex, and have whoever wants to write to them take hold of the mutex for the duration of the critical section (like C++'s host-side std::lock_guard). This can be done using CUDA's actual atomic facilities, e.g. compare-and-swap, and is the subject of this question:

Implementing a critical section in CUDA
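
Here is a minimal sketch of that mutex approach, assuming a single global lock word and Volta-or-newer independent thread scheduling (on older GPUs, threads of one warp spinning on a lock held by a sibling thread can deadlock); `atomicSwapPtrs` is an illustrative name, not a CUDA API:

struct classA;   // the question's type; any pointer type works here

__device__ int lock = 0;   // 0 = free, 1 = held

__device__ void atomicSwapPtrs(classA** a, classA** b)
{
    // Spin until we take the lock (0 -> 1).
    while (atomicCAS(&lock, 0, 1) != 0) { }
    __threadfence();          // make prior writes visible before we read

    classA* tmp = *a;         // critical section: both locations are ours
    *a = *b;
    *b = tmp;

    __threadfence();          // publish the swap before releasing
    atomicExch(&lock, 0);     // release
}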

A caveat to the above is mentioned by @RobertCrovella: if you can make do with, say, a pair of 32-bit offsets rather than two 64-bit pointers, and you store them in a single 64-bit-aligned struct, then you can use compare-and-exchange on the whole struct to implement an atomic swap of both values at once.
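
A minimal sketch of that packed-struct idea, assuming 32-bit offsets suffice; `OffsetPair` and `swapHalves` are illustrative names, not CUDA APIs:

#include <cstdint>

union OffsetPair
{
    struct { uint32_t a, b; } off;   // the two "pointers", as offsets
    unsigned long long packed;       // 8-byte view for the 64-bit atomics
};

__device__ void swapHalves(OffsetPair* p)
{
    unsigned long long seen = p->packed, assumed;
    do {
        assumed = seen;
        OffsetPair next;
        next.packed = assumed;
        uint32_t t = next.off.a;     // build the value with halves swapped
        next.off.a = next.off.b;
        next.off.b = t;
        // atomicCAS returns the value it found; retry if another thread
        // changed the pair between our read and the CAS.
        seen = atomicCAS(&p->packed, assumed, next.packed);
    } while (seen != assumed);
}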

... but is it really device-side code?

Your code actually doesn't look like something one would run on the device: memory allocation is usually (though not always) done from the host side before you launch your kernel and do the actual work. If you can make sure these pointer swaps only happen on the host side (think CUDA events and callbacks), and that device-side code won't be interfered with by them, then you can just use plain vanilla C++ facilities for concurrent programming (like the std::lock_guard mentioned above).
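
For the host-only case, a minimal sketch using std::mutex, assuming all swaps really do happen on the host (e.g. between launches or in a stream callback):

#include <mutex>
#include <utility>

struct classA;   // the question's type; only pointers are touched here

std::mutex ptrMutex;
classA* a1;
classA* a2;

void hostSwap()
{
    std::lock_guard<std::mutex> guard(ptrMutex);  // RAII critical section
    std::swap(a1, a2);
}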

einpoklum
  • 118,144
  • 57
  • 340
  • 684
-1

I managed to get the behaviour I needed; it is not an atomic swap, but it is still safe. The context was a monotonic linked list working on both CPU and GPU:

#include <cassert>
#include <cstdlib>

// Pun a pointer as a 64-bit integer so it can be handed to the
// unsigned-long-long overloads of the CUDA atomics.
template<typename T>
union readablePointer
{
    T* ptr;
    unsigned long long int address;
};

template<typename T>
struct LinkedList
{

    struct Node
    {
        T value;
        readablePointer<Node> previous;
    };

    Node start;   // sentinel marking the bottom of the list
    Node end;     // sentinel whose `previous` points at the last real node

    int size;

    __host__ __device__ void initialize()
    {
        size = 0;

        start.previous.ptr = nullptr;
        end.previous.ptr = &start;
    }

    __host__ __device__ void push_back(T value)
    {
        // Allocate the new node: device-heap malloc in device code,
        // ordinary malloc in host code.
        Node* node = (Node*)malloc(sizeof(Node));

        readablePointer<Node> nodePtr;
        nodePtr.ptr = node;

        nodePtr.ptr->value = value;

#ifdef __CUDA_ARCH__
        // Device path: atomicExch splices the new node in front of `end`
        // in a single atomic step and returns the old tail address.
        nodePtr.ptr->previous.address = atomicExch(&end.previous.address, nodePtr.address);
        atomicAdd(&size, 1);
#else
        // Host path: assumes no concurrent host-side writers.
        nodePtr.ptr->previous.address = end.previous.address;
        end.previous.address = nodePtr.address;
        size += 1;
#endif

    }

    __host__ __device__ T pop_back()
    {
        assert(end.previous.ptr != &start);

        readablePointer<Node> lastNodePtr;
        lastNodePtr.ptr = nullptr;

#ifdef __CUDA_ARCH__
        // Device path: reading end.previous.ptr->previous.address is not
        // atomic with the exchange itself; this is why this is not a true
        // atomic swap, but it is safe for this usage pattern.
        lastNodePtr.address = atomicExch(&end.previous.address, end.previous.ptr->previous.address);
        atomicSub(&size, 1);
#else
        lastNodePtr.address = end.previous.address;
        end.previous.address = end.previous.ptr->previous.address;
        size -= 1;
#endif
        T toReturn = lastNodePtr.ptr->value;

        // A node must be freed on the same side that allocated it:
        // device-heap allocations cannot be freed by host free().
        free(lastNodePtr.ptr);

        return toReturn;
    }

    __host__ __device__ void clear()
    {
        while(size > 0)
        {
            pop_back();
        }
    }
};
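
A hypothetical usage sketch (not part of the original answer), assuming the list itself lives in managed memory so both sides can reach it; note that nodes allocated by device-side malloc must also be popped, and thus freed, in device code:

#include <cstdio>

__global__ void producer(LinkedList<int>* list)
{
    list->push_back((int)threadIdx.x);   // device path uses the atomics above
}

__global__ void consumer(LinkedList<int>* list)
{
    list->clear();   // frees the device-heap nodes where they were allocated
}

int main()
{
    LinkedList<int>* list = nullptr;
    cudaMallocManaged(&list, sizeof(LinkedList<int>));
    list->initialize();

    producer<<<1, 32>>>(list);
    cudaDeviceSynchronize();
    printf("size after producer: %d\n", list->size);   // expect 32

    consumer<<<1, 1>>>(list);
    cudaDeviceSynchronize();

    cudaFree(list);
    return 0;
}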
Fabio T.
  • 109
  • 1
  • 6