Tsyvarev's idea (of using a bit inside the pointer) is interesting, and for some use-cases, probably more efficient than storing the boolean separately. For other use cases, storing them separately and using a double-size compare-and-exchange to swap both at once will be best. That makes it more efficient to atomically modify just the boolean or just the ref. Storing them together means you always have to do an atomic read-modify-write to change one but not the other (on x86 either a lock or [mem], reg instruction, or a lock cmpxchg loop if you also need the old value), instead of just an atomic store that doesn't affect the other. (Or an atomic xchg if you want the old value.)
To implement this with two separate members of a struct, see my answer on another question about using compare_exchange_weak on atomic<two_member_struct>. I'd suggest storing the boolean in a pointer-sized integer, to avoid any padding in the object that needs to be ignored, or that can lead to a failed cmpxchg when the padding doesn't match but the data does.
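For reference, a minimal sketch of that separate-members layout might look something like the following (untested; the struct and member names are my own placeholders, not the code from the linked answer). Whether the double-size CAS is actually lock-free depends on the target; on x86-64 it needs cmpxchg16b, so compile with -mcx16 or expect a libatomic fallback.

#include <atomic>
#include <stdint.h>

template<class T>
struct RefAndMark {
    T*        ref;
    uintptr_t mark;   // pointer-sized instead of bool: no padding bytes for cmpxchg to trip over
};

template<class T>
class MarkableReferenceSep {
    std::atomic<RefAndMark<T>> val;   // 16 bytes on x86-64: needs a double-width CAS to be lock-free
public:
    MarkableReferenceSep(T* ref, bool mark) : val(RefAndMark<T>{ref, mark}) {}

    RefAndMark<T> getBoth(std::memory_order order = std::memory_order_seq_cst) {
        return val.load(order);
    }
    // CAS both members at once; on failure, 'expected' is updated with the current value.
    bool cmpxchgBoth_weak(RefAndMark<T> &expected, T* desiredRef, bool desiredMark,
                          std::memory_order order = std::memory_order_seq_cst) {
        return val.compare_exchange_weak(expected, RefAndMark<T>{desiredRef, desiredMark}, order);
    }
};

The reason for making mark pointer-sized is the padding issue above: compare_exchange compares the whole object representation, so stray padding bytes can make a compare fail even when the members match.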
If you often update your MarkableReferences with a new pointer-and-boolean together, embedding data into the pointer is probably good for performance. Otherwise it's probably bad, if you can get the compiler to make efficient code for the separate-members way.
Also, if you often need to get both the pointer and flag atomically, the embedded data way is good for that.
Tsyvarev's implementation needs to change to implement atomically modifying the boolean without setting a new pointer. class MarkableReference should itself have an atomic member variable, so it can use fetch_or and so on.
This is untested, but it compiles to code that looks correct (on the Godbolt compiler explorer).
On x86-64, you could let this work even for unaligned pointers by using the high bit of the pointer instead of the low bit. x86-64 requires that virtual addresses be canonical, with their top 16 bits matching bit 47, i.e. that 64-bit addresses are really sign-extended 48-bit addresses. Future x86-64 CPUs could extend that, though, allowing a full 64-bit virtual address space instead of the current 48 bits. Then you'd have to run programs using this code in a compatibility mode where the OS never gives them addresses that are "non-canonical" according to the old rules.
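A sketch of that high-bit packing (untested, and these helpers are just for illustration; they're not part of the class below). The trick for recovering the pointer relies on canonical addresses: shift the flag bit out the top, then arithmetic-shift back in a copy of bit 62, which equals the original bit 63 for any canonical 48-bit address. The signed cast and right shift are technically implementation-defined before C++20, but do the expected sign-extending thing on mainstream compilers.

#include <stdint.h>

// Assumes a 64-bit target (x86-64), where uintptr_t is 64 bits wide.
static const uintptr_t high_mask = uintptr_t(1) << 63;

inline uintptr_t combine_high(void* ref, bool mark) {
    // replace bit 63 (which matches bit 47 for a canonical address) with the mark
    return (reinterpret_cast<uintptr_t>(ref) & ~high_mask)
           | (static_cast<uintptr_t>(mark) << 63);
}

inline void* ref_from_high(uintptr_t v) {
    // shift the mark out, then arithmetic-shift a copy of bit 62 back into bit 63
    return reinterpret_cast<void*>(static_cast<intptr_t>(v << 1) >> 1);
}

inline bool mark_from_high(uintptr_t v) {
    return v >> 63;   // the flag lives in the sign bit
}

The low-bit implementation below sticks with the simpler requirement that pointers have their low bit clear (i.e. alignment of at least 2).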
#include <atomic>
#include <stdint.h>    // for uintptr_t
#include <assert.h>

template<class T>
class MarkableReference
{
private:
    std::atomic<uintptr_t> val;
    static const uintptr_t mask = 1;
    uintptr_t combine(T* ref, bool mark) {
        return reinterpret_cast<uintptr_t>(ref) | mark;
    }

public:
    MarkableReference(T* ref, bool mark)
        : val(combine(ref, mark))
    {
        // note that construction of an atomic is not *guaranteed* to be atomic, in case that matters.
        // On most real CPUs, storing a single aligned pointer-sized integer is atomic
        // This does mean that it's not a seq_cst operation, so it doesn't synchronize with anything
        // (and there's no MFENCE required)
        assert((reinterpret_cast<uintptr_t>(ref) & mask) == 0 && "only works with pointers that have the low bit cleared");
    }

    MarkableReference(MarkableReference &other, std::memory_order order = std::memory_order_seq_cst)
        : val(other.val.load(order))
    {}
    // IDK if relaxed is the best choice for this, or if it should exist at all
    MarkableReference &operator=(MarkableReference &other)
    {
        val.store(other.val.load(std::memory_order_relaxed), std::memory_order_relaxed);
        return *this;
    }

    /////// Getters

    T* getRef(std::memory_order order = std::memory_order_seq_cst)
    {
        return reinterpret_cast<T*>(val.load(order) & ~mask);
    }

    bool getMark(std::memory_order order = std::memory_order_seq_cst)
    {
        return (val.load(order) & mask);
    }
    T* getBoth(bool& mark, std::memory_order order = std::memory_order_seq_cst)
    {
        uintptr_t current = val.load(order);
        mark = current & mask;
        return reinterpret_cast<T*>(current & ~mask);
    }

    /////// Setters (and exchange)

    // memory_order_acq_rel would be a good choice here
    T* xchgRef(T* ref, std::memory_order order = std::memory_order_seq_cst)
    {
        uintptr_t old = val.load(std::memory_order_relaxed);
        bool success;
        do {
            uintptr_t newval = reinterpret_cast<uintptr_t>(ref) | (old & mask);
            success = val.compare_exchange_weak(old, newval, order);
            // updates old on failure
        } while( !success );

        return reinterpret_cast<T*>(old & ~mask);
    }

    bool cmpxchgBoth_weak(T* &expectRef, bool& expectMark, T* desiredRef, bool desiredMark, std::memory_order order = std::memory_order_seq_cst)
    {
        uintptr_t desired = combine(desiredRef, desiredMark);
        uintptr_t expected = combine(expectRef, expectMark);
        bool status = val.compare_exchange_weak(expected, desired, order);
        expectRef = reinterpret_cast<T*>(expected & ~mask);
        expectMark = expected & mask;
        return status;
    }

    void setRef(T* ref, std::memory_order order = std::memory_order_seq_cst)
    { xchgRef(ref, order); }  // I don't see a way to avoid cmpxchg without a non-atomic read-modify-write of the boolean.

    void setRef_nonatomicBoolean(T* ref, std::memory_order order = std::memory_order_seq_cst)
    {
        uintptr_t old = val.load(std::memory_order_relaxed);  // maybe provide a way to control this order?
        // !!modifications to the boolean by other threads between here and the store will be stepped on!
        uintptr_t newval = combine(ref, old & mask);
        val.store(newval, order);
    }


    void setMark(bool mark, std::memory_order order = std::memory_order_seq_cst)
    {
        if(mark)
            val.fetch_or(mask, order);
        else
            val.fetch_and(~mask, order);
    }

    bool toggleMark(std::memory_order order = std::memory_order_seq_cst)
    {
        return mask & val.fetch_xor(mask, order);
    }

    bool xchgMark(bool mark, std::memory_order order = std::memory_order_seq_cst)
    {
        // setMark might still compile to efficient code if it just called this and let the compiler optimize away the fetch part
        uintptr_t old;
        if(mark)
            old = val.fetch_or(mask, order);
        else
            old = val.fetch_and(~mask, order);
        return (old & mask);
        // It might be ideal to compile this to x86 BTS or BTR instructions (when the old value is needed)
        // but clang uses a cmpxchg loop.
    }
};
Usage examples, with asm output showing that this compiles efficiently. (See the godbolt link above)
int a;
int b;
MarkableReference<int> mr_a(&a, true);
MarkableReference<int> mr_b(&b, false);
bool getbool(MarkableReference<int> &mr) {
    return mr.getMark();  // using the default memory_order_seq_cst
}
mov rax, qword ptr [rdi]
and eax, 1
ret
void storeToRef(MarkableReference<int> &mr, int val) {
    //(*mr.getRef(memory_order_relaxed)) = val;  // less code on weakly-ordered CPUs like MIPS
    (*mr.getRef()) = val;
}
mov rax, qword ptr [rdi]
and rax, -2
mov dword ptr [rax], esi
ret
void foo() {
    mr_a.setMark(true, memory_order_relaxed);
}
lock or qword ptr [rip + mr_a], 1
ret
void bar() {
    mr_b = mr_a;
}
// no MFENCE since I decided to make operator= use memory_order_relaxed. acquire / release would also not need MFENCE on x86
mov rax, qword ptr [rip + mr_a]
mov qword ptr [rip + mr_b], rax
ret
// this one compiles to a cmpxchg loop and a branch :/
// but I think that's unavoidable
void baz() {
    bool tmp = mr_b.xchgMark(false);
    mr_a.setMark(tmp);
}
mov rax, qword ptr [rip + mr_b]
.LBB4_1: # =>This Inner Loop Header: Depth=1
mov rcx, rax
and rcx, -2
lock cmpxchg qword ptr [rip + mr_b], rcx
jne .LBB4_1
test al, 1
jne .LBB4_3
lock and qword ptr [rip + mr_a], -2
ret
.LBB4_3:
lock or qword ptr [rip + mr_a], 1
ret
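For completeness, here's the kind of retry loop you'd write around cmpxchgBoth_weak (a sketch added for illustration, not part of the Godbolt output above): try to set the mark on a node while keeping the same pointer, the usual first step of a lock-free-list deletion.

// Returns true if this call set the mark, false if it was already set.
bool try_mark(MarkableReference<int> &mr)
{
    bool mark;
    int *ref = mr.getBoth(mark);   // seq_cst by default
    while (!mark) {
        // On failure, ref and mark are updated with the currently-stored values,
        // so a spurious failure or a pointer change just retries with fresh data.
        if (mr.cmpxchgBoth_weak(ref, mark, ref, true))
            return true;
    }
    return false;   // some other thread marked it first
}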