
While trying to understand how Java's VarHandle "getAndSet" works, as opposed to its JDK 7 version:

public final V getAndSet(V newValue) {
    // Spin: reread the current value and attempt a CAS until one succeeds.
    while (true) {
        V x = get();
        // If another thread changed the value between get() and here,
        // the CAS fails and we loop around to retry.
        if (compareAndSet(x, newValue))
            return x;
    }
}

I noticed something...

The form of backpressure handling done by the JDK 7 version is not sequentially consistent: if a thread fails the compareAndSet and retries, it may lose its position to some thread that came later.

Even if this operation is deemed "atomic", an implicit queueing buffer is built on its spin... and this buffer is not sequentially consistent, even though the atomicity is provided by the compareAndSet action.

I am not entirely sure why JDK 7 uses the test (the "compare" in compareAndSet) to perform a simple swapping action... since when a person performs the physical action of swapping something, that person is not interested in comparing anything...

compareAndSet employs the cmpxchg instruction, so... my immediate question was... why not use just the xchg instruction for getAndSet? By eliminating the test failure, a single buffering queue would build up... held by the native LOCK.

And so, while I am not sure whether VarHandle DOES in fact use xchg alone (DOES IT?), the question still remains whether the machine handles the backpressure build-up in a sequentially consistent manner...
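For concreteness, here is roughly what a VarHandle-based swap looks like (my Holder class and field name are purely illustrative, not JDK code); note there is no retry loop in the Java source:

import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

class Holder {
    volatile Object value;

    // VarHandle bound to the 'value' field above.
    static final VarHandle VALUE;
    static {
        try {
            VALUE = MethodHandles.lookup()
                    .findVarHandle(Holder.class, "value", Object.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    Object swap(Object newValue) {
        // Volatile-mode atomic exchange: unlike the JDK 7 loop above,
        // any failure/retry is not visible at the Java level.
        return VALUE.getAndSet(this, newValue);
    }
}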

(Here comes the essay... sorry.)

(edit: At first I thought that with mfence one could achieve this FIFO queue... but in reality there is no way for the processor to enforce any policy that defines such a FIFO queueing of "assigned scheduled instructions"; instead the processor's scheduler "may use different algorithms and policies to determine the order and priority of the threads, such as round-robin, shortest job first, priority-based, etc. Some of these algorithms may approximate a FIFO order, but they are not strictly enforced by the processor".)

And there is no evidence that the lock prefix is doing so.

In fact, regarding what the architecture may be doing:

The queuing mechanism used by memory controllers is typically managed at the hardware level and is influenced by the overall design of the memory subsystem, including cache hierarchies, memory interconnects, and bus arbitration protocols.

Instructions in modern processors are designed to perform specific computations and memory operations, rather than directly control low-level hardware details like queuing mechanisms in memory controllers. Processor instructions are more concerned with specifying the operations to be performed on data, such as arithmetic operations, memory loads and stores, control flow instructions, etc.

Please don't ask me who I'm quoting... I think you know "what" said this...

... On reactive systems... Or...[ Why NO native FIFO instruction queue? ]

It seems reasonable that the greater concern for the processor would be "specifying the operations to be performed on data" (what one would call "updates"), instead of defining queueing strategies.

The thing about updates is that an update always relays the previous state, and so the VERSION is implicit in the "form" of the current data. If, let's say, the operand says that the update is i -> i + 1, we can safely assume that the previous version was current - 1. This may imply that defining a queueing strategy is redundant, since the end result will still reflect all changes performed, and a version can be inferred from each change simply by checking its current state.
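As a toy illustration of that point (my own sketch, not JDK internals): with a commutative update like increment, whatever order the hardware queued the updates in is invisible in the end state, and every intermediate value implies its predecessor:

import java.util.concurrent.atomic.AtomicInteger;

public class UpdateOrderDemo {
    public static void main(String[] args) throws InterruptedException {
        AtomicInteger counter = new AtomicInteger(0);

        // Ten threads each apply the update i -> i + 1 a thousand times.
        Thread[] threads = new Thread[10];
        for (int t = 0; t < threads.length; t++) {
            threads[t] = new Thread(() -> {
                for (int k = 0; k < 1000; k++) {
                    counter.incrementAndGet();
                }
            });
            threads[t].start();
        }
        for (Thread t : threads) t.join();

        // No matter how the 10,000 updates were interleaved, the end
        // result reflects all of them: this always prints 10000, and
        // any observed value v implies the previous version was v - 1.
        System.out.println(counter.get());
    }
}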

But this is not how version-based systems work (publisher architectures, aka "Displays")... In version-based systems, historical sequential consistency is the most important factor, especially since changes may completely disregard/override previous states, leaving no way to figure out what those states were... The output does NOT HAVE MEMORY.

And such is the case when displaying events, where previous states are completely disregarded.
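A minimal sketch of what I mean by the output having no memory (again my own toy code, not any framework's API): each publish overwrites the previous state entirely, so nothing about an overwritten state can be recovered from the output:

import java.util.concurrent.atomic.AtomicReference;

public class DisplayDemo {
    public static void main(String[] args) {
        AtomicReference<String> frame = new AtomicReference<>("frame-0");

        // Each publish completely replaces the previous state;
        // nothing about "frame-1" survives inside "frame-2".
        frame.set("frame-1");
        frame.set("frame-2");

        // The display only ever sees the latest value, so the ORDER of
        // publishes is the only history that mattered.
        System.out.println(frame.get()); // prints frame-2
    }
}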

This is why I believe most reactive systems are doing too much when used for such menial tasks as web application or mobile application displays.

Their buffering management is seldom needed; only in specific cases like video streaming is it really required. And if we get pedantic about meaning... even video streaming is somewhat limited by the speed at which our eyes process information, so there is a limit to the number of processes that need buffering (assuming a good connection, which I know is the main/real reason for that buffer).

Defining a queueing mechanism on the lock prefix would create a distinction similar to the dynamic between the volatile and synchronized keywords.

Non-FIFO locks would be used in update actions, where memory management would prioritize performance via reordering, thread priority, etc. These would be used on databases and streaming operations... where no calls are left behind.

Strict FIFO locks would just queue in order of arrival, used in publishing operations like displays, where dropping instructions is always an alternative to relieve backpressure.

So, one pipeline for writing/input (Stateful) and the other for reading/output (Stateless).
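Incidentally, Java already exposes something like this split at the library level (not the instruction level): java.util.concurrent.locks.ReentrantLock takes a fairness flag, and the fair variant hands the lock out in roughly arrival order, queued in software rather than by the lock prefix. A minimal sketch of the two pipelines:

import java.util.concurrent.locks.ReentrantLock;

public class LockPipelines {
    // "Barging" lock: throughput first, arrival order not honored.
    static final ReentrantLock updateLock = new ReentrantLock(false);

    // Fair lock: threads acquire in roughly FIFO arrival order,
    // at a real throughput cost.
    static final ReentrantLock publishLock = new ReentrantLock(true);

    static void update(Runnable work) {      // writing/input (Stateful)
        updateLock.lock();
        try { work.run(); } finally { updateLock.unlock(); }
    }

    static void publish(Runnable frame) {    // reading/output (Stateless)
        publishLock.lock();
        try { frame.run(); } finally { publishLock.unlock(); }
    }
}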

Delark
  • These read like two different questions, and the second half reads like an essay, not a question. – Louis Wasserman Aug 22 '23 at 17:23
  • I think you're mis-using "not sequentially consistent". No other work (or memory operations at least) from this thread can start until after this retry loop finishes, assuming `compareAndSet` has Java's usual seq_cst ordering. So the total order across all threads will still be some interleaving of program-order of each thread, assuming the program didn't have data races on non-`volatile` variables. (Java gives you SC-DRF; non-`volatile` variables might or might not be visible across threads at all, depending on how it compiles, but if they are they're like C++ `memory_order_relaxed`) – Peter Cordes Aug 22 '23 at 18:09
  • Hopefully that pure-Java code is just a placeholder and JVMs will use `xchg`. It would be really inefficient to use a CAS retry loop to implement exchange on any hardware, especially ones like x86 that have a single-instruction atomic swap. But I don't see any correctness problem. The key time is when an operation completes, not when it starts trying. In the retry-loop case, some of the time waiting is software retries instead of just hardware waiting for a turn to own the cache line, but same difference in terms of memory ordering. – Peter Cordes Aug 22 '23 at 18:11
  • Thanks @PeterCordes for your time. Does this ```seq_cst``` ordering maintain sequential order even after the "jump" the while loop applies to the thread? So, order consistency is maintained EVEN in the face of undetermined behavior like failing the CAS and **retrying**... like... the entire loop is "fenced" from reordering?? – Delark Aug 22 '23 at 18:37
  • Each CAS attempt is an SC operation. (Well, I'm assuming that, but if so...). They can't reorder with each other, and a whole sequence of them in one thread is not different from one. The failed CAS attempts have no visibility for other threads, and the successful CAS that leads to exiting the loop is exactly like a seq_cst `exchange` would have been. The loop branch is internal to this thread, part of its program order, so IDK what misconception you have about memory ordering that makes you think it might be relevant. – Peter Cordes Aug 22 '23 at 18:45
  • I was talking about the order of the threads, not the memory consistency across the threads, and I thought that xchg's implicit lock (which according to [this](https://stackoverflow.com/questions/40409297/does-lock-xchg-have-the-same-behavior-as-mfence?rq=2) has a weaker ordering enforcing strategy than mfence),... would handle backpressure with an orderly queue. I am talking from the perspective of the memory, not the thread. – Delark Aug 22 '23 at 19:10

1 Answer


There is no behavioural difference between a thread failing the CAS and retrying vs a thread being preempted just before the exchange and getting scheduled later, so this code is equivalent in correctness to an exchange operation.

Sequential consistency is a total order over all the operations involved, no more and no less. There is no guarantee that thread A, arriving at operation n before thread B arrives at operation m, will execute its operation n + 1 before B executes its operation m + 1; the only implication is that thread A will perform operation n before thread B executes operation m + 1. For instance, the total order n, m, m + 1, n + 1 is perfectly sequentially consistent even though A "arrived" first. (And a program that just refuses to do any action is a conformant program, albeit a useless one.)

The reason for this implementation is ease of maintenance: there is no need for maximum performance in C1 and interpreter code, and converging all operations to compare-and-exchange reduces the amount of JNI code involved. And when the code gets into C2, getAndSet is an intrinsic that will be translated into a single xchg instruction.
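If you want to check this yourself (assuming a HotSpot JVM with the hsdis disassembler plugin on the library path; the class below is just an example), run a loop hot enough to reach C2 and dump the generated assembly:

import java.util.concurrent.atomic.AtomicReference;

public class XchgCheck {
    public static void main(String[] args) {
        AtomicReference<Object> ref = new AtomicReference<>(new Object());
        Object x = new Object();
        // Loop hot enough for the C2 compiler to kick in.
        for (int i = 0; i < 5_000_000; i++) {
            x = ref.getAndSet(x);
        }
        System.out.println(x.hashCode());
    }
}

Run it with java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly XchgCheck and look for an xchg with a memory operand on x86 where the CAS retry loop would otherwise be.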

Quân Anh Mai
  • Thank you for the response. While I agree that there is no difference, the reason for this similarity in behavior of non-sequential consistency (based on the information I've gathered... I'm not very good at this) is that the processor does not seem to have an available policy that enforces a strict FIFO queue for scheduled instructions, and instead employs algorithms that enforce a priority-based order for all instructions made to "wait" on a LOCK, hence delegating this type of scheduling management to higher levels... The information available never talks about this waiting queue ... – Delark Aug 23 '23 at 04:08
  • in "buffer" terms or even in "queue" terms... it just says, "waiting instructions"(plural) so I assume that the processor is ALSO spin-locking around the locked memory until it becomes available (brave assumption I think...). It makes complete sense for the JNI to try to generalize everything as much as possible, so I think I would've done the same thing as to reduce the number of instructions to a single cmpxchg. – Delark Aug 23 '23 at 04:09