I was asking exactly the same question a while ago.
I solved it for my particular use case with a simple piece of code. Other use cases will have different optimal solutions.
Example Use Case 1: Hot loop that needs to check an atomic
Assuming your use case is something like this: a hot loop has to spin as fast as possible, and it cannot afford to check an atomic (or volatile) variable on every iteration, because doing so involves synchronising state between two processor cores.
The solution is remarkably simple: only check every 1024 iterations. Why 1024? It's a power of 2, so any MOD operators get optimised to a fast bitwise AND calculation. Any other power of 2 will do. Tune to suit.
With the check made that infrequently, the overhead of the atomic becomes negligible compared to the work the loop is performing.
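As an aside, the transformation the optimiser performs can be written out by hand; a minimal sketch (the function name is purely illustrative):
#include <cstdint>

// For a power-of-two divisor, the modulo test and the bitwise test give the
// same answer; this is the rewrite an optimising compiler performs.
bool is_multiple_of_1024(std::int32_t n)
{
    return (n & 1023) == 0; // Same result as (n % 1024) == 0.
}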
It would be possible to implement other more complicated solutions. But these lines are sufficient:
// Within a hot loop running on a single core ...
int32_t local = 0;        // Declared once, outside the loop body.
while (workRemaining())   // The hot loop itself (placeholder condition).
{
    // ... do the actual per-iteration work here ...
    if (++local % 1024 == 0) // Optimised to a bitwise AND (checks whether the lower 10 bits are 0).
    {
        // Check atomics, which may force the processor to synchronise state with another core.
    }
}
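For completeness, here is a fuller sketch of the same idea, assuming the shared state is a simple std::atomic<bool> stop flag (the names and the per-iteration work are invented for the example):
#include <atomic>
#include <cstdint>

std::atomic<bool> stopRequested{false}; // Set to true by another thread to stop the worker.

void HotLoop()
{
    std::uint32_t local = 0; // Unsigned, so wrap-around is well defined.
    for (;;)
    {
        // ... the actual per-iteration work goes here ...

        // Only touch the atomic every 1024 iterations, so the cost of
        // synchronising with another core is amortised over many iterations.
        if (++local % 1024 == 0 && stopRequested.load())
            break;
    }
}
The writing thread simply calls stopRequested.store(true) when it wants the loop to exit. The default (sequentially consistent) memory order is used here for simplicity; it can be relaxed if nothing else depends on the flag.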
Example Use Cases 2...N: The Realtime Bible
There is an excellent talk on various levels of locking for other use cases; see: Real-time 101 - David Rowland & Fabian Renn-Giles - Meeting C++ 2019.
Q. Is a lock without memory effects theoretically possible?
- A. Yes. If two threads are pinned to the same core, they could share state using registers. If two threads run on different cores that are close enough on the die to share a cache level (L2 on some designs, more commonly L3), they could share state through that cache. Otherwise, threads on different cores usually communicate shared state via a flush to RAM.
Q. Would a lock without memory effects actually be more performant?
- A. Yes. However, this is really only applicable to embedded operating systems or device drivers; see the update below.
Q. Is there a built-in way to accomplish this in C++ and/or Java?
- A. No. But the answers above can get very close, often by avoiding locks altogether for most of the time (see the try-lock sketch after these questions). It is better to spend the time mastering a good profiler.
Q. If there isn't, can it be implemented in C++ and/or Java?
- A. No. It is not realistic in a high-level language; see the update below.
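To make the "avoiding locks most of the time" point concrete, here is a minimal sketch of one pattern in the spirit of that talk: the time-critical thread only ever try-locks, so it can never block on the other thread. The struct, the names and the idea of copying into a local snapshot are assumptions made up for this example:
#include <mutex>

struct SharedParams { double gain = 1.0; }; // Example shared state.

std::mutex   paramsMutex;  // Locked normally by the non-time-critical thread when it writes.
SharedParams sharedParams; // Protected by paramsMutex.

void TimeCriticalTick(SharedParams& localCopy)
{
    // try_to_lock never blocks: if the writer currently holds the mutex,
    // keep using the previous local copy and try again on the next tick.
    std::unique_lock<std::mutex> lock(paramsMutex, std::try_to_lock);
    if (lock.owns_lock())
        localCopy = sharedParams;

    // ... do the per-tick work using localCopy ...
}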
Update
Hardware interrupts are entirely distinct from software interrupts. A hardware interrupt causes the processor's Instruction Pointer (IP) to switch to executing another service routine. So a lock without memory effects is theoretically possible when many "threads" run on a single core and those "threads" (i.e. the Interrupt Service Routines triggered by hardware interrupts) communicate via CPU registers, or at the very least via an internal cache (L1, L2, L3), as this never ends up hitting RAM.
On a practical level, this is probably not relevant to high-level languages such as C++ or Java, and it is probably not relevant to user-mode processes on general-purpose operating systems such as Linux or Windows. It is probably only achievable when using an embedded OS such as QNX, or perhaps when writing kernel-mode device drivers for Windows or Linux.
So in practice, a reasonable rule of thumb is to just assume that all locks have memory effects. If one is concerned about performance, run a profiler. If there are performance issues, choose from the selection of threading architectures in said Realtime Bible.