Does a plain read of a variable that is updated with interlocked functions always return the latest value?

Question

If you only change a MyInt: Integer variable in one or more threads with one of the interlocked functions, let's say InterlockedIncrement, can we guarantee that after the InterlockedIncrement is executed, a plain read of the variable in any thread will return the latest updated value? Yes or no, and why?

If not, is it possible to achieve that in Delphi? Note that I'm talking about only one variable, so there is no need to worry about consistency across two or more variables.

The root problem and doubt seem essentially the same as in this SO post, but that one targets C#, and I'm using Delphi 2007, so I have no access to volatile or to newer versions of Delphi. In that discussion, two major problems were raised that seem to affect Delphi as well:

1. The cache of the processor reading the variable may not be updated.
2. The compiler may optimize the code in a way that causes problems for the read.

If this is really a problem, I'd be very worried about using even a simple counter with InterlockedIncrement, or solutions like the lock-free initialization proposed here, and would fall back to plain Critical Sections or MultiReaderSingleWriter for safety.

Initial analysis

This is what I've found so far, but feel free to address the problems in other ways if appropriate, or even to raise other problems I'm not aware of, so that the objective of the question can be achieved:

For problem 1, I expected that the "full fence" would also force the caches of other processors to be updated, but from what I've read that seems not to be the case. It looks like the cache would only be updated if a "read barrier" (or whatever it is called) were issued on the processor that will read the variable. If this is true, is there a way to issue such a "read barrier" in Delphi just before reading the variable? A full fence seems to imply both read and write barriers, so that would also be OK. Since there is no InterlockedRead function according to the discussion in the first post, could we try (just speculating) to work around that using something like InterlockedCompareExchange (ugh... writing to the variable in order to read it smells bad), or maybe low-level "lock" assembly instructions (which could be encapsulated)?
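
The InterlockedCompareExchange idea speculated about above could be sketched roughly as follows. This is only a minimal sketch under stated assumptions: `InterlockedRead` is a hypothetical helper name (it is not part of Windows.pas), and the code assumes the Delphi 2007 `Windows.pas` declaration of `InterlockedCompareExchange` that takes `var Destination: Longint`. Comparing against 0 and exchanging with 0 means a store can only ever write the value that is already there, and the call returns the current value with the full-fence semantics of an interlocked operation.

```delphi
program InterlockedReadSketch;

{$APPTYPE CONSOLE}

uses
  Windows;

var
  // Global Integer variables are naturally 4-byte aligned.
  MyInt: Integer = 0;

// Hypothetical helper, not part of Windows.pas.
// If Target is 0, the call stores 0 again (no visible change);
// otherwise nothing is stored. Either way the current value is returned.
function InterlockedRead(var Target: Integer): Integer;
begin
  Result := InterlockedCompareExchange(Target, 0, 0);
end;

begin
  InterlockedIncrement(MyInt);
  InterlockedIncrement(MyInt);
  Writeln(InterlockedRead(MyInt)); // prints 2
end.
```

Whether this is needed at all on x86/x64 is exactly what the question asks. The sketch only shows that the read can be routed through the same interlocked API as the writes, which keeps the reading side in one place if the code is later ported.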

For problem 2, would Delphi's optimizations have an impact here? Is there any way to avoid that?

Edit: The solution must work in D2007, but I'd prefer not to make a possible future migration to newer Delphi harder, and I'd like to use the same piece of code on ARM as well (this became clear to me after David's comments). So, if possible, it would be nice if the solution were not coupled to the x86/x64 memory model. Ideally I would only need to replace the plain Windows.pas interlocked functions with whatever provides the same interlocked functionality in newer Delphi/ARM, without having to review the logic for ARM (one less concern).

But do the interlocked functions provide enough abstraction from the CPU architecture in this case? Problem 1 suggests that they don't, but I'm not sure whether that would affect Delphi on ARM. Is there any way around it that keeps things simple and still gives a relevant performance gain over critical sections and similar sync objects?

Delphi doesn't do any optimisations that could lead you to read a stale value. So long as your variable is aligned you are safe to use a simple read. As far as the processor aspects go, because Delphi 2007 is for Windows only, you can rely on the underlying strong memory model of x86. — David Heffernan, Jan 24 '18 at 14:26
@David: reads/writes of aligned variables are atomic, but apart from the compiler, my other concern is the cache. So in the case of an aligned variable, could you detail a little what cooperative behavior of x86 with the interlocked functions makes a plain read safe? Or is it safe anyway just because of x86 alone, even without using interlocked? Would it be safe for x64 as well (we may eventually upgrade) for the same reasons? — Thiago Linhares de Oliveira, Jan 24 '18 at 17:30
x86 and x64 strong memory model gives you the guarantees that you need — David Heffernan, Jan 24 '18 at 17:35
Just to make sure, when using aligned var in x86/x64 you still need to call the interlocked function when changing the value, right? Or in that case there is no need even to do that, in order to plain read the latest update? Otherwise, why would people bother to call interlocked, like in the example at the "lock-free initialization" link I posted? — Thiago Linhares de Oliveira, Jan 24 '18 at 18:09
Interlocked operations allow you to perform read, modify, write as a single atomic operation. Otherwise, for instance `inc(somevar)` is separate operations. If two threads perform it then the operations can interleave. If you have a single thread that writes, and multiple that read then you can use plain simple read and write. Always of course with aligned data. — David Heffernan, Jan 24 '18 at 18:27
And if there may be more than one thread writing, will x86/x64 also protect my plain writes/reads, or in that case will I need the interlocked functions to write? Also, if it does not fall too far outside the scope of the question, it would be nice if you could provide just enough detail about what in x86/x64 makes reading the latest updated value possible and safe. — Thiago Linhares de Oliveira, Jan 24 '18 at 19:13
If you have multiple write threads, incrementing a counter, then you need an interlocked increment. As for the second part of that comment, read about strong memory model. A good place to start would be strong memory model as relates to double checked locking. — David Heffernan, Jan 24 '18 at 19:16
Thanks for the directions. I've found that x86/64 is "usually strongly-ordered", as stated in [link](http://preshing.com/20120930/weak-vs-strong-memory-models/). I'd like a second opinion from somebody with more expertise on that than me. I'd like to avoid relying on x86/64 behavior with plain reads/writes if possible. Also, I do need a solution for D2007 now, but I'd like not to make a possible future use on ARM harder. I didn't mention this initially because I thought using interlocked functions would abstract such details, which seems not to be the case judging from the C# post. — Thiago Linhares de Oliveira, Jan 25 '18 at 01:01
Edited the question, adding the concern for the possible future migration to newer Delphi and ARM support, if possible. — Thiago Linhares de Oliveira, Jan 25 '18 at 01:44
I'm sorry, but turning this into a long string of ever changing questions isn't much fun for anybody. Bye. — David Heffernan, Jan 25 '18 at 04:22
Anyway, on modern Delphi the TInterlocked class and AtomicXXX intrinsics serve the same purpose. — David Heffernan, Jan 25 '18 at 07:04
This is a highly complex subject with many pitfalls, so I felt that a bit more detail was at least desirable, if not necessary. I only edited the question once. — Thiago Linhares de Oliveira, Jan 25 '18 at 12:39
I checked the TInterlocked, and it has an interesting TInterlocked.Read (no documentation, not sure what it does). But since there is no out of the box solution for D2007/plain windows interlocked functions, I can accept your proposed solution given the D2007 context, although I'll try to avoid it due to the coupling. Thanks for the replies. — Thiago Linhares de Oliveira, Jan 25 '18 at 12:41
Fair enough. Sorry. I was rude. It's no problem. Your edit and comments were fine. It was my problem. — David Heffernan, Jan 25 '18 at 19:13
Cache coherence is handled by the processor and its intrinsic protocols (MESI, MSI, etc.). You don't need to worry about cache line issues because you can't change their behavior programmatically. Once you have guaranteed that your reads and writes are performed in order, the processor will make sure those changes are reflected in main memory. This [book](https://www.amazon.com/Computer-Organization-Architecture-William-Stallings/dp/0134101618) has a lot of information about caching — EProgrammerNotFound, Jan 27 '18 at 17:52
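
To illustrate the distinction drawn in the comments above between an atomic read-modify-write and a plain `Inc`, here is a minimal sketch for Delphi 2007 on Windows. The names `TIncThread`, `RunDemo`, `ThreadCount`, and `Iterations` are made up for illustration. Several threads increment one aligned counter with `InterlockedIncrement`, and the main thread does a plain read only after every writer has finished.

```delphi
program CounterSketch;

{$APPTYPE CONSOLE}

uses
  Windows, Classes;

const
  ThreadCount = 4;
  Iterations = 100000;

var
  // Global Integer variables are naturally 4-byte aligned.
  Counter: Integer = 0;

type
  TIncThread = class(TThread)
  protected
    procedure Execute; override;
  end;

procedure TIncThread.Execute;
var
  I: Integer;
begin
  for I := 1 to Iterations do
    // Atomic read-modify-write. With Inc(Counter) the separate read,
    // add, and write steps of two threads could interleave and lose updates.
    InterlockedIncrement(Counter);
end;

procedure RunDemo;
var
  Threads: array[0..ThreadCount - 1] of TIncThread;
  I: Integer;
begin
  for I := Low(Threads) to High(Threads) do
    Threads[I] := TIncThread.Create(False); // start running immediately
  for I := Low(Threads) to High(Threads) do
  begin
    Threads[I].WaitFor; // block until this writer has finished
    Threads[I].Free;
  end;
  // Plain read after all writers are done.
  Writeln(Counter); // prints 400000
end;

begin
  RunDemo;
end.
```

Replacing `InterlockedIncrement(Counter)` with `Inc(Counter)` in this sketch would be expected to print a smaller number on a multi-core machine at least some of the time, which is the interleaving the comments describe.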


0 Answers