
I was researching the workings of the GetSystemTimeAsFileTime function to see how optimized it is internally and stumbled upon this seemingly simple set of instructions (in my Windows 10 system):

(screenshot: x86 disassembly of GetSystemTimeAsFileTime)

So, as you can see, the function simply reads from 3 global variables: 2 of them hold the system time (i.e. a FILETIME); it then compares the lower DWORD to the 3rd variable and, if they are the same, loops around until they are not.

As far as I know the pause instruction is needed to halt the CPU for a short period of time to save power. Is that why it's there?

Also, if that is the case, why are they not using a critical section to prevent thread synchronization issues with those 3 global variables?

And lastly, what's the purpose of that loop?

EDIT: Interesting find. I looked into the x64 implementation and it is much simpler. It's just this:

(screenshot: x64 disassembly of GetSystemTimeAsFileTime)

c00000fd
  • @MichaelPetch: Hmm. Where do you see it? There's nothing `critical` or `mutually exclusive` in those `mov` instructions. – c00000fd Dec 09 '17 at 03:08
  • 0x76cbc4B7 is the start of the loop, the `cmp eax, ecx` at 0x76cbc4BD is the exit condition of the spin wait loop and the loop back occurs at 0x76cbc4D2 just after the `pause`. Thus 0x76cbc4B7 is the top of the spin wait loop and 0x76cbc4D2 the bottom. The code that runs when EAX and ECX compare equal (end of the spin wait) is at 0x76cbc4C1 and eventually returns. ESI, EDX and EBX contain the addresses of the global variables. – Michael Petch Dec 09 '17 at 03:13
  • 4
    I found some [specifics about exactly how this particular synchronization for the timer](https://www.dcl.hpi.uni-potsdam.de/research/WRK/2007/08/getting-os-information-the-kuser_shared_data-structure/) works. You may wish to read this especially the section _Windows Times_ which specifically defines what a function wanting to retrieve the timer values must do to get a time value detailing the synchronisation algorithm. – Michael Petch Dec 09 '17 at 04:06
  • 1
    If you read that article and look at your code - the 3 globals (In the Shared User Data area) are part of the kernel system timer structure. 0x7ffe0014 is `LowPart`, 0x7ffe0018 is `High1Time` and 0x7ffe001c is `High2Time` – Michael Petch Dec 09 '17 at 04:07
  • @MichaelPetch Good link. Thanks. I'll check that out. PS. Although your assessment of spin wait locks is incorrect. The way to implement a lock on an x86 instruction is with the `lock` prefix/opcode. Simply looping around won't implement a mutually exclusive section. I'll need to see in that article if there's a mentioning of some other "trick" that they could've implemented. – c00000fd Dec 09 '17 at 04:13
  • You should read the link I gave and you'll understand why this is a type of spin wait lock. You don't need a `lock` prefix depending on how the synchronization works. In this case `lock` is avoided because it would require the system timer interrupt to block. Rather than having the system timer interrupt block on a `lock` they devised another mechanism. This one requires the person retrieving the time to compare all 3 parts in a specific order, and if `High1Part` and `High2Part` are equal then the values in `High1Part` and `Low1Part` are okay. – Michael Petch Dec 09 '17 at 04:16
  • 3
    The trick is in the strict ordering in which the 3 time fields *must* be written to and read from. The timer interrupt service and the loop used by `GetSystemTimeAsFileTime()` follow that ordering, to ensure that `GetSystemTimeAsFileTime()` has obtained an accurate and consistent time, without actually locking anything, before it can exit. – Remy Lebeau Dec 09 '17 at 04:16
  • @RemyLebeau: Hmm. Yeah. I just read it. So I guess if you rely on x86's `mov r32, [m32]` instruction being atomic. I didn't know that it was though, especially on a multi-CPU system? – c00000fd Dec 09 '17 at 04:24
  • 1
    On x86 a 32-bit `mov` to and from a memory location that is 32-bit aligned (natural alignment) will be atomic. Alignment is critical with a 32-bit `mov` as atomicity isn't guaranteed if it isn't naturally aligned. – Michael Petch Dec 09 '17 at 04:35
  • I decided to search for a good description of alignment and atomicity on x86/x86-64 hardware. @PeterCordes wrote a very good Stackoverflow answer dealing with this subject: https://stackoverflow.com/a/36685056/3857942 – Michael Petch Dec 09 '17 at 04:42
  • @MichaelPetch: Yeah, if that's the case it might work. Although take a look at x64 code (I updated my OP.) No such loop there. It just reads the entire 64-bit `FILETIME` in one quick swoop and bam, done! I assume that for x64 16-byte alignment also applies to atomicity, hah? – c00000fd Dec 09 '17 at 04:47
  • Yep, as it should. In x86-64 a single `mov` instruction can move a complete `qword` (64 bits) atomically from/to memory as long as it is naturally aligned (a 64-bit value has to be 8-byte aligned in memory to guarantee atomicity). The regular `mov` instructions in 32-bit code can only deal with a maximum of 32-bit values. Since all 64 bits of `Low1Part` and `High1Part` can be written atomically in a single instruction by the interrupt and read atomically in a single `mov`, the `High2Part` isn't needed at all (`High2Part` was solely for synchronization purposes). – Michael Petch Dec 09 '17 at 04:50
  • @MichaelPetch: Ok. Thanks. I see now. It was indeed a clever trick in case of an x86 code. So I learned something today. Although for x64 code, is it 8-byte or 16-byte alignment needed for atomicity? I thought it was 16. – c00000fd Dec 09 '17 at 04:56
  • x86 requires natural alignment (alignment = size of the data being moved). So to guarantee atomicity you need to align a value so that its alignment is equal to its size. So a 16-bit value is atomic if it is 16-bit aligned, a 32-bit value is atomic if 32-bit aligned, a 64-bit value if 64-bit aligned, etc. – Michael Petch Dec 09 '17 at 05:01
  • @MichaelPetch: It's a user-mode code. No `syscall` trip to the kernel there. In a sense it's a very fast implementation of that API. I'm actually (for once) impressed with Microsoft. – c00000fd Dec 09 '17 at 05:01
  • @MichaelPetch: Oh, I see. In either case, using the FPU (or whatever they call it now, I think you meant to say SSE-type registers, `XMMn`, etc.) would be much slower. If you think about it, the system time interrupts happen quite frequently, so that `halt` loop would almost never be triggered. For that, user-mode code would have to be calling `GetSystemTimeAsFileTime` in quick succession, which kinda defeats the purpose. That is why what they did is probably the most efficient implementation. – c00000fd Dec 09 '17 at 05:10
  • Sorry, meant to say `pause` loop in x86 code. – c00000fd Dec 09 '17 at 05:13
  • @MichaelPetch: Yes, like I said, I misspoke. (And can't edit it out now.) As for the `fild` and `fistp` instructions, I didn't even know that they still exist now. I thought Intel deprecated them in PIII or something. (But, you're right, we're going off topic here.) In either case, using the FPU or SSE registers would be overkill. If I were to approach this head-on I'd use a familiar `lock xchg` instruction to implement a critical section, but that would not be as efficient as what they have done. That is why my hat goes off to that obscure/unnamed Microsoft engineer who coded it. – c00000fd Dec 09 '17 at 05:20
  • You can't `xchg` because you don't have write access to that page. – Anders Dec 09 '17 at 11:49
  • @MichaelPetch: Why don't you put your comments into a separate answer? There's a lot of speculation here now, your answer with that article link was spot on. – c00000fd Dec 09 '17 at 21:23
  • After rereading Daniel's answer (I hadn't seen the modified one since last night), I think if he added the link it would suffice for the algorithm portion if he also explained the strict ordering required. Bee's answer better explains `pause`. No need for a 3rd answer, and since I'm not on SO for rep I don't mind others (or even you) answering the question. – Michael Petch Dec 09 '17 at 21:45

2 Answers


As Daniel's answer explains, this is simply a way of implementing an atomic 64-bit read using two 32-bit reads, for the case where atomic 64-bit operations aren't available or it isn't desirable to use them for some reason.

About the `pause` instruction specifically, it is there for the rare case where the user-land code reading the counters happens to hit the exact moment the kernel is updating them. At that point the reader wants to "wait" until the kernel is done updating, since it can't proceed until that happens, but immediately reading the values again might be counter-productive since the writing and reading code would be fighting over the cache line involved.

The pause instruction is useful here since it inserts a small delay, is a single instruction, and also hints to the CPU that we are in a type of spin-wait loop and not to further speculate memory reads. It also allows another thread running on the same core (on the other "hyperthread") more use of the core's execution resources. In this case, the other thread could very well be the kernel thread trying to complete the write, so this makes a lot of sense.
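
For illustration only, here is a minimal C sketch (not the actual Windows code) of this kind of read loop, with invented variable names (`g_high1`, `g_low`, `g_high2`) standing in for the three shared values and `_mm_pause()` on the rare retry path:

    #include <emmintrin.h>   /* _mm_pause() */
    #include <stdint.h>

    /* Written elsewhere by the producer (the kernel's timer interrupt in the
       real case); these names are made up for this sketch. */
    volatile uint32_t g_high1, g_low, g_high2;

    uint64_t read_shared_time(void)
    {
        for (;;) {
            /* Read in a fixed order; volatile plus x86's load ordering keep
               these three loads from being reordered with each other. */
            uint32_t high1 = g_high1;
            uint32_t low   = g_low;
            uint32_t high2 = g_high2;
            if (high1 == high2)             /* fast path: no update was in flight */
                return ((uint64_t)high1 << 32) | low;
            _mm_pause();                    /* rare path: back off, then retry */
        }
    }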

The code would work without the `pause`, so this is just a minor performance optimization. The vast, vast majority of the time this path isn't even taken, so the overall effect is probably microscopic1.


1 In the comments it was mentioned that the high part changes every 7 minutes, so the chance of hitting the race where we need to retry is really, really small. Let's conservatively assume that the gap between the two writes on the kernel side is 10 ns (about 30 cycles on a typical machine); I calculate the probability of any given read hitting the race at about 1 in 40 billion. Given that, you could make a reasonable argument that the pause is perhaps a pessimization here - any extra instructions in this slow path might not pay off in terms of code-size vs benefit (although here they seem to have put the slow path into its own cache line, so extra bytes could be "free" there).
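
(A rough sanity check of that figure, using the numbers above: a 10 ns vulnerable window once every ~7 minutes gives 10×10^-9 s / 420 s ≈ 2.4×10^-11, i.e. roughly 1 retry per 40 billion calls, assuming calls land at random times relative to the timer interrupt.)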

BeeOnRope
  • Thanks. But, where did that "one every 7 minutes" come from, did anyone test it? – c00000fd Dec 09 '17 at 21:14
  • Also what's important is the order at which those 3 `DWORD`s are updated and read. Otherwise this `trick` will not work. – c00000fd Dec 09 '17 at 21:24
  • That really doesn't matter here. – c00000fd Dec 09 '17 at 21:24
  • @c00000fd - I'm confused, what doesn't matter? – BeeOnRope Dec 09 '17 at 21:25
  • There are 10^7 100-nanosecond intervals per second. 2^32 / 10^7 / 60 ≈ 7 minutes? – Michael Petch Dec 09 '17 at 21:32
  • @c00000fd - it came from [this comment](https://stackoverflow.com/questions/47724837/what-is-the-purpose-of-the-pause-x86-instruction-and-the-loop-in-the-getsystemti/47733131?noredirect=1#comment82416860_47726757) and Michael shows the calculation in his comment immediately above. – BeeOnRope Dec 09 '17 at 21:33
  • @c00000fd - yes, the important part is the order of the reads and writes for this to work, but did I imply otherwise anywhere? I only put this answer to cover a bit more about the reason `pause` is there since the overall trick is explained well in the other answer. – BeeOnRope Dec 09 '17 at 21:34
  • I don't think the `pause` is a pessimization: without it, the CPU could recover from the branch miss and reuse the still-cached still-torn values to re-enter the loop if the invalidate from the writing thread hadn't arrived yet. And then maybe suffer a memory-ordering mis-speculation when it does arrive. (Hmm, actually probably not a memory-ordering mis-speculation; all 3 values are in the same line. But it definitely prevents this CPU from sending another read request while the IRQ handler is still trying to commit the 3rd store, if we flipped it to Shared after just the first.) – Peter Cordes Dec 10 '17 at 01:38
  • @PeterCordes - it's arguably a pessimization, not because it is worse when that path is executed (as I explain in this answer), but because if it only happens one out of every 40 billion calls it's hard to see even an extra byte of code being worth it in the code-size <-> speed tradeoff. Also, when `pause` increased in latency by an order or two of magnitude, it isn't even clear if it's worth it on the slow path, since the expected time for the "lock" to become available is very short. – BeeOnRope Dec 10 '17 at 04:48
  • ... although they did put the slow path on its own cache line in the above listing, so maybe it doesn't cost anything. – BeeOnRope Dec 10 '17 at 16:41

The x64 implementation reads the 64-bit value atomically (if it is correctly aligned).

Classic x86 instructions can't do this.

Let's assume we have these values (example values, not real ones):

low part:  0xffffffff
high part: 0x00000001

Because reading the two 32-bit values can be split by the timer interrupt, there is a small possibility of reading a partly old and partly new value. If after the interrupt we have:

low part:  0x00001111
high part: 0x00000002

We could end up with wrong values:

low part:  0x00001111
high part: 0x00000001 <- WRONG

It looks like the timer handler writes the high part to two memory locations. This allows user code to detect an overflow from the low part into the high part and re-read the time. Thanks to this, there is no need to switch into kernel mode; the time can be read entirely in user mode.
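
To make the ordering concrete, here is a rough C sketch of the scheme as described above and in the linked article. The names are invented stand-ins (the real fields in KUSER_SHARED_DATA are `LowPart`, `High1Time` and `High2Time`), the exact write order used by the kernel may differ, and a real implementation would use proper atomics rather than plain `volatile`:

    #include <stdint.h>

    /* Invented stand-ins for the three shared 32-bit values. */
    volatile uint32_t g_low, g_high1, g_high2;

    /* Producer (think: the timer interrupt). The stores must reach memory in
       this order; x86 doesn't reorder stores with stores, and volatile keeps
       the compiler from reordering them here. */
    void publish_time(uint64_t t)
    {
        g_high2 = (uint32_t)(t >> 32);  /* 1. second copy of the high part */
        g_low   = (uint32_t)t;          /* 2. low part                     */
        g_high1 = (uint32_t)(t >> 32);  /* 3. first copy of the high part  */
    }

    /* Consumer: reads in the opposite order and only accepts the value when
       the two copies of the high part agree; otherwise a carry into the high
       part was in progress and it retries (this is where the x86 version of
       GetSystemTimeAsFileTime spins with `pause`). */
    uint64_t read_time(void)
    {
        uint32_t high1, low, high2;
        do {
            high1 = g_high1;
            low   = g_low;
            high2 = g_high2;
        } while (high1 != high2);
        return ((uint64_t)high1 << 32) | low;
    }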

It could be done with more advanced SSE instructions, but there is probably no point in changing working code.
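
For what it's worth, a sketch of that SSE2 alternative for 32-bit code: a single aligned 8-byte load, which (per the comments below) is atomic on the relevant CPUs when the data is naturally aligned. The variable name is invented, and compiler-visibility concerns (volatile/atomics) are ignored here to keep it short:

    #include <emmintrin.h>   /* SSE2: _mm_loadl_epi64 / _mm_storel_epi64 */
    #include <stdalign.h>
    #include <stdint.h>

    /* Invented stand-in for the shared 64-bit value; the 8-byte alignment is
       what makes the single movq load atomic. */
    alignas(8) uint64_t g_time64;

    uint64_t read_time64_sse2(void)
    {
        /* movq: loads all 64 bits in one instruction, even in 32-bit code */
        __m128i v = _mm_loadl_epi64((const __m128i *)&g_time64);
        uint64_t out;
        _mm_storel_epi64((__m128i *)&out, v);   /* copy the low qword to a local */
        return out;
    }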

PAUSE—Spin Loop Hint Instruction
Improves the performance of spin-wait loops. When executing a "spin-wait loop," processors will suffer a severe performance penalty when exiting the loop because it detects a possible memory order violation. The PAUSE instruction provides a hint to the processor that the code sequence is a spin-wait loop. The processor uses this hint to avoid the memory order violation in most situations, which greatly improves processor performance. For this reason, it is recommended that a PAUSE instruction be placed in all spin-wait loops. An additional function of the PAUSE instruction is to reduce the power consumed by a processor while executing a spin loop.

Intel® 64 and IA-32 Architectures Software Developer’s Manual

Daniel Sęk
  • 32-bit x86 [can (on Pentium or later) atomically load/store 64 bits](https://stackoverflow.com/questions/36624881/why-is-integer-assignment-on-a-naturally-aligned-variable-atomic-on-x86), but only with MMX / SSE2 `movq` (or SSE1 `movlps`) or x87 `fild`/`fistp`. `gcc -m32` uses this for `std::atomic`. Or with `lock cmpxchg8b`, which is more expensive than doing three 32-bit stores or loads. – Peter Cordes Dec 09 '17 at 09:04
  • `pause` is not a write barrier, and that's not why it's there. It's only there for performance (avoiding memory ordering mis-speculation on leaving the loop) and for power saving and to avoid stealing cycles from the other hyperthread while spinning. – Peter Cordes Dec 09 '17 at 09:07
  • Is the writing side in the kernel? That would explain why they only use integer instructions; saving/restoring user-space FPU state takes extra work. With infrequent writes (compared to CPU speed) the read side very rarely has to retry, and then only under extraordinary circumstances would it have to retry more than once. – Peter Cordes Dec 09 '17 at 11:30
  • I think the high part changes about once every 7 minutes in normal circumstances (it could also change as a result of a manual change of the system time or synchronization with an external source). That is a really slim chance. – Daniel Sęk Dec 09 '17 at 11:44
  • The writing side must be in the kernel. In user mode you don't know when interrupts are handled, so you have no chance to update the time values. Theoretically you could keep a separate thread that somehow gets the time from the kernel, or use timer procedures, but that would require each application to track time individually. – Daniel Sęk Dec 09 '17 at 11:59
  • I didn't know what time value this function was reading. The same lockless algorithm would work for a producer in another user-space thread. I guessed but wasn't sure it was the system-wide timestamp updated by the kernel's interrupt handler. (@c00000fd: so a critical-section with a lock isn't even an option here.) BTW, if 32-bit user-space code knows it's running under a 64-bit kernel, then it can trust that all 64 bits are written atomically (unlike with a 32-bit kernel), and it could use a 64-bit SSE2 atomic load. (A 64-bit kernel implies SSE2 support) – Peter Cordes Dec 09 '17 at 15:30
  • 1
    FWIW the generalization of this technique of lock-free reads is called a [seqlock](https://en.wikipedia.org/wiki/Seqlock). The article indicates that it is somehow Linux-specific, but it's not (perhaps that's where the name first came from, but I'm not even sure that's the case). It works only on platforms with "strong enough" memory ordering, in particular that reads don't reorder with reads (for the read side: the write side requires no store-store reordering to be lock free, but this is usually less critical and often you lock on the write anyways). – BeeOnRope Dec 09 '17 at 16:13
  • 1
    @BeeOnRope: Very interesting. Thanks for sharing. Now we know the name for the technique and possibly the engineer who came up with it. I think this statement applies here: `"The first implementation was in the x86-64 time code where it was needed to synchronize with user space where it was not possible to use a real lock"`. I'm not sure if Microsoft swiped it from a *NIX system, or not. It's just some interesting way to implement the lock (in my opinion) in those special situations. – c00000fd Dec 09 '17 at 20:20
  • 1
    Also it maybe time to update that Wikipedia article, since it's clearly used in Windows as well. (Although this one is the latest version of Win10. Which does support partial *NIX native subsystem.) I'm not sure if it was the same for earlier versions of Windows. If time permits, I'll check it later today. – c00000fd Dec 09 '17 at 20:22
  • @c00000fd - yeah, interesting that the original use (per the wikipedia article) was exactly the same. Note that in the "general case" use, the sequence numbers are separate from the data, but in this special case the two values are effectively used as both the data _and_ the sequence count (this is possible since the "time" value already has the characteristic needed to be used as a sequence count). BTW I'd be skeptical of any claim that one specific engineer uniquely came up with this idea. I've seen the trick a few times in various places and I doubt they were all "copied". – BeeOnRope Dec 09 '17 at 20:26
  • Oddly you'll find very little about this "lock" on the internets, even though it's quite powerful. I think Java's [StampedLock](https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/locks/StampedLock.html) is the same thing, and kind of indicates the name hasn't been standardized. The word "lock" is a bit misleading, it's more like a special type of atomic concurrent array with lock-free reads. Probably part of the reason for the unpopularity is that you can't build this lock with the "standard" memory models for C++ or Java based on acquire-release. – BeeOnRope Dec 09 '17 at 20:26
  • @BeeOnRope: Sure. I was just generalizing. I personally have never seen it used (or maybe I wasn't paying attention.) And yes, it's not a lock-lock, more of a `synchronization trick` that won't apply everywhere. – c00000fd Dec 09 '17 at 20:40
  • @BeeOnRope: It's lockless, but not lock-free in the computer-science jargon sense. If the producer sleeps at the wrong time, the value will be unreadable until it wakes. (Not an issue when the writer is a kernel interrupt handler; it can't sleep.) – Peter Cordes Dec 09 '17 at 21:39
  • @PeterCordes - right, but it's lock-free (actually wait-free) from the point of view of the _readers with respect to other readers_, which I guess is what matters, performance-wise, for this kind of scenario. No reader can affect another reader. You are right though that it is not lock free overall: writers can block writers and writers can block readers. It could potentially be "lock-free" if writers could do atomic 64-bit writes. That makes it interesting for platforms that have 64-bit writers which aren't guaranteed to be atomic, but are on most/all platforms. – BeeOnRope Dec 09 '17 at 21:41
  • Then you have a structure that is _lock-free_ in practice, but still correct even on some hypothetical platform where 64-bit writes aren't atomic (see for example the case of non-atomic 128-bit writes on some AMD archs for where this kind of behavior could be useful). – BeeOnRope Dec 09 '17 at 21:43
  • @BeeOnRope : I also think it is important that the mechanism used doesn't actually make the interrupt routine block unnecessarily if it doesn't have to. – Michael Petch Dec 09 '17 at 21:51
  • 1
    @MichaelPetch - exactly. I think this "trick" was used here and in the Linux timekeeping case not so much because someone decided the userspace read path should be _fast_ but because having locks shared between user and kernel space (especially when accessed in an interrupt) is a definite no-no, and the read-path speed is a nice side effect. Still, the trick is still pretty useful just for the performance aspects in cases where the reader(s) and writer(s) are both in userland. – BeeOnRope Dec 09 '17 at 21:53