1

I've developed a monitor object like Java's for C++, with some improvements. The major improvement is that there is a spin loop not only for locking and unlocking but also for waiting on an event. In that case you don't lock the mutex yourself but supply a predicate to a wait_poll function; the code repeatedly tries to lock the mutex by polling, and whenever it acquires the mutex it calls the predicate, which returns (or moves) a pair of a bool and the result type.
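A minimal sketch of that wait_poll idea (simplified here to a bare atomic_flag spin-lock; the real monitor object has more state, and the names in this sketch are my illustration):

```cpp
#include <atomic>
#include <utility>
#include <immintrin.h>

// Simplified sketch: spin-lock the mutex by polling, call the predicate
// under the lock, and repeat until the predicate reports success. The
// predicate returns a pair of (done, result); the result is moved out
// once done is true.
template<typename Result, typename Pred>
Result wait_poll( std::atomic_flag &mtx, Pred pred )
{
    for( ;; )
    {
        while( mtx.test_and_set( std::memory_order_acquire ) )
            _mm_pause();                 // spin until the lock is free
        std::pair<bool, Result> r = pred();
        mtx.clear( std::memory_order_release );
        if( r.first )
            return std::move( r.second );
        _mm_pause();                     // condition not met yet; poll again
    }
}
```
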

Waiting for a semaphore or an event object (Win32) in the kernel can easily take from 1,000 to 10,000 clock cycles, even when the call returns immediately because the semaphore or event was already signaled. So there has to be a spin count with a reasonable relationship to this waiting interval, e.g. spinning for one tenth of the minimum time spent in the kernel.

For my monitor object I've taken the spin-count recalculation algorithm from glibc, and I'm also using the PAUSE instruction. But I found that on my CPU (TR 3900X) the PAUSE instruction is too fast: it takes about 0.78 ns on average. On Intel CPUs it's much more reasonable, at about 30 ns.
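The glibc algorithm works roughly like this (my paraphrase of the adaptive-mutex spin tuning in nptl; the cap value and the function shape here are simplifications, not glibc's actual code):

```cpp
#include <atomic>
#include <algorithm>
#include <immintrin.h>

// Rough paraphrase of glibc's adaptive-mutex spin tuning: allow up to
// 2 * estimate + 10 spins (capped), and afterwards pull the estimate
// towards the number of spins actually needed with a /8 moving average.
constexpr int MAX_SPIN_COUNT = 100;  // placeholder for glibc's cap

bool try_lock_adaptive( std::atomic<int> &mtx, int &spinEstimate )
{
    int maxCnt = std::min( MAX_SPIN_COUNT, spinEstimate * 2 + 10 );
    int cnt = 0;
    while( mtx.exchange( 1, std::memory_order_acquire ) != 0 )
    {
        if( ++cnt > maxCnt )
            return false;                // give up, fall back to the kernel
        _mm_pause();
    }
    spinEstimate += (cnt - spinEstimate) / 8;
    return true;
}
```
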

This is the code:

#include <iostream>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <immintrin.h>

using namespace std;
using namespace chrono;

int main( int argc, char **argv )
{
    static uint64_t const PAUSE_ROUNDS = 1'000'000'000;
    auto start = high_resolution_clock::now();
    for( uint64_t i = PAUSE_ROUNDS; i; --i )
        _mm_pause();
    // average nanoseconds per PAUSE across all iterations
    double ns = duration_cast<nanoseconds>( high_resolution_clock::now() - start ).count() / (double)PAUSE_ROUNDS;
    cout << ns << endl;
}

Why has AMD chosen such a silly PAUSE timing? PAUSE is for spin-wait loops and should closely match the time it takes for a cache line's contents to bounce to a different core and back.

Peter Cordes
Bonita Montero
  • PAUSE times on Intel CPUs vary wildly based on the generation btw (with more than an order of magnitude difference), we can't just say it's 30ns – harold Sep 26 '21 at 04:26
  • Intel before Skylake was only about 5 clock cycles; they raised it to 100 in SKL after their experiments found that led to better throughput. (Especially considering competition with another hyperthread.) – Peter Cordes Sep 26 '21 at 05:13
  • *should closely match the time it takes for a cacheline-content to flip to a different core and back.* - You might want to save some power (and avoid memory_order machine nukes) without waiting too long by on average half that interval (if you assume timing on losing the cache line is random, which of course it might not be with high contention). So shorter pause makes sense. Not this short, though; only a couple clock cycles seems a bit silly I'd agree. – Peter Cordes Sep 26 '21 at 05:16
  • I'd guess that you would have to ask AMD to explain any decisions they have made. SO is not going to be able to answer that question for them. – Ken White Sep 26 '21 at 05:42

1 Answer

6

But I found that on my CPU (TR 3900X) the PAUSE instruction is too fast: it takes about 0.78 ns on average. On Intel CPUs it's much more reasonable, at about 30 ns.

The pause instruction has never had anything to do with time and is not intended to be used as a time delay.

What pause is for is to stop the CPU from wasting its resources by (speculatively) executing many iterations of a loop in parallel. That is especially useful with hyper-threading, where the other logical processor in the core can use those resources, but it also improves the time it takes to exit the loop when the condition changes (because you don't have "N iterations" of instructions queued up from before the condition changed).
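As an illustration (my sketch, not code from the question): the canonical place for pause is between re-reads of the lock word in a spin-wait loop, so the CPU doesn't speculate many stale iterations ahead while it waits:

```cpp
#include <atomic>
#include <immintrin.h>

// Test-and-test-and-set spin lock: only attempt the atomic exchange when
// a plain read says the lock looks free; pause between the re-reads.
void spin_lock( std::atomic<bool> &locked )
{
    for( ;; )
    {
        if( !locked.exchange( true, std::memory_order_acquire ) )
            return;                       // got the lock
        while( locked.load( std::memory_order_relaxed ) )
            _mm_pause();                  // yield pipeline resources while waiting
    }
}

void spin_unlock( std::atomic<bool> &locked )
{
    locked.store( false, std::memory_order_release );
}
```
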

Given this: for an extremely complex CPU that might have 200 instructions in flight at the same time, pause itself might execute instantly but cause a "200 cycle long" pipeline bubble in its wake; and for an extremely simple CPU ("in order" with no speculative execution) pause may/should do literally nothing (treated as a nop).

PAUSE is for spin-wait loops and should closely match the time it takes for a cache line's contents to bounce to a different core and back.

No. Assume the cache line is in the "modified" state in a different CPU's cache and the instruction after the pause is something like "cmp [lock],0" that causes the CPU to try to put the cache line into the "shared" state. How long should the CPU waste time doing nothing for no reason after the pause but before trying to put the cache line into the "shared" state?

Note: If you actually need a tiny time delay, then you'd want to look at the umwait instruction. You don't need a time delay though - you want a time-out (e.g. "spin with pause until rdtsc says a certain amount of time has passed"). For this I'd be tempted to break it into an inner loop that does "pause and check the condition N times" and an outer loop that does "retry the inner loop if the time limit hasn't expired yet".
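That inner/outer structure might look like this (a sketch: the name spin_until, the inner count of 64, and the atomic-flag condition are all my assumptions; reading the TSC only once per inner loop keeps the rdtsc overhead out of the hot path):

```cpp
#include <atomic>
#include <cstdint>
#include <immintrin.h>   // _mm_pause
#include <x86intrin.h>   // __rdtsc (GCC/Clang; MSVC declares it in <intrin.h>)

// Spin until the flag becomes true or roughly tscTimeout reference
// cycles have elapsed. Returns true if the condition was seen, false
// on time-out.
bool spin_until( std::atomic<bool> &flag, uint64_t tscTimeout )
{
    constexpr unsigned INNER = 64;       // condition checks per TSC read
    uint64_t deadline = __rdtsc() + tscTimeout;
    do
    {
        for( unsigned i = INNER; i; --i )
        {
            if( flag.load( std::memory_order_acquire ) )
                return true;
            _mm_pause();
        }
    } while( __rdtsc() < deadline );
    return false;                        // timed out
}
```
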

Brendan
  • If you have `umwait`, other new waitpkg instructions including [`tpause`](https://www.felixcloutier.com/x86/tpause) are available (pause until a given TSC deadline, so you'd set up for it by running `rdtsc` and adding). If you want to be checking the spin condition at intervals during that deadline, you could maybe add increments in the loop. (Although if scheduling means you start a `tpause` after the TSC has passed your deadline, the wait time would be decades(?) until the 64-bit TSC wraps back, limited only by the OS's setting for the MSR `IA32_UMWAIT_CONTROL[31:2]`.) – Peter Cordes Sep 26 '21 at 13:11