Create Futex function using nop instruction or even create own futex

Question

I have a question about making a thread lock method ... I'm not talking about pthread lock or any other lock ... I'm talking about atomic operations and finally, futex (on Linux (we are talking about Linux only)) ... I know there is a function named futex and it takes a UINT32 memory address and performs wait and wake without and its waiting is without involving CPU ... In fact, I know there is a different between a loop + atomic operations and atoimc, then futex ... the first one involving CPU (CPU usage) and the second one doesn't ... I thought about futex and I know there is an instruction in Assembly named nop which has no CPU involving, but it comes with 1 latency so in my mind I thought maybe the futex is something like about a loop with for example, 2000 nop operations and at the end, it just jumps back to the top and checks if our condition is right ... and if it doesn't, runs that 2000 nop operations again and this continues, until the condition become right ... Am I right about futex? Or if I'm wrong about how futex works, is it possible to create a thread-lock function with this behaviour ? Will be a good thing ?

UPDATE

I created a Linux executable ... a loop with 30000 nop inside it and a condition check at the top of the loop ... but when i execute the program, my CPU usage went to 99% !!!

@ecm i read this some where : PAUSE may actually stop CPU for some time to save power. Older CPUs decode it as REP NOP, so you don't have to check if its supported. Older CPUs will simply do nothing (NOP) as fast as possible. — HelloMachine, May 28 '22 at 10:39
Futex is a _system_ call. When you call it with `FUTEX_WAIT`, you are asking the OS scheduler to swap the calling thread out and, to keep it swapped out until some other thread makes a matching call with `FUTEX_WAKE`. There is no way that un-privileged code can interact with the kernel's data structures like that. Only a system call can do. Also, you should think of NOP as an instruction that _uses_ the CPU to do nothing. — Solomon Slow, May 28 '22 at 12:10
FYI: One use for a documented NOP instruction, more commonly used back in olden days, was to allow binary code to be "patched." A string of NOPs could be overwritten with instructions that actually do something, or a string of other instructions could be overwritten with NOPs. — Solomon Slow, May 28 '22 at 12:18
@ecm AFAIK `pause` does not really idle the CPU in practice. At least not on Intel processors like mine (i5-9600KF). The turbo is still active (highest frequency) and the CPU is fully busy waiting. I could reproduce this on other older platforms including Intel server processors. However, It can clearly save power as indicated by HelloMachine. On my machine, it reduce the power usage and results in a much smaller heat dissipation (cut by half). I wonder why Intel did not choose to reduce the frequency or use C-states. — Jérôme Richard, May 28 '22 at 13:25
Re, "...`pause`..." Most CPUs have a privileged instruction that puts the CPU into a low-power state where it stops executing until an interrupt occurs. But, what "low-power" means depends on the design. In tiny, fully static CPUs, a wait-for-interrupt instruction can stop almost everything, and the power consumption drops down to microwatts. Bigger CPUs, in workstations, servers, and mobile devices, are at least partly based on [_dynamic_ logic](https://en.wikipedia.org/wiki/Dynamic_logic_(digital_electronics)), and cannot be _fully_ stopped without losing data and/or state. — Solomon Slow, May 28 '22 at 14:44
Some CPUs have multiple levels of "low power" mode. The OS is responsible for saving state that will be lost when the CPU goes into the deeper levels, and then restoring the state when the CPU is woken up again. — Solomon Slow, May 28 '22 at 14:48
@JérômeRichard: The use-case for `pause` is for very *short* spin-waits, to let the other logical core have more front-end/back-end resources. (Skylake bumped it up from ~5 cycle front-end stall to ~100 cycle). Also avoiding memory-order mis-speculation when a load result changes in a spin-wait loop. It's not intended for longer low-power sleeps, and changing frequency takes about 20k cycles on Skylake ([Lost Cycles on Intel? An inconsistency between rdtsc and CPU\_CLK\_UNHALTED.REF\_TSC](https://stackoverflow.com/q/45472147)), at least if voltage changes, too. — Peter Cordes, May 28 '22 at 17:28
@JérômeRichard: You normally would *not* want to trigger a drop in CPU frequency on the first execution of a spin-wait loop, since you'd likely end up waiting way longer than you wanted. (Perhaps a total of about the inter-core latency to optimize for another core having taken a lock and released it almost right away, checking occasionally during that.) For your use-case of lower-power sleep in user-space instead of giving the CPU back to the OS, Intel introduced `tpause` (timed pause) and `umonitor` / `umwait` with the WAITPKG extension, first in Tremont, probably in big-core CPUs by now. — Peter Cordes, May 28 '22 at 17:34
@JérômeRichard: Brendan says the same thing on: [Why have AMD-CPUs such a silly PAUSE-timing](https://stackoverflow.com/q/69331846) — Peter Cordes, May 28 '22 at 17:35
@HelloMachine: `nop` doesn't have any latency; it has no inputs or outputs. It only has a throughput, ranging from 4/clock to 6/clock depending on the CPU. https://uops.info/. And yes, very old CPUs `pause` as `nop`, but that doesn't make actual `nop` a good choice. New CPUs still run `nop` as `nop`, only `pause` as `pause`. — Peter Cordes, May 28 '22 at 17:37
@PeterCordes So my idea is correct for lock implemention (making 2000 `nop` or `pause` to make delay (without CPU usage)) ? — HelloMachine, May 28 '22 at 18:00
No, [Solomon Slow's first comment](https://stackoverflow.com/questions/72414695/create-futex-function-using-nop-instruction-or-even-create-own-futex#comment127927373_72414695) already explained why that's fundamentally impossible. If you want to yield the CPU back to the OS (so your process isn't "using" CPU time), you need to make a system call. — Peter Cordes, May 28 '22 at 18:02
@PeterCordes for example, in a database, there are 1000 writes in same time on one record ... Are databases uses syscalls like futex too for that record ? or they just use spin-lock (2 rounds `lock cmpxchg` + `pause`) because they know that write will be quick ? i want to know that i can trust spin-lock for small write or not ... on a record with 1000 writes in same time ... or even they use both methods (spin-lock, pause + futex) ? — HelloMachine, May 28 '22 at 18:21
@HelloMachine: A common strategy for locking is to spin a few times in user-space to see if the lock becomes available soon (checking **read only** with `pause` and `cmp`/`jcc` to reduce contention, like [this toy example](https://stackoverflow.com/questions/37241553/locks-around-memory-manipulation-via-inline-assembly/37246263#37246263)). But after a few (dozen or hundred) spins then use a `futex` system call to let the OS schedule something else on that core, and wake this thread when the lock is available. The toy example I linked does *not* do that fallback, just spins forever. — Peter Cordes, May 28 '22 at 18:33
Of course, if there are 1000 writes on a single record as part of the same transaction, it would be worthwhile for a database to try to batch that up into one actual update to the record. Unless you're running on a 1000 core machine, there can't actually be 1000 simultaneous attempts to take the lock. If you had 1000 software threads, at least some of them wouldn't be running at the moment when other threads all tried to take the lock. — Peter Cordes, May 28 '22 at 18:36
@HelloMachine: BTW, I'm not sure if spinning in user-space before using a system call is always the best strategy. IIRC, the standard library on some system (possibly MacOS) just calls into the kernel right away. But that's a tuning choice that can differ by use-case. A database with typically-short transactions and typically low contention between threads (fine grained locking and usually not touching the same record) might want to spin, but if it's common for many threads to be waiting for the same lock, probably best not to have them all wasting CPU time "optimistically" hoping to win. — Peter Cordes, May 28 '22 at 18:39
@PeterCordes on MacOS, there is a definition `LOCK_SNOOP_SPINS` ... it's for the number of spin-lock loop ... it's 768 `(0x300)` ... so in MacOS, the spin-lock loop size is 768 (times) ... so i think i must choose something like 512 times ... right ? — HelloMachine, May 28 '22 at 18:50
The right tuning choice depends on your use-case and maybe what workload and kind of machine you're optimizing for. 512 `pause` iterations sounds like a lot for modern Intel where it pauses about 100 cycles, but reasonable for earlier CPUs (and current AMD) where it's more like 5 cycles of front-end stall per `pause`. — Peter Cordes, May 28 '22 at 18:53

Create Futex function using nop instruction or even create own futex

0 Answers0