
The details of the test program

I tested this in user mode using examples from the Intel SDM. The details of the test are as follows:

  • Provide a global lock in System V shared memory so that multiple processes can use it; a sketch of one possible setup follows
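A minimal sketch of this setup (the ftok key, permission bits, and function name are illustrative assumptions, not the exact test code):

#include <sys/ipc.h>
#include <sys/shm.h>

static long *global_lock; /* shared 64-bit lock word: 0 = free, 1 = held */

void init_global_lock(void)
{
        /* one lock word in System V shared memory, shared by all test processes */
        int shmid = shmget(ftok(".", 'L'), sizeof(long), IPC_CREAT | 0666);
        global_lock = (long *)shmat(shmid, NULL, 0);
        /* error handling omitted for brevity */
}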
  • Use the assembly example in the Intel SDM as the main lock-acquisition code:
int get_lock()
{
        int locked_val = 1;
        __asm__(
        "Spin_Lock: \n"
        "cmpq $0, (%[global_lock_addr]) \n"
        "je  Get_Lock \n"
        FILL_INST "\n" //FILL_INST is "nop" or "pause"
        "jmp Spin_Lock \n"
        "Get_Lock: \n"
        "mov $1, %[locked_val] \n"
        "xchg %[locked_val], (%[global_lock_addr]) \n"
        "cmp $0, %[locked_val]\n"
        "jne Spin_Lock \n"
        "Get_Lock_Success:"
        ::
        [global_lock_addr] "r" (global_lock),
        [locked_val] "r" (locked_val)
        :
        );
        return 0;
}
  • The implementation of the lock release is as follows:
int release_lock()
{
        int unlock_val = 0;
        __asm("xchg %[unlock_val], (%[global_lock_addr])"::
                        [global_lock_addr] "r" (global_lock),
                        [unlock_val] "r" (unlock_val)
                        :);
}
  • The process exits after obtaining and releasing the lock a certain number of times:
int main()
{
    ...
    printf("exec lock \n");
    for (i = 0; i < LOOP_TIME; i++) {
        get_lock();
        release_lock();
    }
    ...
}
  • Two executable programs can be built using this Makefile:
    • spinlock_pause : FILL_INST is defined as the string "pause"
    • spinlock_nopause : FILL_INST is defined as the string "nop"
  • The exec.sh script handles compilation and running, provided that the CUR_EXEC_NUM environment variable is defined

This variable indicates how many processes will be started.
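For example (assuming exec.sh reads the variable from the environment):

export CUR_EXEC_NUM=40
./exec.sh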

  • Use the perf command to collect the machine_clears.memory_ordering and inst_retired.any events while running the programs (an example invocation is sketched after this list)
  • Save the test results in the ./log directory
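The perf invocation has roughly this shape (a sketch; the exact options used by exec.sh may differ):

perf stat -e machine_clears.memory_ordering:u,inst_retired.any:u ./spinlock_pause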

The URL of this test

Test results

Test environment

  • Intel Xeon E5-2666 v3 (x86)
  • Core count: 20, threads per core: 2
  • export CUR_EXEC_NUM=40 (one program per hardware thread)

Results

  • spinlock_pause
get lock time 5000000
 Performance counter stats for '././spinlock_pause':

         5,228,707      machine_clears.memory_ordering:u
     5,018,001,922      inst_retired.any:u

      78.053822086 seconds time elapsed

      76.887470000 seconds user
       0.022657000 seconds sys
  • spinlock_nopause
get lock time 5000000
 Performance counter stats for '././spinlock_nopause':

        74,524,989      machine_clears.memory_ordering:u
    21,212,346,839      inst_retired.any:u

      73.076739387 seconds time elapsed

      72.129267000 seconds user
       0.010899000 seconds sys

From the above results, it can be seen that the pause instruction reduces the machine_clears.memory_ordering event count (I don't know whether this event corresponds exactly to memory-order violations): roughly one machine clear per ~960 retired instructions with pause, versus one per ~285 without it. But because the pause instruction takes more cycles, it also reduces the number of instructions executed (inst_retired.any), so I don't think this test can show that pause avoids machine_clears.memory_ordering.

On the other hand, the total execution time of programs using the pause instruction is NOT LESS than that of programs using the nop instruction.

Based on the above test results, I would like to ask whether there are any defects in my testing, or how to explain these results.

  • 1
    Those results are probably realistic. But your inline asm is very unsafe, e.g. modifying a register you declared as a `"r"` read-only input, not a `"+r"` output/input operand. And your asm modifies memory you haven't told the compiler about (`*global_lock`, when you've only told it the pointer value is an input operand: [How can I indicate that the memory \*pointed\* to by an inline ASM argument may be used?](https://stackoverflow.com/q/56432259)). – Peter Cordes Aug 21 '23 at 05:06
  • 2
    Either use `"+m"(*global_lock)` to let the compiler pick an addressing mode, or add a `"memory"` clobber and make your asm `volatile` (which needs to be explicit when you have at least one output operand). Also, `int locked_val` doesn't need to be an input; it looks like it could just be a dummy output operand to let the compiler pick a spare register. You already `mov $1, %[locked_val]` instead of reading the `int locked_val = 1;` initialized value you asked the compiler for. – Peter Cordes Aug 21 '23 at 05:08
  • 2
    Anyway, I'm not surprised there are still memory-ordering machine clears; your `get_lock` starts out with a read-only access to the lock and branches on it, instead of starting with an `xchg` attempt. So there's a non-`lock`ed access before any `pause`. That might perhaps be good in some cases like very high contention (something real use-cases would want to avoid), but has other downsides, too, like first making a MESI share request, and then having to do a MESI RFO (Read For Ownership) before it can write the line. If the first access was `xchg`, the uncontended case would only do an RFO. – Peter Cordes Aug 21 '23 at 05:11
  • 3
    See my example code in [Locks around memory manipulation via inline assembly](https://stackoverflow.com/a/37246263) (NASM syntax, despite the question asking for inline asm. You might want to make your asm code into functions that take a pointer arg, to avoid the challenges of inline asm.) (Does Intel really recommend this code, starting with a read-only access and unlocking with `xchg`? Unlock only needs to be `mov` to get memory_order_release; it doesn't need to be an atomic RMW because this thread owns the lock; we know the current value is `1` and that we're setting it to `0`.) – Peter Cordes Aug 21 '23 at 05:18
  • 1
    Also re: explaining your data: it's expected that the number of machine clears per instruction will be different with vs. without `pause`, since many spin iterations can happen without a `pause` before the load finally arrives and a memory-ordering violation is detected. (And yes, that perf event counts pipeline nukes due to memory-ordering mis-speculation.) The remaining machine-clears in the version with `pause` might be mostly from the initial read on the first attempt to take the lock; I'd be curious to see how it goes starting with an `xchg` attempt, with pure-load only after `pause`. – Peter Cordes Aug 21 '23 at 05:22
  • 2
    *On the other hand, the total execution time of programs using the pause instruction is NOT LESS than that of programs using the nop instruction.* - your program doesn't do anything *except* spin-wait and contend for locks. Part of the idea of Skylake's change to `pause` (making it block 100 cycles instead of 5) is to let the other hyperthread get more useful work done while we're waiting for a lock, even if that slightly delays us from noticing the lock is available. But here there is no useful work; all threads spend most of their time spin-waiting. This is not the case Intel optimized for. – Peter Cordes Aug 21 '23 at 20:44

2 Answers

1

Thank you for your comments. From them I learned a lot about the details of inline assembly (though there are still some points I haven't understood and will keep studying), and I made the following modifications to the previous code:

  • get_lock()
    • modify "r"(global_lock) input operand to "+m" (* global_lock)
    • drop the locked_val spare register, let %rax do that job, and list it in the clobbers
 int get_lock()
 {
-       int locked_val = 1;
        __asm__(
        "Spin_Lock: \n"
-       "cmpq $0, (%[global_lock_addr]) \n"
+       "cmpq $0, %[global_lock] \n"
        "je  Get_Lock \n"
        FILL_INST "\n"
        "jmp Spin_Lock \n"
        "Get_Lock: \n"
-       "mov $1, %[locked_val] \n"
-       "xchg %[locked_val], (%[global_lock_addr]) \n" //, %[locked_val] \n"
-       "cmp $0, %[locked_val]\n"
+       "mov $1, %%rax \n"
+       "xchg %%rax, %[global_lock] \n" //, %[locked_val] \n"
+       "cmp $0, %%rax\n"
        "jne Spin_Lock \n"
        "Get_Lock_Success:"
-       ::
-       [global_lock_addr] "r" (global_lock),
-       [locked_val] "r" (locked_val)
        :
+       [global_lock] "+m" (*global_lock)
+       :
+       :
+       "%rax"
        );
        return 0;
 }
  • release_lock(): You are right. After obtaining the lock, only this process performs writes to it. In fact, the release-lock code is not provided in the Intel SDM.

NOTE: But I don't know whether a lock-prefixed instruction causes other CPUs to observe the memory change earlier when they perform loads, and there seems to be no pure-load instruction with a lock prefix in the x86 instruction set. In other words, after a store instruction with the lock prefix executes, must the modified value already be in memory (cache), rather than sitting only in the store buffer?

I lack knowledge in this area and need to read the relevant sections of the Intel SDM.

Modify release_lock() as follows:
 int release_lock()
 {
-       int unlock_val = 0;
-       __asm("xchg %[unlock_val], (%[global_lock_addr])"::
-                       [global_lock_addr] "r" (global_lock),
-                       [unlock_val] "r" (unlock_val)
+       __asm__("movq $0, %[global_lock]":
+                       [global_lock] "+m" (*global_lock)
+                       :
                        :);
 }
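(As the follow-up comments below point out, these asm statements also need a "memory" clobber so the compiler cannot reorder critical-section accesses across the lock operations at compile time. A sketch of release_lock with that applied, assuming nothing else changes:)

 int release_lock()
 {
         __asm__ __volatile__("movq $0, %[global_lock]" :
                         [global_lock] "+m" (*global_lock)
                         :
                         : "memory");
         return 0;
 }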



The remaining machine-clears in the version with pause might be mostly from the initial read on the first attempt to take the lock; I'd be curious to see how it goes starting with an xchg attempt, with pure-load only after pause.

Sorry, I didn't understand the meaning of this part. Is it necessary to execute the xchg instruction first, before a pure-load instruction such as mov? For example:

 "Spin_Lock: \n"
 "mov $1, %%rax \n"
 "xchg %%rax, %[global_lock] \n" 
 "cmpq $0, %%rax \n"
 "je  Get_Lock \n"
 "pause \n"
 "jmp Spin_Lock \n"
 "Get_Lock: \n"
 "mov %[global_lock], %%rax \n" //pure-load after xchg
 "Get_Lock_Success:"

But this seems meaningless. I am not sure whether atomic RMW instructions like xchg can cause memory-order violations, but the subsequent pure-load instruction (mov) will no longer cause one, because no other CPU stores to that memory location before the lock is released.
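(From the follow-up comments below, my later understanding is that the suggested structure is roughly the sketch below: the first access is the xchg attempt, and the pure load runs only in the wait loop after a failed attempt. This differs from what I actually tested next, which spins on xchg alone. The label names here are mine:)

 "Try_Lock: \n"
 "mov $1, %%rax \n"
 "xchg %%rax, %[global_lock] \n"
 "test %%rax, %%rax \n"
 "je  Get_Lock_Success \n"
 "Spin_Wait: \n"
 "pause \n"
 "cmpq $0, %[global_lock] \n" //pure load only while waiting
 "jne Spin_Wait \n"
 "jmp Try_Lock \n"
 "Get_Lock_Success:"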

I have tested the following method of obtaining the lock (spinning directly on xchg):

0000000000400741 <Spin_Lock>:
        __asm__(
  400741:       48 c7 c0 01 00 00 00    mov    $0x1,%rax
  400748:       48 87 02                xchg   %rax,(%rdx)
  40074b:       48 83 f8 00             cmp    $0x0,%rax
  40074f:       90                      nop         //nop or pause
  400750:       75 ef                   jne    400741 <Spin_Lock>

0000000000400752 <Get_Lock_Success>:

The test results are as follows:

  • pause version
 Performance counter stats for './spinlock_pause':

             12348      machine_clears.memory_ordering
         523754760      inst_retired.any

      15.155263345 seconds time elapsed

      14.476663000 seconds user
       0.001972000 seconds sys
  • nop version
 Performance counter stats for './spinlock_nopause':

             14277      machine_clears.memory_ordering
         594787887      inst_retired.any

      16.325078817 seconds time elapsed

      15.421201000 seconds user
       0.001984000 seconds sys

The difference in machine_clears.memory_ordering between the versions above is not significant, and the same holds for inst_retired.any. Can this be explained by the cycle cost of pause varying between scenarios? (The test machine is Haswell, where pause blocks for only a few cycles, versus roughly 100 on Skylake and later, as mentioned in the comments.)


Part of the idea of Skylake's change to pause (making it block 100 cycles instead of 5) is to let the other hyperthread get more useful work done while we're waiting for a lock, even if that slightly delays us from noticing the lock is available. But here there is no useful work; all threads spend most of their time spin-waiting. This is not the case Intel optimized for.

I also think so, and I will run some tests on this later. But I still want to know how much improvement the pause instruction can bring by avoiding memory-order violations (leaving aside the hyper-threading optimization), or what kind of code would produce clear-cut data.




Thank you again, Peter, for your answers.

  • `xchg` or `lock`-anything won't normally make data visible to other threads sooner than `mov`. The store buffer already drains itself as fast as it can; full barriers just make the current thread wait for that to happen. [Does hardware memory barrier make visibility of atomic operations faster in addition to providing necessary guarantees?](https://stackoverflow.com/q/61591287) – Peter Cordes Aug 23 '23 at 08:43
  • `cmpq $0, %[global_lock]` is the load I was talking about. You should do that in the spin loop if xchg fails, but not before the first `xchg`. [Locks around memory manipulation via inline assembly](https://stackoverflow.com/a/37246263) has an example take-lock function written the way I was talking about. Your current code is I think spinning on `xchg`, with no loads that aren't part of atomic RMWs except in the critical section. (The `mov` load after getting ownership of the lock is indeed useless.) – Peter Cordes Aug 23 '23 at 08:49
  • With `xchg` in the retry loop (no read-only check until the lock is seen to be available), `pause` results in fewer total instructions and shorter total time. Probably because it slows down waiting threads so they don't hammer as much on the contended cache line, so they don't delay the unlocking as much. Maybe also they let the unlocking thread re-take the lock more often, since there's no pause between those ops. Unclear how many threads you're testing with, though. – Peter Cordes Aug 23 '23 at 08:54
  • Is user time really less than elapsed, like less than 1 thread average? You're not measuring the `task-clock` or `cycles` events so that's all I had to guess at that. Also, 15 secs here vs. 70 in the question, so this is 4.66 times faster for the same total number of increments? – Peter Cordes Aug 23 '23 at 08:56
  • Your inline asm statements need a `"memory"` clobber because you need to stop compile-time reordering of them with anything in the critical section. But yes, I think those looks correct now, if I didn't miss anything trying to only look at the `+` lines of the diff. Various micro-optimizations are possible, like `mov $1, %eax` instead of `%rax` to still zero-extend into RAX. Or better, make the lock variable 32-bit so you don't need 64-bit operand-size all over the place; you're only using 1 bit of it anyway. And `test %eax, %eax` sets FLAGS the same as `cmp $0, %eax`, but is smaller. – Peter Cordes Aug 23 '23 at 22:50
0

I'm very sorry for taking so long to reply to your comment.

release_lock: XCHG vs MOV

I have written a script that averages the values across all perf stat outputs, and ran the tests below with it.

  • The test based on the Aug 21 version

NOTE

This code is recorded in the spin_lock_test_8_18 branch.

Performance counter stats for 'spinlock_pause':
           4,624,476    machine_clears.memory_ordering:u
       3,410,737,040    inst_retired.any:u
     209,748,022,626    cycles:u
              65,847    task-clock:u
           66.740276    seconds time elapsed
           65.627257    seconds user
            0.011180    seconds sys
Performance counter stats for 'spinlock_nopause':
          80,937,101    machine_clears.memory_ordering:u
      24,912,493,022    inst_retired.any:u
     204,791,415,270    cycles:u
              64,285    task-clock:u
           65.064343    seconds time elapsed
           64.076759    seconds user
            0.007806    seconds sys
  • Modify the code as follows:
 int release_lock()
 {
        int unlock_val = 0;
-       __asm("xchg %[unlock_val], (%[global_lock_addr])"::
-                       [global_lock_addr] "r" (global_lock),
-                       [unlock_val] "r" (unlock_val)
-                       :);
+       //__asm("xchg %[unlock_val], (%[global_lock_addr])"::
+       //              [global_lock_addr] "r" (global_lock),
+       //              [unlock_val] "r" (unlock_val)
+       //              :);
+       __asm__("movq $0, %[global_lock]":
+                [global_lock] "+m" (*global_lock)
+                :
+                :);
 }

The resulting perf stat data is as follows:

Performance counter stats for 'spinlock_pause':
           2,538,595    machine_clears.memory_ordering:u
       1,059,975,768    inst_retired.any:u
      63,680,330,835    cycles:u
              19,993    task-clock:u
           20.191900    seconds time elapsed
           19.929954    seconds user
            0.003206    seconds sys
Performance counter stats for 'spinlock_nopause':
          20,891,231    machine_clears.memory_ordering:u
       7,482,792,637    inst_retired.any:u
      61,691,562,988    cycles:u
              19,370    task-clock:u
           19.567399    seconds time elapsed
           19.307589    seconds user
            0.003171    seconds sys

  • The test based on the Aug 23 version

NOTE

This code is recorded in the master branch commit : 981feeaa2a78c7ade12b690ee0c750296f6cc03a

Performance counter stats for 'spinlock_pause':
           2,668,950    machine_clears.memory_ordering:u
       1,001,791,880    inst_retired.any:u
      60,560,465,430    cycles:u
              19,018    task-clock:u
           19.255554    seconds time elapsed
           18.954110    seconds user
            0.004128    seconds sys
Performance counter stats for 'spinlock_nopause':
          21,209,982    machine_clears.memory_ordering:u
       7,516,987,863    inst_retired.any:u
      59,556,471,561    cycles:u
              18,700    task-clock:u
           18.906565    seconds time elapsed
           18.637049    seconds user
            0.005327    seconds sys

  • Modify rax to eax

NOTE

This code is recorded in the master branch commit : 208ce2a6057a4b97263ab218590b026a1f2e9192

Performance counter stats for 'spinlock_pause':
           2,376,406    machine_clears.memory_ordering:u
         959,953,295    inst_retired.any:u
      54,003,323,734    cycles:u
              16,964    task-clock:u
           17.232234    seconds time elapsed
           16.902346    seconds user
            0.006810    seconds sys
Performance counter stats for 'spinlock_nopause':
          19,962,974    machine_clears.memory_ordering:u
       7,034,392,892    inst_retired.any:u
      56,486,680,780    cycles:u
              17,735    task-clock:u
           17.940551    seconds time elapsed
           17.675627    seconds user
            0.004624    seconds sys

It can be seen that changing the xchg in release_lock to a plain mov significantly improves performance (for the pause version, from about 66.7 s to about 20.2 s), since releasing the lock no longer requires an atomic RMW.

The pause instruction and Hyper-Threading

On a machine with 20 cores and 40 threads, run 20 spin-wait loop programs plus an additional 20 processes that only run infinite loops.

The infinite-loop code:

// FILE: only_loop
int main()
{
        for (;;)
        {
        }
}

The results of this test:

Performance counter stats for 'spinlock_pause':
           2,196,923    machine_clears.memory_ordering:u
         553,103,942    inst_retired.any:u
      35,262,014,237    cycles:u
              11,075    task-clock:u
           11.220299    seconds time elapsed
           11.038079    seconds user
            0.002871    seconds sys
Performance counter stats for 'spinlock_nopause':
          12,917,030    machine_clears.memory_ordering:u
       3,510,805,805    inst_retired.any:u
      33,626,983,054    cycles:u
              10,553    task-clock:u
           10.642098    seconds time elapsed
           10.518182    seconds user
            0.002419    seconds sys
Performance counter stats for 'only_loop_spinlock_pause':
                   0    machine_clears.memory_ordering:u
      32,862,792,348    inst_retired.any:u
      42,268,579,050    cycles:u
              13,271    task-clock:u
           13.372920    seconds time elapsed
           13.227297    seconds user
            0.002964    seconds sys
Performance counter stats for 'only_loop_spinlock_nopause':
                   0    machine_clears.memory_ordering:u
      30,620,332,985    inst_retired.any:u
      40,411,022,277    cycles:u
              12,682    task-clock:u
           12.812660    seconds time elapsed
           12.639875    seconds user
            0.002321    seconds sys

It can be observed that there is no significant difference between the nopause and pause runs in the results for the infinite-loop processes.

NOTE

I need time to learn about CPU pipeline and cache-related topics. Thank you very much :)