
Example of using the membarrier() function, taken from the Linux manual page: https://man7.org/linux/man-pages/man2/membarrier.2.html

       #include <stdlib.h>

       static volatile int a, b;

       static void
       fast_path(int *read_b)
       {
           a = 1;
           asm volatile ("mfence" : : : "memory");
           *read_b = b;
       }

       static void
       slow_path(int *read_a)
       {
           b = 1;
           asm volatile ("mfence" : : : "memory");
           *read_a = a;
       }

       int
       main(int argc, char **argv)
       {
           int read_a, read_b;

           /*
            * Real applications would call fast_path() and slow_path()
            * from different threads. Call those from main() to keep
            * this example short.
            */

           slow_path(&read_a);
           fast_path(&read_b);

           /*
            * read_b == 0 implies read_a == 1 and
            * read_a == 0 implies read_b == 1.
            */

           if (read_b == 0 && read_a == 0)
               abort();

           exit(EXIT_SUCCESS);
       }

The code above transformed to use membarrier() becomes:

       #define _GNU_SOURCE
       #include <stdlib.h>
       #include <stdio.h>
       #include <unistd.h>
       #include <sys/syscall.h>
       #include <linux/membarrier.h>

       static volatile int a, b;

       static int
       membarrier(int cmd, unsigned int flags, int cpu_id)
       {
           return syscall(__NR_membarrier, cmd, flags, cpu_id);
       }

       static int
       init_membarrier(void)
       {
           int ret;

           /* Check that membarrier() is supported. */

           ret = membarrier(MEMBARRIER_CMD_QUERY, 0, 0);
           if (ret < 0) {
               perror("membarrier");
               return -1;
           }

           if (!(ret & MEMBARRIER_CMD_GLOBAL)) {
               fprintf(stderr,
                   "membarrier does not support MEMBARRIER_CMD_GLOBAL\n");
               return -1;
           }

           return 0;
       }

       static void
       fast_path(int *read_b)
       {
           a = 1;
           asm volatile ("" : : : "memory");
           *read_b = b;
       }

       static void
       slow_path(int *read_a)
       {
           b = 1;
           membarrier(MEMBARRIER_CMD_GLOBAL, 0, 0);
           *read_a = a;
       }

       int
       main(int argc, char **argv)
       {
           int read_a, read_b;

           if (init_membarrier())
               exit(EXIT_FAILURE);

           /*
            * Real applications would call fast_path() and slow_path()
            * from different threads. Call those from main() to keep
            * this example short.
            */

           slow_path(&read_a);
           fast_path(&read_b);

           /*
            * read_b == 0 implies read_a == 1 and
            * read_a == 0 implies read_b == 1.
            */

           if (read_b == 0 && read_a == 0)
               abort();

           exit(EXIT_SUCCESS);
       }

This "membarrier" description is taken from the Linux manual. I am still confused about how does trhe "membarrier" function add overhead to the slow side, and remove overhead from the fast side, thus resulting in an overall performance increase as long as the slow side is infrequent enough that the overhead of the membarrier() calls does not outweigh the performance gain on the fast side.

Could you please describe it in more detail?

Thanks!

syacer
  • Because `mfence` is slow, and calling membarrier() only when needed lets the fast side avoid using mfence all the time. (i.e. in a loop). – Peter Cordes May 04 '21 at 09:18
  • @PeterCordes, Thanks for your reply. I am still confused. The upper example uses mfence in both slow_path and fast_path. In the lower example, fast_path only prevents compiler reordering, while slow_path uses membarrier(). Why can we go from using mfence in both functions to using compiler enforcement plus membarrier()? Why can we replace mfence with asm volatile ("" : : : "memory"); ? – syacer May 04 '21 at 10:05
  • The scheduler might perform out-of-order execution without mfence even though we have "asm volatile ("" : : : "memory");", right? – syacer May 04 '21 at 10:10
  • It's actually the [store buffer](https://stackoverflow.com/questions/64141366/can-a-speculatively-executed-cpu-branch-contain-opcodes-that-access-ram) that creates StoreLoad reordering (the only kind x86 allows), [not OoO exec](https://stackoverflow.com/questions/50307693/does-an-x86-cpu-reorder-instructions), but yeah, that can happen when you only force the asm to do operations in program-order. – Peter Cordes May 04 '21 at 11:20
  • This pair of writes-then-read-the-other-var is https://preshing.com/20120515/memory-reordering-caught-in-the-act/. Consider what would happen if an mfence-on-every-core had to be part of every possible order. – Peter Cordes May 04 '21 at 11:22

1 Answer


This pair of writes-then-read-the-other-var is https://preshing.com/20120515/memory-reordering-caught-in-the-act/, a demo of StoreLoad reordering (the only kind x86 allows, given its program-order + store buffer with store forwarding memory model).

With only one local MFENCE you could still get reordering:

   FAST                          SLOW  (using just mfence, not membarrier)
a = 1 exec
read_b = b;  // 0
                             b = 1;
                             mfence   (force b=1 to be visible before reading a)
                             read_a = a;   // 0
a = 1 visible (global vis. delayed by store buffer)

But consider what would happen if an mfence-on-every-core had to be part of every possible order, between the slow-path's store and its reload.

This ordering would no longer be possible. If read_b = b has already read a 0, then a = 1 is already pending¹ (if it isn't visible already). It's impossible for it to stay private until after read_a = a, because membarrier() makes sure a full barrier runs on every core, and SLOW waits for that to happen (for membarrier() to return) before reading a.
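
Illustratively, one possible timeline with membarrier() on the slow side (where exactly the IPI-driven barrier lands on FAST's core is just one possibility):

   FAST                          SLOW  (using membarrier)
a = 1 exec   (sits in the store buffer)
read_b = b;  // 0
                             b = 1;
                             membarrier()  ... full barrier runs on FAST's core,
                                               so a = 1 commits and becomes globally visible
                             membarrier() returns
                             read_a = a;   // 1, not 0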

And there's no way to get 0,0 from having SLOW execute first; it runs membarrier itself so its store is definitely visible to other threads before it reads a.

Footnote 1: waiting to execute, or waiting in the store buffer to commit to L1d cache. The asm("" ::: "memory") ensures that the store is at least executed before the load in the asm, but it's actually redundant here because volatile itself guarantees that the accesses happen in the asm in program order. And we basically need volatile for other reasons when hand-rolling atomics instead of using C11 _Atomic. (But generally don't do that unless you're actually writing kernel code; use atomic_store_explicit(&a, 1, memory_order_release);.)
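
As a rough sketch of that last suggestion (not code from the man page; it reuses the membarrier() wrapper from the listing in the question, and uses atomic_signal_fence() as the compiler-only stand-in for the asm("" ::: "memory") clobber), the two paths could look like:

       #define _GNU_SOURCE
       #include <stdatomic.h>
       #include <unistd.h>
       #include <sys/syscall.h>
       #include <linux/membarrier.h>

       static atomic_int a, b;

       static int
       membarrier(int cmd, unsigned int flags, int cpu_id)
       {
           return syscall(__NR_membarrier, cmd, flags, cpu_id);
       }

       static void
       fast_path(int *read_b)
       {
           atomic_store_explicit(&a, 1, memory_order_relaxed);

           /* Compiler-only barrier, doing the same job as
              asm("" ::: "memory"): keep the store and the load in
              program order in the emitted asm. No fence instruction is
              generated; the run-time StoreLoad ordering comes from the
              slow side's membarrier(). */
           atomic_signal_fence(memory_order_seq_cst);

           *read_b = atomic_load_explicit(&b, memory_order_relaxed);
       }

       static void
       slow_path(int *read_a)
       {
           atomic_store_explicit(&b, 1, memory_order_relaxed);

           /* Full memory barrier on every core running a thread of any
              process (MEMBARRIER_CMD_GLOBAL) before this call returns. */
           membarrier(MEMBARRIER_CMD_GLOBAL, 0, 0);

           *read_a = atomic_load_explicit(&a, memory_order_relaxed);
       }

This would slot into the same init_membarrier() / main() harness as the listing in the question. Using memory_order_release for the stores would also be fine (that's what the footnote shows); relaxed plus the compiler-only fence is just the most literal translation of volatile plus the asm memory clobber.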


Note it's actually the store buffer that creates StoreLoad reordering (the only kind x86 allows), not so much OoO exec. In fact, a store buffer also lets x86 execute stores out-of-order and then make them globally visible in program order (if it turns out they weren't the result of mis-speculation or something!).

Also note that in-order CPUs can do their memory accesses out of order. They start instructions (including loads) in order, but can let them complete out of order, e.g. by scoreboarding loads to allow hit-under-miss. See also How is load->store reordering possible with in-order commit?

Peter Cordes
  • Intel says a load can be reordered before a store to a different address, which means the read may have happened while the write is not yet completed from a single CPU's point of view. If the IPI inserts a memory barrier while the write hasn't even started to execute, then flushing the store buffer can't let the slow path see the fast path's write, can it? – Chinaxing Jun 10 '21 at 13:41
  • @Chinaxing: A memory barrier also makes sure any earlier un-executed stores actually execute as well! Otherwise it wouldn't be very useful. (Taking an exception already means there are no earlier un-executed instructions, though. The pipeline doesn't rename the privilege level, so it can't run any instructions from the interrupt handler until after draining the ROB. See Andy Glew's explanation: [When an interrupt occurs, what happens to instructions in the pipeline?](https://stackoverflow.com/q/8902132) - he was one of the lead architects on Intel's P6 design.) – Peter Cordes Jun 10 '21 at 14:57
  • So, is such behavior promised anywhere in documentation or a white paper? – Chinaxing Jun 11 '21 at 09:07
  • @Chinaxing: yes, C `<stdatomic.h>` and C++ `<atomic>` barriers like `atomic_thread_fence(memory_order_release)` are well documented, and portable. (https://preshing.com/20130922/acquire-and-release-fences/) On x86 it doesn't need any asm instructions; on AArch64 it does. In both cases it blocks compile-time reordering, at least of atomic variables. If you're asking about where CPU memory models are documented, that's in vendor documentation, e.g. Intel's SDM manuals, plus 3rd-party efforts at a formal model. – Peter Cordes Jun 11 '21 at 09:12