
I wrote a simple program to test memory synchronization. It uses a global queue shared between two processes, with each process bound to a different core. My code is below.

 #define _GNU_SOURCE   /* must come before the includes for CPU_SET etc. */
 #include <stdio.h>
 #include <sched.h>
 #include <unistd.h>   /* fork(), usleep() */

 void bindcpu(int pid) {
     int cpuid;
     cpu_set_t mask;
     cpu_set_t get;
     CPU_ZERO(&mask);

     if (pid > 0) {
         cpuid = 1;
     } else {
         cpuid = 5;
     }

     CPU_SET(cpuid, &mask);

     if (sched_setaffinity(0, sizeof(mask), &mask) == -1) {
         printf("warning: could not set CPU affinity, continuing...\n");
     }
 }

 #define Q_LENGTH 512
 int g_queue[Q_LENGTH];

 struct point {
     int volatile w;
     int volatile r;
 };  

 volatile struct point g_p;


 void iwrite(int x) {
     while (g_p.r == g_p.w);
     usleep(100000);  /* sleep() takes whole seconds; sleep(0.1) truncates to 0 */
     g_queue[g_p.w] = x;
     g_p.w = (g_p.w + 1) % Q_LENGTH;
     printf("#%d!%d", g_p.w, g_p.r);
 }

 void iread(int *x) {
     while (((g_p.r + 1) % Q_LENGTH) == g_p.w);
     *x = g_queue[g_p.r];
     g_p.r = (g_p.r + 1) % Q_LENGTH;
     printf("-%d*%d", g_p.r, g_p.w);
 }

 int main(int argc, char * argv[]) {
     //int num = sysconf(_SC_NPROCESSORS_CONF);
     int pid;

     pid = fork();
     g_p.r = Q_LENGTH;
     bindcpu(pid);
     int i = 0, j = 0;

     if (pid > 0) {
         printf("call iread\n");
         while (1) {
             iread(&j);
         }
     } else {
         printf("call iwrite\n");
         while (1) {
             iwrite(i);
             i++;
         }
     }

 }

The data was not synchronized between the two processes on the two cores.

CPU: Intel(R) Xeon(R) CPU E3-1230 OS: 3.8.0-35-generic #50~precise1-Ubuntu SMP

I want to know, beyond IPC, how I can synchronize data between different cores in user space.

hangkongwang
  • You will need to use some form of inter-process communication if you want the two processes to communicate with one another. There is a good answer on inter-process communication here: http://stackoverflow.com/questions/2682462/c-fork-and-sharing-memory. Also, just an observation: you are not initializing your point struct instance. – KorreyD Dec 31 '13 at 15:22
  • Globals are initialized to 0 by default. Thanks for the link. – hangkongwang Dec 31 '13 at 16:18
  • I want to know, beyond IPC, whether there is a way to control the cache/memory from user space, or any system call like rmb()/wmb(). – hangkongwang Dec 31 '13 at 16:24

2 Answers


If you want your application to manipulate the CPU's shared cache in order to accomplish IPC, I don't believe you will be able to do that.

Chapter 9 of "Linux Kernel Development, Second Edition" has information on synchronizing multi-threaded applications (including atomic operations, semaphores, barriers, etc.): http://www.makelinux.net/books/lkd2/ch09. You may get some ideas on what you are looking for there.

Here is a decent write-up on Intel® Smart Cache, "Software Techniques for Shared-Cache Multi-Core Systems": http://archive.is/hm0y

Here are some Stack Overflow questions/answers that may help you find the information you are looking for:

Storing C/C++ variables in processor cache instead of system memory

C++: Working with the CPU cache

Understanding how the CPU decides what gets loaded into cache memory

Sorry for bombarding you with links, but this is the best I can do without a clearer understanding of what you are looking to accomplish.

KorreyD
  • Thanks! My problem is that I wasn't clear about what I wanted to know. – hangkongwang Jan 01 '14 at 00:41
  • I read those links; I think the shared cache is cool and smart. My CPU has a shared L3, but it didn't work. How can I check the cache configuration on my system, and how can I modify it? – hangkongwang Jan 01 '14 at 02:13

I suggest reading "Volatile: Almost Useless for Multi-Threaded Programming" for why volatile should be removed from the example code. Instead, use C11 or C++11 atomic operations. See also the Fenced Data Transfer example in the TBB Design Patterns Manual.

Below I show the parts of the question example that I changed to use C++11 atomics. I compiled it with g++ 4.7.2.

 #include <atomic>

...

  struct point {
      std::atomic<int> w;
      std::atomic<int> r;
  };

  struct point g_p;

 void iwrite(int x) {
     int w = g_p.w.load(std::memory_order_relaxed);
     int r;
     while ((r=g_p.r.load(std::memory_order_acquire)) == w);
     usleep(100000);  /* sleep(0.1) truncates to 0; usleep takes microseconds */
     g_queue[w] = x;
     w = (w+1)%Q_LENGTH;
     g_p.w.store( w, std::memory_order_release);
     printf("#%d!%d", w, r);
 }

 void iread(int *x) {
     int r = g_p.r.load(std::memory_order_relaxed);
     int w;
     while (((r + 1) % Q_LENGTH) == (w=g_p.w.load(std::memory_order_acquire)));
     *x = g_queue[r];
     g_p.r.store( (r + 1) % Q_LENGTH, std::memory_order_release );
     printf("-%d*%d", r, w);
 }

The key changes are:

  • I removed "volatile" everywhere.
  • The members of struct point are declared as std::atomic
  • Some loads and stores of g_p.r and g_p.w are fenced. Others are hoisted.
  • When loading a variable modified by another thread, the code "snapshots" it into a local variable.

The code uses "relaxed load" (no fence) where a thread loads a variable that no other thread modifies. I hoisted those loads out of the spin loops since there is no point in repeating them.

The code uses an "acquiring load" where a thread loads a "message is ready" indicator that is set by another thread, and a "releasing store" where it stores a "message is ready" indicator to be read by another thread. The release is necessary to ensure that the "message" (queue data) is written before the "ready" indicator (member of g_p) is written. The acquire is likewise necessary to ensure that the "message" is read after the "ready" indicator is seen.

The snapshots are used so that the printf reports the value that the thread actually used, as opposed to some new value that appeared later. In general I like to use the snapshot style for two reasons. First, touching shared memory can be expensive because it often requires cache-line transfers. Second, the style gives me a stable value to use locally without having to worry that a reread might return a different value.

Arch D. Robison
  • I found a mistake in my code: I assumed fork would let the processes share the same mm_struct, meaning the processes would run in the same memory space. But I checked the kernel code, and for processes created by fork the memory is "copy on write", so each process has its own copy in its own memory space (implemented via page faults). – hangkongwang Jan 04 '14 at 03:08
  • And using vfork and memory barriers may accomplish what I want. I want to run the code as fast as the multicore allows: one producer, one consumer, without any IPC, without any atomics, without any pipeline stalls, and with as high a cache hit rate as possible. – hangkongwang Jan 04 '14 at 03:15