
Let's imagine that I have a few worker threads that each run a loop like the following:

while (1) {
    do_something();

    if (flag_isset())
        do_something_else();
}

We have a couple of helper functions for checking and setting a flag:

void flag_set()   { global_flag = 1; }
void flag_clear() { global_flag = 0; }
int  flag_isset() { return global_flag; }

Thus the threads keep calling do_something() in a busy loop, and whenever some other thread sets global_flag they also call do_something_else() (which could, for example, output progress or debugging information when requested by setting the flag from another thread).

My question is: Do I need to do something special to synchronize access to the global_flag? If yes, what exactly is the minimum work to do the synchronization in a portable way?

I have tried to figure this out by reading many articles but I am still not quite sure of the correct answer... I think it is one of the following:

A: No need to synchronize because setting or clearing the flag does not create race conditions:

We just need to define the flag as volatile to make sure that it is really read from the shared memory every time it is being checked:

volatile int global_flag;

It might not propagate to other CPU cores immediately but will sooner or later, guaranteed.

B: Full synchronization is needed to make sure that changes to the flag are propagated between threads:

Setting the shared flag in one CPU core does not necessarily make it seen by another core. We need to use a mutex to make sure that flag changes are always propagated by invalidating the corresponding cache lines on other CPUs. The code becomes as follows:

volatile int    global_flag;
pthread_mutex_t flag_mutex = PTHREAD_MUTEX_INITIALIZER;

void flag_set()   { pthread_mutex_lock(&flag_mutex); global_flag = 1; pthread_mutex_unlock(&flag_mutex); }
void flag_clear() { pthread_mutex_lock(&flag_mutex); global_flag = 0; pthread_mutex_unlock(&flag_mutex); }

int  flag_isset()
{
    int rc;
    pthread_mutex_lock(&flag_mutex);
    rc = global_flag;
    pthread_mutex_unlock(&flag_mutex);
    return rc;
}

C: Synchronization is needed to make sure that changes to the flag are propagated between threads:

This is the same as B, but instead of using a mutex on both sides (reader & writer) we use it only on the writing side. Because the logic itself does not require mutual exclusion, we just need to synchronize (invalidate other caches) when the flag is changed:

volatile int    global_flag;
pthread_mutex_t flag_mutex = PTHREAD_MUTEX_INITIALIZER;

void flag_set()   { pthread_mutex_lock(&flag_mutex); global_flag = 1; pthread_mutex_unlock(&flag_mutex); }
void flag_clear() { pthread_mutex_lock(&flag_mutex); global_flag = 0; pthread_mutex_unlock(&flag_mutex); }

int  flag_isset() { return global_flag; }

This would avoid continuously locking and unlocking the mutex when we know that the flag is rarely changed. We are just using a side-effect of Pthreads mutexes to make sure that the change is propagated.

So, which one?

I think A and B are the obvious choices, B being safer. But how about C?

If C is ok, is there some other way of forcing the flag change to be visible on all CPUs?

There is one somewhat related question: Does guarding a variable with a pthread mutex guarantee it's also not cached? ...but it does not really answer this.

snap

  • Just a note that if you're using GNU C, you can declare your variable with type `sig_atomic_t` to ensure that get/set can be done in one instruction. – Michael Mior Aug 28 '11 at 19:35
  • `sig_atomic_t` makes the access of the variable atomic (which is not really the point of this question) but does this guarantee cache coherency? Or is there any cache coherency issue to worry about in the first place? – snap Aug 28 '11 at 19:42
  • No, this has nothing to do with cache coherency. This probably should have been a comment on Sparky's answer, as it's relevant in that context. – Michael Mior Aug 28 '11 at 19:44

4 Answers


The 'minimum amount of work' is an explicit memory barrier. The syntax depends on your compiler; on GCC you could do:

void flag_set()   {
  global_flag = 1;
  __sync_synchronize();
}

void flag_clear() {
  global_flag = 0;
  __sync_synchronize();
}

int  flag_isset() {
  int val;
  // Prevent the read from migrating backwards
  __sync_synchronize();
  val = global_flag;
  // and prevent it from being propagated forwards as well
  __sync_synchronize();
  return val;
}

These memory barriers accomplish two important goals:

  1. They force a compiler flush. Consider a loop like the following:

     for (int i = 0; i < 1000000000; i++) {
       flag_set(); // assume this is inlined
       local_counter += i;
     }
    

    Without a barrier, a compiler might choose to optimize this to:

     for (int i = 0; i < 1000000000; i++) {
       local_counter += i;
     }
     flag_set();
    

    Inserting a barrier forces the compiler to write the variable back immediately.

  2. They force the CPU to order its writes and reads. This is not so much an issue with a single flag - most CPU architectures will eventually see a flag that's set even without CPU-level barriers. However, the order might change. If we have two flags, and on thread A:

      // start with only flag A set
      flag_set_B();
      flag_clear_A();
    

    And on thread B:

      a = flag_isset_A();
      b = flag_isset_B();
      assert(a || b); // can be false!
    

    Some CPU architectures allow these writes to be reordered; you may see both flags being false (i.e., the write clearing flag A became visible before the write setting flag B). This can be a problem if a flag protects, say, a pointer being valid. Memory barriers force an ordering on writes to protect against these problems.

Note also that on some CPUs, it's possible to use 'acquire-release' barrier semantics to further reduce overhead. Such a distinction does not exist on x86, however, and would require inline assembly on GCC.
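(For reference, GCC 4.7 and later also provide the __atomic builtins, which can express acquire/release ordering directly without inline assembly. A minimal sketch of the same helpers written with them - still GCC-specific, so no more portable than __sync_synchronize:)

int global_flag;

void flag_set()   {
  /* release: writes made before this store become visible to any thread
     that later reads the flag with an acquire load */
  __atomic_store_n(&global_flag, 1, __ATOMIC_RELEASE);
}

void flag_clear() {
  __atomic_store_n(&global_flag, 0, __ATOMIC_RELEASE);
}

int  flag_isset() {
  /* acquire: reads that follow this load cannot be reordered before it */
  return __atomic_load_n(&global_flag, __ATOMIC_ACQUIRE);
}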

A good overview of what memory barriers are and why they are needed can be found in the Linux kernel documentation directory. Finally, note that this code is enough for a single flag, but if you want to synchronize against any other values as well, you must tread very carefully. A lock is usually the simplest way to do things.
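For instance, a minimal lock-based sketch that publishes a piece of data together with the flag (the publish/consume helpers and the pending_msg field are purely illustrative) would keep both under the same mutex:

#include <pthread.h>
#include <stddef.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int   flag        = 0;
static char *pending_msg = NULL;   /* the data the flag is supposed to publish */

void publish(char *msg) {
  pthread_mutex_lock(&lock);
  pending_msg = msg;               /* write the data first...          */
  flag = 1;                        /* ...then the flag, under one lock */
  pthread_mutex_unlock(&lock);
}

char *consume(void) {
  char *msg = NULL;
  pthread_mutex_lock(&lock);
  if (flag) {                      /* flag and data are read consistently
                                      with respect to publish()        */
    msg = pending_msg;
    flag = 0;
  }
  pthread_mutex_unlock(&lock);
  return msg;                      /* NULL if nothing was published    */
}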

bdonlan
  • Thank You for this Excellent answer! It is exactly to the point I was uncertain about. Because these memory barrier definitions are compiler and/or OS specific, portable code can not use them. As you say: _A lock is usually the simplest way to do things._ Would you be able to elaborate (going back to my original question): will option C be guaranteed to be correct because it uses (an otherwise unneeded "dummy" lock) or are the locking calls needed on the _read_ side as well in each iteration? What would be the most efficient yet still portable way of enforcing the memory barrier? – snap Aug 29 '11 at 04:35
  • Answering my own previous comment: I suppose there is no way of reliably skipping the locking on the _read_ side because that side could still be vulnerable to funny compiler or CPU read ordering. Would a dummy call to some other pthread_* function be more efficient than pthread_mutex_lock() followed by pthread_mutex_unlock() when we actually do not need a real lock but just a memory barrier? There are many other functions that do memory synchronization in the Open Group list. – snap Aug 29 '11 at 04:49
  • Ok, this http://stackoverflow.com/questions/1185063/memory-barriers-in-userspace-linux-x86-64 answers the above question. Pthreads mutex is the proper way to get a memory barrier when using Pthreads even when a real mutex is not needed by the program logic. – snap Aug 29 '11 at 12:44

You must not cause data races. They are undefined behavior and the compiler is allowed to do anything and everything it pleases.

A humorous blog on the topic: http://software.intel.com/en-us/blogs/2013/01/06/benign-data-races-what-could-possibly-go-wrong

Case A: There is no synchronization on the flag, so anything is allowed to happen. For example, the compiler is allowed to turn

flag_set();
while(weArentBoredLoopingYet())
    doSomethingVeryExpensive();
flag_clear();

into

while(weArentBoredLoopingYet())
    doSomethingVeryExpensive();
flag_set();
flag_clear();

Note: this kind of race is actually very popular. Your mileage may vary. On one hand, the de-facto implementation of pthread_once involves a data race like this. On the other hand, it is undefined behavior. On most versions of gcc, you can get away with it because gcc chooses not to exercise its right to optimize this way in many cases, but it is not "spec" code.

Case B: Full synchronization is the right call. This is simply what you have to do.

Case C: Synchronization on the writer only could work, if you can prove that no one wants to read the flag while it is being written. The official definition of a data race (from the C++11 specification) is one thread writing to a variable while another thread can concurrently read or write the same variable. If your readers and writers all run at once, you still have a data race. However, if you can prove that the writer writes once, there is some synchronization, and then the readers all read, then the readers do not need synchronization.

As for caching, the rule is that a mutex lock/unlock synchronizes with all threads that lock/unlock the same mutex. This means you will not see any unusual caching effects (although under the hood, your processor can do spectacular things to make this run faster... it's just obliged to make it look like it wasn't doing anything special). If you don't synchronize, however, you get no guarantees that the other thread doesn't have changes to push that you need!

All of that being said, the question is really how much you are willing to rely on compiler-specific behavior. If you want to write proper code, you need to do proper synchronization. If you are willing to rely on the compiler to be kind to you, you can get away with a lot less.

If you have C++11, the easy answer is to use atomic_flag, which is designed to do exactly what you want AND is designed to synchronize correctly for you in most cases.
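(For C rather than C++, the C11 counterpart lives in <stdatomic.h>. Note that atomic_flag itself only supports test-and-set and clear, so for the plain set/clear/isset pattern in the question an _Atomic int is the more direct fit. A minimal sketch, assuming a C11 compiler:)

#include <stdatomic.h>

atomic_int global_flag = 0;   /* plain atomic_load/atomic_store default to sequentially consistent ordering */

void flag_set(void)   { atomic_store(&global_flag, 1); }
void flag_clear(void) { atomic_store(&global_flag, 0); }
int  flag_isset(void) { return atomic_load(&global_flag); }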

Cort Ammon

For the example you have posted, case A is sufficient provided that ...

  1. Getting and setting the flag takes only one CPU instruction.
  2. do_something_else() is not dependent upon the flag being set during the execution of that routine.

If getting and/or setting the flag takes more than one CPU instruction, then you must use some form of locking.

If do_something_else() is dependent upon the flag being set during the execution of that routine, then you must lock as in case C, but the mutex must be locked before calling flag_isset().
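Roughly, for the second point, something along these lines - holding the question's flag_mutex across both the check and the dependent work so that the flag cannot change in between:

pthread_mutex_lock(&flag_mutex);
if (global_flag)
    do_something_else();   /* flag is guaranteed to stay set for the whole call */
pthread_mutex_unlock(&flag_mutex);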

Hope this helps.

Sparky
  • do_something() and do_something_else() do not touch the flag and do not necessarily call any pthreads routines. Why does using one CPU instruction guarantee that the other CPU caches are invalidated? – snap Aug 28 '11 at 19:20
  • @snap: I forgot that case C dealt with multiple processors. I am generally accustomed to dealing with only uniprocessor systems. I do not know for certain what will happen on an MP system. I would expect (perhaps incorrectly) that the hardware should take care of that synchronization. May others please correct me if I am wrong. – Sparky Aug 28 '11 at 19:32
  • I am worried exactly that the hardware does not necessarily take care of the memory synchronization automatically. I was looking at http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap04.html#tag_04_11 for example. It talks about certain functions performing memory synchronization. If it is guaranteed to happen any way, why are they specifically listing functions which synchronize memory between threads? – snap Aug 28 '11 at 20:00
  • @snap Most (all?) multicore processors have cache coherence protocols that ensure that loads and stores are visible across processors. The biggest problem is with compiler and CPU instruction re-ordering, which is dealt with by memory barriers (see bdonlan's answer). Most pthread primitives will eventually end up using an atomic operation like getAndIncrement or compareAndSwap, which are natural memory barriers. This ensures that synced instructions won't be re-ordered outside the bounds formed by the lock and unlock calls and guarantees that all loads and stores are visible after a call to unlock. – Ze Blob Aug 28 '11 at 21:09

Assigning incoming jobs to worker threads requires no locking. A typical example is a web server, where the request is caught by a main thread, and this main thread selects a worker. I'll try to explain it with some pseudo code.

main task {

  // do forever
  while (true) {

    // wait for a job
    x = null;
    while (x == null) {
      sleep(some);
      x = grabTheJob();
    }

    // select a worker
    bool found = false;
    for (n = 0; n < NUM_OF_WORKERS; n++) {
      if (workerList[n].getFlag() != AVAILABLE) continue;
      workerList[n].setJob(x);
      workerList[n].setFlag(DO_IT_PLS);
      found = true;
      break;
    }

    if (!found) panic("no free worker task! ouch!");

  } // while forever
} // main task


worker task {

  while (true) {
    while (getFlag() != DO_IT_PLS) sleep(some);
    setFlag(BUSY_DOING_THE_TASK);

    /// do it really

    setFlag(AVAILABLE);

  } // while forever 
} // worker task

So, if there is one flag which one party sets to A and the other party sets to B and C (the main task sets it to DO_IT_PLS, and the worker sets it to BUSY and AVAILABLE), there is no conflict. Compare it with a "real-life" example: a teacher giving different tasks to students. The teacher selects a student and gives him/her a task. Then the teacher looks for the next available student. When a student is ready, he/she gets back to the pool of available students.

UPDATE: just to clarify, there is only one main() thread and several (a configurable number of) worker threads. As main() runs in only one instance, there is no need to sync the selection and launch of the workers.

ern0
  • You need a memory barrier between `setJob` and `setFlag`, and again between `getFlag()` and where you actually access the resulting job. Failure to do so may result in subtle memory ordering bugs on certain architectures. See http://www.kernel.org/doc/Documentation/memory-barriers.txt for details. – bdonlan Aug 28 '11 at 20:34
  • This is some of the worst example code I've ever seen. Among its many problems, consider two threads running the "select worker" code at the same time. And how long do you "sleep"? Too long, and you're delaying every job, increasing the chances you'll run out of workers. Not long enough and you'll waste CPU. Also, the code forces the choice of worker rather than leaving the scheduler free to choose the most efficient worker. (So if you have more jobs than cores, you'll wind up with way more context switches than needed.) This is an example of making every possible mistake! – David Schwartz Aug 29 '11 at 16:07
  • I found a good simple example which shows easily in practice how anything like this is bound to fail: http://jakob.engbloms.se/archives/65 ... It varies how often it fails depending on the exact CPU architecture, but it fails surprisingly often! – snap Aug 29 '11 at 16:25
  • Dear @David Schwartz, the main() is running in one instance, therefore no sync needed. Only the one-instance main() sets worker tasks to DO_IT_PLS. – ern0 Aug 30 '11 at 18:26
  • It doesn't matter. You can still fail horribly in dozens of ways. For example, what happens if the thread assigned the job sees the write to the flag but not the write to the job itself? The thread will do the wrong job entirely. – David Schwartz Aug 30 '11 at 18:29
  • I don't understand something. The jobs are distinct things, aren't they? Like in case of a webserver: different requests (or same!) to different clients. The main() is responsible to prepare the job (e.g. if the request arrives in one "channel", make a copy of it for the selected job). – ern0 Aug 30 '11 at 18:36
  • The problem is that when a memory write is done in one thread, there is no guarantee that another thread will see that write. It may, for example, have the value cached in a CPU register. This is why you need locking. – David Schwartz Aug 31 '11 at 07:01