-7

I cannot understand how POSIX allows any thread to unlock (post) a semaphore. Let's consider the following example:

// sem1 and sem2 are POSIX semaphores,
// properly initialized for single process use
// at this point, sem2 is locked, sem1 is unlocked 
// x and y are global (non-atomic, non-volatile) integer variables
// thread 1 - is executing right now

rc = sem_wait(&sem1); // succeeded, semaphore is 0 now
x = 42;
y = 142;
sem_post(&sem2);
while (true);

// thread 2. waits for sem2 to be unlocked by thread1 
sem_wait(&sem2);
sem_post(&sem1);

// thread 3
sem_wait(&sem1); // wakes up when sem1 is unlocked by thread2
#ifdef __cplusplus
std::cout << "x: " << x << "; y: " << y << "\n";
#else
printf("x: %d; y: %d\n", x, y);
#endif

Now, according to everything I've read, this code is 100% kosher for Passover. In thread 3, we are guaranteed to see x as 42 and y as 142. We are protected from any race.

But this is what I can't understand. All those threads can potentially be executing on 3 different cores. And if the chip doesn't have strong memory ordering built in (ARM, PowerPC), or if writes are non-atomic (x86 for unaligned data), how can thread 2 on Core 2 possibly request Core 1 (busy with thread 1) to properly release the data / complete the writes / etc.? As far as I know, there are no such commands!

What am I missing here?

EDIT. Please note that the suggested duplicate doesn't answer my question. It reiterates my statement, but doesn't explain how the effect can possibly be achieved. In particular, it doesn't explain how Core 2 can put a memory barrier on data inside Core 1's cache.

Yu Hao
SergeyA
  • I share your confusion. Looks like a potential race condition to me. Which part of your reading shows how this is safe? – wally Mar 28 '16 at 13:42
  • Possible duplicate of [Is there a full memory barrier around sem\_post(sem\_t \* sem) and sem\_wait(sem\_t \* sem)?](http://stackoverflow.com/questions/16431679/is-there-a-full-memory-barrier-around-sem-postsem-t-sem-and-sem-waitsem-t) – Andrew Henle Mar 28 '16 at 13:43
  • Read this: http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap04.html#tag_04_11, where it states **The following functions synchronize memory with respect to other threads** – Andrew Henle Mar 28 '16 at 13:43
  • @flatmouse, everywhere? *man sem_overview* for instance. – SergeyA Mar 28 '16 at 13:44
  • @AndrewHenle, neither of your links answers my question, it just reiterates what I've said. I am asking **how**. – SergeyA Mar 28 '16 at 13:46
  • If sem_post/sem_wait does not provide fencing then an explicit fencing/memory barrier instruction is required. Are you looking for specific instructions on memory fencing? – Umamahesh P Mar 28 '16 at 13:46
  • @SergeyA *neither of your links answers my question, it just reiterates what I've said. I am asking how.* It took me about 30 seconds to find the question this is a duplicate of. And no, you did not ask *how*, you asked *why*: This question is titled *Why any thread can unlock a semaphore?* *Why* is because `sem_wait()` and `sem_post()` synchronize memory per the POSIX standard I linked to. You also asked *What I am missing here?* You missed the fact that POSIX requires memory synchronization from `sem_post()` and `sem_wait()`. – Andrew Henle Mar 28 '16 at 13:50
  • @AndrewHenle, I missed nothing. But you seem to be missing the whole last paragraph of my question (before the edit) and haven't read past the title 'why'. – SergeyA Mar 28 '16 at 13:52
  • @SergeyA Ok, fair enough. Commonly known. I still have to start with `man man`. :) – wally Mar 28 '16 at 13:53
  • @SergeyA *I missed nothing*? Oh? *how can thread2 on Core2 possibly request Core1 (busy with thread1) to properly release the data / complete writes / etc? As far as I know, there are no such commands!* [Yes, there are.](https://en.wikipedia.org/wiki/Memory_barrier) Here's one for x86: http://x86.renejeschke.de/html/file_module_x86_id_170.html – Andrew Henle Mar 28 '16 at 13:57
  • @AndrewHenle, nothing in this link says that mfence can be used to perform release operation on another CPU. It also is for x86, which has strong memory ordering built into it, so it's rather moot. ARM spec would be more relevant. – SergeyA Mar 28 '16 at 13:59
  • Downvoted. This is getting tedious. – Andrew Henle Mar 28 '16 at 13:59
  • @AndrewHenle, downvoting for not knowing the answer? This is something new. – SergeyA Mar 28 '16 at 14:00
  • @SergeyA You didn't even bother to read the MFENCE link I provided, did you? Did you see the part where it states: **It should be noted that processors are...**. Note the use of the word processors. As in plural - more than one. I found links that actually document answers to your questions in just a few minutes. I downvoted you because you deserve it. – Andrew Henle Mar 28 '16 at 14:04
  • @AndrewHenle, no, it says, that CPUs are allowed to prefetch. It's plural because of grammar. And you don't seem to understand what I am talking about at all. You miss the fact that x86 is strong memory ordering chip, so there is no sequencing issue for x86. You did downvote, and you do not seem to have even slightest idea of what I am talking about, so please move on to the next question. – SergeyA Mar 28 '16 at 14:06
  • @SergeyA Google "cache coherence" and quit expecting everyone to solve your problems for you. – Andrew Henle Mar 28 '16 at 14:09
  • @AndrewHenle, I am well aware of cache coherence. But you do not understand what weak memory ordering is, right? I googled something for you (hoping that you will make yourself busy with reading it and will move away from my obnoxious question): https://fgiesen.wordpress.com/2014/07/07/cache-coherency/ – SergeyA Mar 28 '16 at 14:13
  • @SergeyA You mean like the PowerPC `sync` instruction? https://www.ibm.com/support/knowledgecenter/ssw_aix_53/com.ibm.aix.aixassem/doc/alangref/sync.htm%23a28692bf Multiprocessor machines have solved this problem. I can find the solutions. – Andrew Henle Mar 28 '16 at 14:18
  • @SergeyA, your C example code is a tremendous distraction to what you seem actually to be asking. Indeed, if, as it now seems, your question is fundamentally about CPU instruction sets, then the question is poorly tagged. In that case, simplify your example to one semaphore, choose at most one of C or C++, and retag the question with something more likely to attract assembly programmers or experts in the CPU instruction set that interests you. At this point, and given how different that question would be, you might just delete this one and post a new one. – John Bollinger Mar 28 '16 at 14:33
  • @JohnBollinger, I need two semaphores because I need to signal the other guy to post to the main one. And I used language tags because I am using the C/C++ language to illustrate the question (I am not familiar with ASM to illustrate the question in ASM). – SergeyA Mar 28 '16 at 14:46
  • @SergeyA, if you're not prepared to appreciate an answer in terms of CPU instructions, and you reject answers that simply point to the specifications that say it works, then I have no idea what kind of answer you're actually looking for. – John Bollinger Mar 28 '16 at 14:54
  • @JohnBollinger, I am ready to appreciate it, I can read ASM quite well. I am not ready to write ASM for point of illustrating my question. – SergeyA Mar 28 '16 at 15:11

3 Answers

0

It looks like it does this by waiting for signals (controlled by the OS). Here is an excerpt from the POSIX Threads Library for Win32:

      /*
       * If the sema is posted between us being cancelled and us locking
       * the sema again above then we need to consume that post but cancel
       * anyway. If we don't get the semaphore we indicate that we're no
       * longer waiting.
       */
      if (*((sem_t *)sem) != NULL && !(WaitForSingleObject(s->sem, 0) == WAIT_OBJECT_0))
    {
      ++s->value;
#if defined(NEED_SEM)
      if (s->value > 0)
        {
          s->leftToUnblock = 0;
        }
#else
      /*
       * Don't release the W32 sema, it doesn't need adjustment
       * because it doesn't record the number of waiters.
       */
#endif /* NEED_SEM */
    }
      (void) pthread_mutex_unlock (&s->lock);
    }
}

Of course we don't have the Windows source code at our disposal. So the trail for the POSIX implementation ends there.

This is not the case for Linux. sem_wait.c makes use of futex_wait(), and it is in the source for this function that a determination is made as to whether the CPU can support certain operations; for example, does the CPU have a compare-and-exchange instruction? But even there the full capabilities of the architecture aren't fully considered. Functions such as lll_futex_wait() are defined in lowlevellock.h, so for PowerPC we have powerpc/lowlevellock.h, with the following snippet for example:

/* Set *futex to ID if it is 0, atomically.  Returns the old value */
#define __lll_robust_trylock(futex, id) \
  ({ int __val;                                   \
     __asm __volatile ("1:  lwarx   %0,0,%2" MUTEX_HINT_ACQ "\n"          \
               "    cmpwi   0,%0,0\n"                 \
               "    bne 2f\n"                     \
               "    stwcx.  %3,0,%2\n"                \
               "    bne-    1b\n"                     \
               "2:  " __lll_acq_instr                 \
               : "=&r" (__val), "=m" (*futex)                 \
               : "r" (futex), "r" (id), "m" (*futex)              \
               : "cr0", "memory");                    \
     __val;                                   \
  })

So the answer is probably that, where this is implemented on Linux, there is some implementation or workaround in the architecture-specific library that works around 'missing' instructions.
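To make the mechanism concrete, here is a minimal sketch of a counting semaphore built on C11 atomics (the toy_sem_* names are invented; this is not glibc's implementation). The key point is that no core issues a barrier "into" another core's cache: the posting thread runs a release barrier on its own core, the waiting thread runs an acquire barrier on its own core, and the cache-coherence protocol carries the updated cache line between them.

/* Toy counting semaphore built on C11 atomics. Illustrative only:
   the toy_sem_* names are invented and this is not how glibc does it. */
#include <stdatomic.h>
#include <sched.h>

typedef struct { atomic_int count; } toy_sem_t;

void toy_sem_post(toy_sem_t *s)
{
    /* Release: every store this thread made before the post (x = 42,
       y = 142) is ordered before the counter update. The barrier runs
       on the posting core; nothing is done "to" the other cores. */
    atomic_fetch_add_explicit(&s->count, 1, memory_order_release);
}

void toy_sem_wait(toy_sem_t *s)
{
    for (;;) {
        int c = atomic_load_explicit(&s->count, memory_order_relaxed);
        if (c > 0 &&
            atomic_compare_exchange_weak_explicit(&s->count, &c, c - 1,
                                                  memory_order_acquire,
                                                  memory_order_relaxed))
            return;  /* Acquire: the poster's earlier writes are visible now. */
        sched_yield();  /* a real implementation blocks in futex_wait() instead */
    }
}

On x86 the ordering largely comes for free from the architecture; on ARM or PowerPC the compiler emits barrier instructions for the release and acquire sides, much like the lwarx/stwcx. plus __lll_acq_instr sequence shown above.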

wally
  • Not sure how this code is relevant. All I see it windows-specific WaitForSingleObject and pthread_mutex_unlock, which can't be happening from non-owning thread. – SergeyA Mar 28 '16 at 13:53
  • Correct, this is from the POSIX Threads Library for Win32. In Linux the `sem_wait` function would be implemented differently but provide the same functionality. The question asked about POSIX, the OS wasn't specified. – wally Mar 28 '16 at 13:55
  • Indeed. The actual semaphore implementation is necessarily OS- and architecture-specific. POSIX lives on top of that. – Martin James Mar 28 '16 at 14:03
-1

If the initial value of sem1 is 1 and sem2 is 0, then either thread 3 or thread 1 can take the lock; it depends on which thread gets there first. If thread 3 executes first, sem1 becomes 0, so thread 1 cannot take the lock, and thread 2 cannot take it either because of its dependency on thread 1. And if the initial value of sem1 is 0, then no thread can get the lock. I hope this helps.

Shiv
-2

...because that is how it is defined.

Any thread may post a unit to a semaphore.

A semaphore is not primarily a locking mechanism, so 'unlock' is inappropriate.

Semaphores support post and wait.

In the case of complex architectures, e.g. many multi-core processors, the implementation of semaphores and other inter-thread synchronization mechanisms can get quite complex. It might, for example, be necessary to send a synchronization message via the inter-processor driver, hardware-interrupting the other core(s) to force them to handle the synchronization. A hypothetical sketch of what that can look like on Linux follows.
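As a hypothetical sketch (the toy_post name and this code are illustrative, not glibc's actual sem_post): the poster updates the count with release semantics on its own core, then asks the kernel, through the futex syscall, to wake a blocked waiter; if that waiter is to run on another core, the scheduler puts it on that core's runqueue and interrupts the core if necessary.

/* Hypothetical post operation using the Linux futex syscall.
   Not glibc's sem_post; names are made up for illustration. */
#define _GNU_SOURCE
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdatomic.h>

static atomic_int count;   /* the semaphore value, also used as the futex word */

void toy_post(void)
{
    /* The posting core itself executes the release barrier implied here. */
    atomic_fetch_add_explicit(&count, 1, memory_order_release);

    /* Ask the kernel to wake one thread blocked on &count. If that thread
       should run on a different core, the scheduler enqueues it there and,
       if needed, sends that core an inter-processor interrupt. */
    syscall(SYS_futex, &count, FUTEX_WAKE, 1, NULL, NULL, 0);
}

/* A waiter would block with
       syscall(SYS_futex, &count, FUTEX_WAIT, 0, NULL, NULL, 0);
   and, after decrementing the count with an acquire operation, would be
   guaranteed to see everything the poster wrote before toy_post(). */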

Martin James
  • Still not an answer. I have very specific question on how exactly this is achieved. – SergeyA Mar 28 '16 at 13:43
  • Do you have any suggestions on improving the title? I am not very happy with it myself, but I can't come up with a better one. – SergeyA Mar 28 '16 at 13:49
  • Well, now we are getting somewhere. My problem is that I was not able to find any indication that this 'synchro message via the inter-processor driver' exists. – SergeyA Mar 28 '16 at 13:51
  • How about 'How can semaphores be implemented on multi-core systems?' – Martin James Mar 28 '16 at 13:51
  • @SergeyA - how else do you think that comms is achieved across multiple cores? When a process terminates, for example, all its threads must be stopped before other resources, like memory, can be freed off. If a process thread is running on another core than the one requesting the termination, it has to be stopped, and so the other core must be hardware-interrupted. That means an inter-core driver. – Martin James Mar 28 '16 at 13:54
  • This question probably belongs on the 'operating systems' tag, since semaphores are OS-supported synchro primitives. – Martin James Mar 28 '16 at 13:55
  • Process termination is a release operation on its own. Control is switched to the OS, and at that point everything happens: full memory barriers and whatever else is needed. But my endless loop doesn't switch, and doesn't have any memory barriers in it. – SergeyA Mar 28 '16 at 13:56
  • When it wants to stop a thread, the OS does not care about endless loops, barriers, whatever. It forces the core running it to enter the OS with a hardware interrupt. It needs no barriers, atomics, whatever. The thread is gonna die, no matter what it tries to do to escape. – Martin James Mar 28 '16 at 13:59
  • Martin, absolutely. And there is no guarantee whatsoever that modifications done by this thread are persisted. – SergeyA Mar 28 '16 at 14:02