Why is performance of pthread_mutex so bad on Mac OS X compared to Linux?

Question

I am learning about multi-threaded programming right now, and I noticed that programs that synchronize with a mutex are extremely slow on Mac OS X, to the extent that it is usually better to use a single thread instead. I understand that there are much faster ways of synchronizing, but I still wonder why it is like this. For a simple time measurement, I wrote this program.

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <sys/time.h>

pthread_mutex_t lock;
long s;

double cur_time() {
  struct timeval tp[1];
  gettimeofday(tp, NULL);
  return tp->tv_sec + tp->tv_usec * 1.0E-6;
}

void * func(){
  int n = 1000000;
  while(n > 0){
    pthread_mutex_lock(&lock);
    s++;
    n--;
    pthread_mutex_unlock(&lock);
  }
  return 0;
}

void * thread_func(void * arg_){
  return func();
}

int main(){
  pthread_mutex_init(&lock,NULL);
  s = 0;
  int i;
  pthread_t pids[3];

  double t1 = cur_time();
  for(i = 0; i < 3; i++){
    pthread_create(&pids[i],NULL,thread_func,NULL);
  }
  for(i = 0; i < 3; i++){
    pthread_join(pids[i],0);
  }
  printf("s = %ld\n",s);
  double t2 = cur_time();
  printf("Time consumed: %fs\n",t2 - t1);
}

This program ran for 11.022169 seconds on my MacBook Air (OS X El Capitan), which has 4GB RAM and an Intel Core i5 Dual Core 1.6GHz processor. It ran for only 0.493699 seconds on my other computer, which has Ubuntu 14.04, 16GB RAM, and an Intel Core 17 Octal Core 2.4GHz processor. I understand that there is a significant difference in processing power between these two computers, but I would not expect the difference to be this huge. Besides, when using other locks, for example spinlocks, the difference is never this big.

I would be very grateful if someone could explain the reason for this difference.

Added: I left something out. I also compared spinlock and mutex on each OS separately. On Linux, the spinlock is significantly slower than the mutex with a large number of threads, whereas on Mac OS X the mutex is always much slower, by one or two orders of magnitude.

There is a big difference between the two systems. How can you compare them? 1.6GHz vs 2.4GHz alone would explain a lot. So would the extra cores, which matter directly since you are using threads, and surely the CPU cache as well, which you have not specified. — Iharob Al Asimi, Nov 30 '15 at 04:51
The behavior would not be linear, and I am talking as a physicist. The performance difference is not impressive and I highly doubt it has to do with the operating system. With your Ubuntu Machine you could run OS X in a VM and it would probably be faster than your other machine. Also note that you say Core i5 Dual Core, i5 is 4 cores. And perhaps you meant Core i7 which would be 8 cores, that's twice and I really doubt the relation in the increase of performance would be linear. — Iharob Al Asimi, Nov 30 '15 at 04:56
I get ~13 seconds on my MacBook Pro (2.7 GHz Intel Core i7 quad core) and ~0.1 seconds on a Linux VM (Digital Ocean) (2.4 GHz single core). — Cornstalks, Nov 30 '15 at 04:56
@iharob Thank you. So how do you explain what Cornstalks mentioned? — antande, Nov 30 '15 at 05:01
@Cornstalks Thank you very much. Do you have an idea on why that happened? — antande, Nov 30 '15 at 05:02
Using `iprofiler -timeprofiler` on OS X I find ~72% of the time is spent in `pthread_mutex_lock_wait` (which calls `psynch_mutexwait`) and ~26% of the time in `pthread_mutex_unlock_drop` (which calls `psynch_mutexdrop`). I'm looking at it more... — Cornstalks, Nov 30 '15 at 05:02
@Cornstalks 72% explains very much, but there are like 4 seconds of difference remaining. — Iharob Al Asimi, Nov 30 '15 at 05:07
@iharob: ~99% of execution is spent in the mutex lock/unlock calls. I'm not sure where you're getting 4 unaccounted for seconds from. — Cornstalks, Nov 30 '15 at 05:14
If you put a `usleep(100000)` after the `pthread_create`, the Mac will finish in 0.3 seconds. — user3386109, Nov 30 '15 at 05:17
@Cornstalks Thank you very much for making that clear. So that basically means mutex_lock and mutex_unlock are indeed very slow on OS X, right? Do you have an idea why they are that slow? — antande, Nov 30 '15 at 05:30
@antande It means that OS X is actually running the threads simultaneously (and doing 3 million task switches as a result), whereas Ubuntu runs the threads sequentially (and only does 3 task switches). — user3386109, Nov 30 '15 at 05:34
@user3386109 Now I get what you meant by sleeping. That is a curious situation. Why would Ubuntu schedule the threads that way? I would consider what OS X is doing to be more natural. — antande, Nov 30 '15 at 05:47
I don't know for sure, but I suspect that on Ubuntu the main thread is locked out until each child thread finishes. In other words, calling `pthread_create` switches immediately to the child. And since the run time for the child thread is so short ~50msec, the time slice for the child doesn't expire before the child finishes. (See [this question](http://stackoverflow.com/questions/16401294/how-to-know-linux-scheduler-time-slice) for information about time slices on Linux.) — user3386109, Nov 30 '15 at 06:00
`double t2 = cur_time();` should be before the `printf`. Also the optimization potential of this code is high, so you should tell us your compiler flags. — mch, Nov 30 '15 at 06:03
@mch Actually, there is no optimization potential in this code. The compiler cannot optimize away calls to `pthread_mutex_lock` and `pthread_mutex_unlock`. And as Cornstalks already pointed out, ~99% of the execution time is spent in the `pthread_mutex_lock` and `pthread_mutex_unlock` calls. — user3386109, Nov 30 '15 at 06:12
This doesn't measure mutex performance at all. It just measures context switch time, and fairness. It's also super sensitive to thread launch time. On one platform, the first thread may run to completion before the second thread even starts, eliminating all the contention. — David Schwartz, Nov 30 '15 at 10:00
What happens if you replace the mutex with your own copy-pasted spinlock? What are the running times then? That may shed some light. — David Haim, Oct 04 '18 at 14:42
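
The comments from user3386109 and David Schwartz suggest the result depends heavily on whether the threads ever overlap. Below is a minimal sketch of one way to control for that, assuming a C11 compiler that provides `<stdatomic.h>`: every worker is held at a start gate until all threads exist, and the end timestamp is taken before anything is printed. The names `run_gated`, `worker`, `go`, `wall_seconds`, and `MAX_THREADS` are hypothetical and do not come from the original post.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stddef.h>
#include <sys/time.h>

/* Hypothetical names throughout; a sketch, not the asker's program. */
enum { MAX_THREADS = 16 };

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static atomic_int go;           /* start gate: workers wait until it is 1 */
static long counter;            /* protected by lock */
static long iters_per_thread;   /* written before the threads are created */

static double wall_seconds(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1.0e-6;
}

static void *worker(void *arg) {
    (void)arg;
    /* Wait until every thread has been created, so none gets a head start. */
    while (!atomic_load(&go)) {
        sched_yield();
    }
    for (long i = 0; i < iters_per_thread; i++) {
        pthread_mutex_lock(&lock);
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

/* Runs nthreads workers doing iters increments each.
   Stores the final counter in *total and returns the elapsed seconds,
   or -1.0 if the arguments are out of range or a thread could not start. */
double run_gated(int nthreads, long iters, long *total) {
    pthread_t tids[MAX_THREADS];
    if (nthreads < 1 || nthreads > MAX_THREADS) {
        return -1.0;
    }
    counter = 0;
    iters_per_thread = iters;
    atomic_store(&go, 0);

    int started = 0;
    while (started < nthreads &&
           pthread_create(&tids[started], NULL, worker, NULL) == 0) {
        started++;
    }

    double t0 = wall_seconds();
    atomic_store(&go, 1);                 /* open the gate */
    for (int i = 0; i < started; i++) {
        pthread_join(tids[i], NULL);
    }
    double t1 = wall_seconds();           /* taken before any printing */

    *total = counter;
    return started == nthreads ? t1 - t0 : -1.0;
}
```

Calling `run_gated(3, 1000000, &total)` from `main` and printing the returned value should give a number that reflects contention rather than thread launch order, though how each scheduler behaves under that contention will still differ between systems.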

score 1 · Answer 1 · answered Oct 04 '18 at 14:37

By default, mutexes on macOS are apparently implemented as "fair" mutexes. The downside of this can be significantly reduced performance; see https://blog.mozilla.org/nfroyd/2017/03/29/on-mutex-performance-part-1/

Note that when Mozilla tried switching to FIRSTFIT on Mac (which isn't documented!), we found problems that blocked it: https://bugzilla.mozilla.org/show_bug.cgi?id=1353787#c7
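
For anyone who wants to experiment with the policy, here is a hedged sketch of requesting first-fit at mutex creation. It assumes that the Apple SDK headers in use expose `pthread_mutexattr_setpolicy_np` and the `PTHREAD_MUTEX_POLICY_FIRSTFIT_NP` macro; older SDKs may not, which is why the call sits behind a preprocessor check. On other platforms the code falls back to a default mutex. The function name `init_mutex_prefer_firstfit` is hypothetical.

```c
#include <pthread.h>

/* Initialises *m, requesting the first-fit policy where the headers offer it.
   Returns 0 on success, otherwise the error number reported by pthread.
   Hypothetical helper name; the policy call is an assumption about the SDK. */
int init_mutex_prefer_firstfit(pthread_mutex_t *m) {
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);
    if (rc != 0) {
        return rc;
    }
#if defined(__APPLE__) && defined(PTHREAD_MUTEX_POLICY_FIRSTFIT_NP)
    /* Assumption: only compiled when the Apple headers define the macro. */
    rc = pthread_mutexattr_setpolicy_np(&attr, PTHREAD_MUTEX_POLICY_FIRSTFIT_NP);
#endif
    if (rc == 0) {
        rc = pthread_mutex_init(m, &attr);
    }
    pthread_mutexattr_destroy(&attr);
    return rc;
}
```

Given the problems mentioned in the Bugzilla comment, this seems better suited to measurements than to production code.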

score 0 · Answer 2 · answered Dec 01 '15 at 22:05

On Linux, mutexes are generally implemented in terms of the futex system call. On OS X, locking is significantly more costly because it requires sending a message to the kernel. However, this is notoriously difficult to benchmark correctly, and I haven’t examined your code.

Jon Purdy

On any system, a mutex used too much for too long by many threads in parallel is going to have very bad performance. – curiousguy Nov 10 '19 at 09:25
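
Since this answer notes how hard the benchmark is to get right, one thing that may help is a single-threaded baseline: the same lock/unlock loop with no other thread competing. On Linux an uncontended futex-based lock is normally handled without entering the kernel, so comparing this baseline with the multi-threaded time gives a rough idea of how much of the cost comes from contention. This is only a sketch; `run_uncontended` and `now_sec` are hypothetical names.

```c
#include <pthread.h>
#include <stddef.h>
#include <sys/time.h>

static double now_sec(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1.0e-6;
}

/* Locks and unlocks a private mutex iters times from the calling thread only.
   Stores the elapsed seconds in *elapsed and returns the final count,
   or -1 if the mutex could not be initialised. Hypothetical helper. */
long run_uncontended(long iters, double *elapsed) {
    pthread_mutex_t m;
    long count = 0;
    if (pthread_mutex_init(&m, NULL) != 0) {
        return -1;
    }
    double t0 = now_sec();
    for (long i = 0; i < iters; i++) {
        pthread_mutex_lock(&m);    /* never blocks: no other thread uses m */
        count++;
        pthread_mutex_unlock(&m);
    }
    *elapsed = now_sec() - t0;
    pthread_mutex_destroy(&m);
    return count;
}
```

Running `run_uncontended(3000000, &elapsed)` on both machines and comparing against the three-thread figures would show whether the gap appears only under contention.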
