
I have observed that when Linux futexes are contended, the system spends A LOT of time in spinlocks. This is a problem not only when futexes are used directly, but also when calling malloc/free, rand, glib mutex calls, and other system/library calls that make calls to futex. Is there ANY way of getting rid of this behavior?

I am using CentOS 6.3 with kernel 2.6.32-279.9.1.el6.x86_64. I also tried the latest stable kernel 3.6.6 downloaded directly from kernel.org.

Originally, the problem occurred on a 24-core server with 16GB RAM, running a process with 700 threads. The data collected with "perf record" shows that the spinlock is reached from the futex code, which in turn is called from __lll_lock_wait_private and __lll_unlock_wake_private, and that it eats up about 50% of the CPU time. When I stopped the process with gdb, the backtraces showed that the calls to __lll_lock_wait_private and __lll_unlock_wake_private are made from malloc and free.

To narrow the problem down, I wrote a simple program that confirms it is indeed the futexes that are causing the spinlock problem.

Start 8 threads, with each thread doing the following:

   //...
   static GMutex *lMethodMutex = g_mutex_new ();
   while (true)
   {
      static guint64 i = 0;
      g_mutex_lock (lMethodMutex);
      // Perform any operation in the user space that needs to be protected.
      // The operation itself is not important.  It's the taking and releasing
      // of the mutex that matters.
      ++i;
      g_mutex_unlock (lMethodMutex);
   }
   //...
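
For completeness, a full version of the test program might look roughly like this (just a sketch: it assumes the legacy GLib threading API, i.e. g_thread_init/g_thread_create as shipped with GLib 2.12 on CentOS 6, and the thread count and idle main loop are arbitrary):

   /* Build with something like:
    *   gcc futex_test.c $(pkg-config --cflags --libs glib-2.0 gthread-2.0)
    */
   #include <glib.h>

   static GMutex  *lMethodMutex;
   static guint64  i = 0;

   static gpointer worker (gpointer data)
   {
      (void) data;
      while (TRUE)
      {
         g_mutex_lock (lMethodMutex);
         /* The work itself is irrelevant; only the lock/unlock matters. */
         ++i;
         g_mutex_unlock (lMethodMutex);
      }
      return NULL;
   }

   int main (void)
   {
      int t;

      g_thread_init (NULL);            /* required before using GLib threads on old GLib */
      lMethodMutex = g_mutex_new ();

      for (t = 0; t < 8; ++t)
         g_thread_create (worker, NULL, FALSE, NULL);

      for (;;)
         g_usleep (G_USEC_PER_SEC);    /* keep the process alive while the threads spin */

      return 0;
   }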

I am running this on an 8-core machine, with plenty of RAM.

Using "top", I observed that the machine is 10% idle, 10% in the user mode, and 90% in the system mode.

Using "perf top", I observed the following:

 50.73%  [kernel]                [k] _spin_lock
 11.13%  [kernel]                [k] hpet_msi_next_event
  2.98%  libpthread-2.12.so      [.] pthread_mutex_lock
  2.90%  libpthread-2.12.so      [.] pthread_mutex_unlock
  1.94%  libpthread-2.12.so      [.] __lll_lock_wait
  1.59%  [kernel]                [k] futex_wake
  1.43%  [kernel]                [k] __audit_syscall_exit
  1.38%  [kernel]                [k] copy_user_generic_string
  1.35%  [kernel]                [k] system_call
  1.07%  [kernel]                [k] schedule
  0.99%  [kernel]                [k] hash_futex

I would expect this code to spend some time in the spinlock, since the futex code has to take the spinlock protecting the futex wait queue. I would also expect the code to spend some time in system mode, since in this snippet there is very little code running in user space. However, 50% of the CPU time spent in the spinlock seems excessive, especially when that CPU time is needed to do other useful work.

Alex Fiddler
  • You might want to say a few words about what behaviour you would like to see. I feel that this is not entirely clear. – NPE Nov 08 '12 at 16:43
  • Using a mutex or futex to concurrently increment a variable as in the above example is a bit silly, as this can be done directly with an atomic increment (somewhere from 50 to 500 times more efficient). In "real" code, i.e. code that actually does something, I find contention and time wasted spinning rather negligible details. Real code doesn't compete for a lock from half a dozen threads at a time. – Damon Nov 08 '12 at 19:37
  • Originally, I noticed this to be a problem even when futexes are not called directly from the user code; it happens when calling malloc/free, rand, glib mutex calls, and other system/library calls that make calls to futex. The snippet of code given in the problem description is just to demonstrate the occurrence of the problem, and by no means does it represent any useful work. In fact, the code between the calls to mutex can be any user code. – Alex Fiddler Nov 08 '12 at 19:45

1 Answer


I've run into similar issues as well. My experience is that you may see a performance hit or even deadlocks when locking and unlocking a lot, depending on the libc version and a lot of other obscure things (e.g. calls to fork(), like here).
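
As a rough illustration of the kind of fork()/malloc() interaction I mean (a hypothetical sketch, not code from my project; names and sizes are arbitrary): one thread hammers the allocator while another thread keeps forking, and if a child happens to be created while the allocator's internal lock is held, the first malloc() in the child can block forever on an old enough libc.

   #include <pthread.h>
   #include <stdlib.h>
   #include <unistd.h>
   #include <sys/types.h>
   #include <sys/wait.h>

   static void *alloc_loop (void *arg)
   {
      (void) arg;
      for (;;) {
         void *p = malloc (256);   /* repeatedly takes the allocator's internal lock */
         free (p);
      }
      return NULL;
   }

   int main (void)
   {
      pthread_t t;
      pthread_create (&t, NULL, alloc_loop, NULL);

      for (;;) {
         pid_t pid = fork ();
         if (pid == 0) {
            /* The child inherits only the forking thread.  If the allocator
             * lock was held by the other thread at fork() time and the libc
             * does not reset it correctly, this malloc() never returns. */
            void *p = malloc (256);
            free (p);
            _exit (0);
         }
         waitpid (pid, NULL, 0);
      }
   }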

This guy solved his performance problems by switching to tcmalloc, which may be a good idea anyway depending on the use case. It could be worth a try for you as well.

In my case, I saw a reproducible deadlock when I had multiple threads doing lots of locking and unlocking. I was using a Debian 5.0 rootfs (embedded system) with a libc from 2010, and the issue was fixed by upgrading to Debian 6.0.

Antti
  • I tried jemalloc, and the problem is not happening anymore. This is not surprising, since jemalloc relies much less on locking of the arenas than glibc does. This, however, does not completely solve the problem, since the root cause of the issue is that the futex's spinlock is held for too long, causing all the other executing threads to pile up waiting for the spinlock to be released (as demonstrated by my little snippet of code in the original description of the problem). – Alex Fiddler Nov 27 '12 at 15:16