boost::shared_mutex slower than a coarse std::mutex on Linux

Question

I have a std::unordered_map which is subjected to a very read-heavy workload from multiple threads. I could use a std::mutex for synchronization, but since concurrent reads should be fine, I wanted to use a boost::shared_mutex instead. To test the performance improvement, I first pre-populate a map with a bunch of values and then have a bunch of threads run a read test:

for (int i = 0; i < iters; ++i) map.count(random_uint(0, key_max));

I run this for my coarse-lock implementation where count is protected by std::lock_guard<std::mutex> and for my shared-lock implementation where it is protected by boost::shared_lock<boost::shared_mutex>.

On my Arch Linux x86_64 system with GCC 6.1.1 the boost::shared_lock version is always slower! On my friend's Windows 10 system with MSVC 2013, the boost::shared_lock is always faster! The complete, compilable code is on github: https://github.com/silverhammermba/sanity

Edit

This seems to be a platform-specific issue. See above. I would really appreciate if anyone else could build and run this code and report whether they saw a positive output (shared_lock is faster) or negative (course mutex is faster) and what platform you're using.

Not sure why you are puzzled. shared_mutex is always going to be slower than normal mutex, RW-lock is extra on top of normal lock. As an excercise, try implementing RW lock yourself. — SergeyA, May 18 '16 at 15:05
Why are you getting the times in the threads? Shouldn't you just just get the start time before the two loops in `read_test` and then get the end time after the last loop exits? — NathanOliver, May 18 '16 at 15:06
@NathanOliver I want to make sure that the start time is recorded only after all threads are running, and I don't want to include time waiting for threads to start in that. — Max, May 18 '16 at 15:09
@SergeyA In this test I'm not even using the W part of the lock. I'm timing **only** the reads. If that's slower than every thread waiting its turn, what's the point of using a RW-lock at all? — Max, May 18 '16 at 15:11
@Max But you do not know when all the threads end then. If you only get the start and stop from the first thread you may not be getting the correct results. Since the thread creation and destruction should be a constant it should not hurt the result to include that in both test cases. — NathanOliver, May 18 '16 at 15:15
@NathanOliver The creation/destruction is not constant if I'm comparing across different numbers of threads, which this test is also used for. Also the barriers around the start/end time ensure that the end time is only recorded once all threads have finished work. — Max, May 18 '16 at 15:21
@MaximEgorushkin just check the repo. All writes occur before any reads and before the benchmark timer is started. — Max, May 18 '16 at 16:23

score 10 · Accepted Answer · answered May 18 '16 at 17:27

It turns out that boost::shared_mutex is "suboptimal" on Linux.

The current (as of boost 1.59) implementation of boost::shared_mutex for 'pthread' is pretty suboptimal as it's using a heavyweight mutex to guard the internal mutex state... [when access concurrency is high] the shared mutex is effectively exclusive.

Hooray for boost and the many hours of my life it has stolen.

score -3 · Answer 2 · answered May 18 '16 at 15:13

Few notes:

If your data structure is suffering from high contention, using the same data structure with lock free implementation is highly recommend
Reader-Writer locks usually bring performance boost when reads are common,but writes are rare. Philosophicly speaking, if the lock has to determine if other thread caught the lock in either with read mode or write mode, it is slower than simply wait for the lock to be released. So if reads are common and writes are rare,other threads are not being blocked . If writes are common, not only the threads are blocked, but they have to perform extra logic to figure out how is locking what.

So for summery, your examples use the lock in a bad way. And also, use lock free programing if performance concures you.

boost::shared_mutex slower than a coarse std::mutex on Linux

Edit

2 Answers2