
Compiler: clang++, x86-64, on Linux.

It has been a while since I have written any intricate low-level systems code, and I usually program against the system primitives (Windows and pthreads/POSIX), so the ins and outs have slipped from my memory. I am working with boost::asio and boost::thread at the moment.

In order to emulate synchronous RPC against an asynchronous function executor (a boost::asio::io_service with multiple threads calling io_service::run, and requests being io_service::post'ed), I am using the Boost synchronization primitives. For curiosity's sake I decided to sizeof the primitives. This is what I see:

#include <iostream>
#include <boost/thread/mutex.hpp>
#include <boost/thread/condition_variable.hpp>

struct notification_object
{
  bool ready;
  boost::mutex m;
  boost::condition_variable v;
};

int main()
{
  std::cout << sizeof(bool) << std::endl;
  std::cout << sizeof(boost::mutex) << std::endl;
  std::cout << sizeof(boost::condition_variable) << std::endl;
  std::cout << sizeof(notification_object) << std::endl;
}

Output:

1
40
88
136

Forty bytes for a mutex?? WTF! And 88 for a condition_variable!!! Please keep in mind that I'm repulsed by this bloated size because I am thinking of an application that could create hundreds of notification_objects.
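
For context, each notification_object is used roughly like this (a minimal sketch of the pattern; the function names are illustrative, not my actual code):

#include <boost/thread/locks.hpp>  // boost::unique_lock, boost::lock_guard

// Caller side: block until an io_service worker signals completion.
void wait_ready(notification_object& n)
{
    boost::unique_lock<boost::mutex> lock(n.m);
    while (!n.ready)       // loop guards against spurious wakeups
        n.v.wait(lock);
}

// Worker side: invoked from a handler posted to the io_service.
void signal_ready(notification_object& n)
{
    boost::lock_guard<boost::mutex> lock(n.m);
    n.ready = true;
    n.v.notify_one();
}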

This level of overhead for portability seems ridiculous. Can someone justify it? As far as I can remember, these primitives should be 4 or 8 bytes wide, depending on the memory model of the CPU.

Hassan Syed
  • How can you construe that the types are 'bloated' for portability and not, e.g., for functionality? – Luc Danton Jul 25 '11 at 13:00
  • That might very well be; however, from the documentation the functionality does not go beyond what a system-specific library allows you to do. If you think it's due to functionality, please make an argument for it as an answer :D – Hassan Syed Jul 25 '11 at 13:03

4 Answers


When you look at the "size overhead" of any synchronization primitive, keep in mind that these cannot be packed too closely. That is because, e.g., two mutexes sharing a cache line would end up in cache-line thrashing (false sharing) if they're in use concurrently, even if the threads acquiring these locks never "conflict". I.e., imagine two threads running two loops:

for (;;) {
    lockA.lock();     // lockA: e.g. a boost::mutex
    lockA.unlock();
}

and

for (;;) {
    lockB.lock();     // lockB: a second, independent mutex
    lockB.unlock();
}

Run on two different threads, you will see twice the total number of iterations of one thread running one loop, if and only if the two locks are not within the same cache line. If lockA and lockB are in the same cache line, the number of iterations per thread will halve, because the cache line holding those two locks will permanently bounce between the CPU cores executing the two threads.

Hence, even though the actual data size of the primitive type underlying a spinlock or mutex might be only a byte or a 32-bit word, the effective data size of such an object is often larger.

Keep that in mind before asserting "my mutexes are too large". In fact, on x86/x64, 40 bytes is too small to prevent false sharing, as cache lines there are currently at least 64 bytes.
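
To illustrate, here is one way to give a lock a full cache line to itself (a sketch using C++11 alignas; the 64-byte figure is an assumption matching current x86/x64 cache lines, and this is not something Boost does for you):

#include <boost/thread/mutex.hpp>

// Hypothetical wrapper: align and pad each lock to a whole cache line,
// so adjacent instances in an array can never share one.
struct alignas(64) padded_mutex {
    boost::mutex m;
};

static_assert(sizeof(padded_mutex) % 64 == 0,
              "each padded_mutex occupies whole cache lines");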

Beyond that, if you're highly concerned about memory usage, consider that notification objects need not be unique: one condition variable can serve to signal different events (via the predicate that boost::condition_variable can wait on). It is therefore possible to use a single mutex/CV pair for a whole state machine instead of one such pair per state, as sketched below. The same goes for, e.g., thread pool synchronization: having more locks than threads is not necessarily beneficial.
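
A minimal sketch of that idea (the states and names are illustrative assumptions, not taken from the question):

#include <boost/thread/mutex.hpp>
#include <boost/thread/condition_variable.hpp>
#include <boost/thread/locks.hpp>

enum state_t { IDLE, REQUEST_SENT, REPLY_READY };

// One mutex/CV pair serves the whole state machine; waiters differ
// only in the predicate (target state) they wait for.
struct state_machine {
    boost::mutex m;
    boost::condition_variable cv;
    state_t state;

    state_machine() : state(IDLE) {}

    void wait_for(state_t target) {
        boost::unique_lock<boost::mutex> lock(m);
        while (state != target)   // predicate re-checked on every wakeup
            cv.wait(lock);
    }

    void transition_to(state_t next) {
        boost::lock_guard<boost::mutex> lock(m);
        state = next;
        cv.notify_all();          // wake all waiters; each re-checks
    }
};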

Edit: For a few more references on "false sharing" (and the negative performance impact caused by hosting multiple atomically-updated variables within the same cache line), see, amongst others, the various SO postings on the topic.

As said, when using multiple "synchronization objects" (whether atomically-updated variables, locks, semaphores, ...) in a multi-core, cache-per-core configuration, allow each of them a separate cache line of space. You're trading memory usage for scalability here, but really: if you get into the region where your software needs several million locks (making that GBs of memory), you either have the funding for a few hundred GB of memory (and a hundred CPU cores), or you're doing something wrong in your software design.

In most cases (a lock / an atomic for a specific instance of a class / struct), you get the "padding" for free, as long as the object instance that contains the atomic variable is large enough.

FrankH.
  • Wow, there's a hell of a lot of background information in this post that I never would have been able to uncover on my own. +1 upboats. – Shotgun Ninja May 22 '13 at 14:29
  • Maybe I misunderstand your answer, but I really don't see the relevance: you say yourself that the size of a `boost::mutex` is not enough to prevent false sharing, so this is obviously not the reason. Also, false sharing is no more relevant to a mutex than to any other object; would you suppose that any datatype should have at least 64 bytes to prevent it? You prevent false sharing by adding padding bytes between the objects, or by specifying the alignment, not by artificially increasing the object's size. – MikeMB Sep 04 '15 at 17:46
  • @MikeMB: you're mixing two things up. The size of `boost::mutex` _not_ being large enough _is_ the reason why false sharing can occur, namely when _two_ `boost::mutex` instances are [at least partly] within the same cache line. And what do you think is the difference between "artificially increasing the object size" and "adding [padding] bytes between"? That's one and the same. What you do depends on your use case: to tightly pack arrays of locking primitives, you must increase the size; to co-host a 'protected' object with the locking primitive, pack both into the same cache line. – FrankH. Oct 29 '15 at 14:37
  • @FrankH.: If I "artificially increase the object size" by adding padding bytes INTO the data type, the `sizeof` operator will return the size of the data structure including the additional padding bytes. If I add bytes between two objects, that is not the case. Now, assuming that what you call the *effective data size* is what the `sizeof` operator returns (that's what the OP was asking about, after all), the fact that `boost::mutex` is smaller than a cache line rather indicates that it was NOT padded. – MikeMB Oct 29 '15 at 18:21
  • Oh, and the reason I think it is better not to pad the class itself is that this allows the user to decide whether he needs two objects on separate cache lines, which, as you stated yourself, is not always the case. Why would you deny the user of your class that choice? – MikeMB Oct 29 '15 at 19:41

On my 64-bit Ubuntu box, the following:

#include <pthread.h>
#include <stdio.h>

int main() {
  printf("sizeof(pthread_mutex_t)=%ld\n", sizeof(pthread_mutex_t));
  printf("sizeof(pthread_cond_t)=%ld\n", sizeof(pthread_cond_t));
  return 0;
}

prints

sizeof(pthread_mutex_t)=40
sizeof(pthread_cond_t)=48

This indicates that your claim that

This level of overhead for portability seems ridiculous. Can someone justify it? As far as I can remember, these primitives should be 4 or 8 bytes wide, depending on the memory model of the CPU.

is quite simply not true.

In case you're wondering where the extra 40 bytes taken by boost::condition_variable come from, the Boost class uses an internal mutex in addition to the underlying pthread_cond_t: 40 (mutex) + 48 (pthread_cond_t) = 88 bytes.

In a nutshell, on this platform boost::mutex has exactly zero overhead compared to pthread_mutex_t, and boost::condition_variable has the overhead of the extra internal mutex. Whether or not the latter is acceptable for your application is for you to decide.

P.S. I would encourage you to stick to the facts and avoid using inflammatory language in your posts. I for one very nearly decided to ignore your post purely for its tone.

NPE
  • 2
    On windows, the variables you see in user code are likely to be handles refering to a larger data structure somewhere - make sure you take that into account. – jcoder Jul 25 '11 at 13:10
  • Yes that makes sense I guess, Thanks. I guess I'll need to investigate why that much many bytes are needed for a mutex. – Hassan Syed Jul 25 '11 at 13:15

Looking at the implementation:

class mutex : private noncopyable
{
public:
    friend class detail::thread::lock_ops<mutex>;

    typedef detail::thread::scoped_lock<mutex> scoped_lock;

    mutex();
    ~mutex();

private:
#if defined(BOOST_HAS_WINTHREADS)
    typedef void* cv_state;
#elif defined(BOOST_HAS_PTHREADS)
    struct cv_state
    {
        pthread_mutex_t* pmutex;
    };
#elif defined(BOOST_HAS_MPTASKS)
    struct cv_state
    {
    };
#endif
    void do_lock();
    void do_unlock();
    void do_lock(cv_state& state);
    void do_unlock(cv_state& state);

#if defined(BOOST_HAS_WINTHREADS)
    void* m_mutex;
#elif defined(BOOST_HAS_PTHREADS)
    pthread_mutex_t m_mutex;
#elif defined(BOOST_HAS_MPTASKS)
    threads::mac::detail::scoped_critical_region m_mutex;
    threads::mac::detail::scoped_critical_region m_mutex_mutex;
#endif
};

Now, let me strip the non-data parts and reorder:

class mutex : private noncopyable {
private:
#if defined(BOOST_HAS_WINTHREADS)
    void* m_mutex;
#elif defined(BOOST_HAS_PTHREADS)
    pthread_mutex_t m_mutex;
#elif defined(BOOST_HAS_MPTASKS)
    threads::mac::detail::scoped_critical_region m_mutex;
    threads::mac::detail::scoped_critical_region m_mutex_mutex;
#endif
};

So, apart from noncopyable, I don't see much overhead beyond what the system's own mutexes incur.

orlp

Sorry to post this as an answer, but I don't have enough reputation to add a comment.

@FrankH, cache thrashing is not a good justification for making a data structure bigger. There are cache lines that can be as large as 128 bytes; that doesn't mean a mutex must be that big.

I think programmers must be warned to separate synchronization objects in memory so they don't share the same cache line. This can be achieved by embedding the object in a big enough data structure, without bloating the data structure with unused bytes. On the other hand, inserting unused bytes can degrade program speed, because the CPU has to fetch more cache lines to access the same structure.

@Hassan Syed, I don't think mutexes were programmed with this kind of cache optimization in mind. Instead, I think they are programmed this way to support things like priority inheritance, nested locks, and so on. As a suggestion, if you need a lot of mutexes in your program, consider something like a pool (array) of mutexes and store just an index in your nodes (of course taking care of memory separation), as in the sketch below. I leave the details of this solution for you to think about.
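
A minimal sketch of that pool idea (the pool size and the address-hashing scheme are illustrative assumptions):

#include <boost/thread/mutex.hpp>
#include <cstdint>
#include <cstddef>

// A fixed pool of mutexes shared by many nodes; each node derives a
// small index instead of owning a 40-byte mutex of its own.
class mutex_pool {
public:
    static const std::size_t N = 64;  // pool size: tune to contention needs

    // Map an object to a slot, e.g. by hashing its address
    // (>> 6 discards the low bits that vary within a 64-byte line).
    static std::size_t index_for(const void* obj) {
        return (reinterpret_cast<std::uintptr_t>(obj) >> 6) % N;
    }

    static boost::mutex& get(std::size_t i) { return pool_[i]; }

private:
    static boost::mutex pool_[N];
};

boost::mutex mutex_pool::pool_[mutex_pool::N];

// Usage: a node locks "its" mutex via the pool; unrelated nodes that
// hash to the same slot simply share a lock.
struct node {
    void lock() const   { mutex_pool::get(mutex_pool::index_for(this)).lock(); }
    void unlock() const { mutex_pool::get(mutex_pool::index_for(this)).unlock(); }
};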

Will