
One rule every programmer quickly learns about multithreading is:

If more than one thread has access to a data structure, and at least one of the threads might modify that data structure, then you'd better serialize all accesses to that data structure, or you're in for a world of debugging pain.

Typically this serialization is done via a mutex -- i.e. a thread that wants to read or write the data structure locks the mutex, does whatever it needs to do, and then unlocks the mutex to make it available again to other threads.
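For concreteness, here is a minimal C++ sketch of that rule (the SharedList class and its contents are made up purely for illustration):

    #include <cstddef>
    #include <mutex>
    #include <vector>

    // Illustrative only: a structure shared between threads, with every
    // access (reads included) serialized through a single mutex.
    class SharedList {
    public:
        void add(int value) {
            std::lock_guard<std::mutex> lock(mutex_);  // lock before writing
            values_.push_back(value);
        }

        std::size_t size() const {
            std::lock_guard<std::mutex> lock(mutex_);  // lock before reading, too
            return values_.size();
        }

    private:
        mutable std::mutex mutex_;
        std::vector<int> values_;
    };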

Which brings me to the point: the memory-heap of a process is a data structure which is accessible by multiple threads. Does this mean that every call to default/non-overloaded new and delete is serialized by a process-global mutex, and is therefore a potential serialization-bottleneck that can slow down multithreaded programs? Or do modern heap implementations avoid or mitigate that problem somehow, and if so, how do they do it?

(Note: I'm tagging this question linux, to avoid the correct-but-uninformative "it's implementation-dependent" response, but I'd also be interested in hearing about how Windows and MacOS/X do it as well, if there are significant differences across implementations)

Jeremy Friesner
  • Seems I was completely wrong about that: https://stackoverflow.com/questions/796099/c-new-operator-thread-safety-in-linux-and-gcc-4 – πάντα ῥεῖ May 18 '19 at 06:10
  • @πάνταῥεῖ It's not that slow, at least on implementations like glibc, which speed things up significantly, AFAIK by using thread-local pools. – Petr Skocik May 18 '19 at 06:19
  • @PSkocik Yes, using thread-local memory for heap management sounds like a sound implementation. – πάντα ῥεῖ May 18 '19 at 06:21
  • @JeremyFriesner It may be not only OS-dependent, but also dependent on the compiler implementation. That makes your question a bit broad. – πάντα ῥεῖ May 18 '19 at 06:41
  • @πάνταῥεῖ Re, "...use thread-local memory for _the_ heap." Don't forget that _the_ heap can have multiple layers of implementation. It would be quite reasonable for each thread to have its own local cache of free blocks, and have it allocate new blocks from or return excess blocks to a shared heap as needed. – Solomon Slow May 18 '19 at 14:50
  • @SolomonSlow Heap memory management and actual allocation are two different things. – πάντα ῥεῖ May 18 '19 at 14:52

3 Answers


new and delete are thread safe

The following functions are required to be thread-safe:

  • The library versions of operator new and operator delete
  • User replacement versions of global operator new and operator delete
  • std::calloc, std::malloc, std::realloc, std::aligned_alloc, std::free

Calls to these functions that allocate or deallocate a particular unit of storage occur in a single total order, and each such deallocation call happens-before the next allocation (if any) in this order.
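In other words, something like the following sketch is well-defined without any user-level locking (the thread and iteration counts here are arbitrary):

    #include <thread>
    #include <vector>

    int main() {
        std::vector<std::thread> threads;
        for (int i = 0; i < 4; ++i) {
            threads.emplace_back([] {
                // Concurrent new/delete from several threads is required to be
                // thread-safe; no external mutex is needed for correctness.
                for (int j = 0; j < 100000; ++j) {
                    int* p = new int(j);
                    delete p;
                }
            });
        }
        for (auto& t : threads) t.join();
    }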

With gcc, new is implemented by delegating to malloc, and we see that glibc's malloc does indeed use a lock. If you are worried about your allocations causing bottlenecks, write your own allocator.
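One route for that is the user-replacement mechanism listed above: define your own global operator new/operator delete and dispatch to whatever scheme you like. The sketch below merely forwards to malloc/free as a placeholder for a real contention-avoiding pool:

    #include <cstddef>
    #include <cstdlib>
    #include <new>

    // Replacement global allocation functions. The malloc/free calls are
    // placeholders; a real implementation would dispatch to e.g. a
    // per-thread pool to reduce lock contention.
    void* operator new(std::size_t size) {
        for (;;) {
            if (void* p = std::malloc(size ? size : 1))
                return p;
            if (std::new_handler h = std::get_new_handler())
                h();                        // let the handler try to free memory
            else
                throw std::bad_alloc{};
        }
    }

    void operator delete(void* p) noexcept {
        std::free(p);
    }

    // Sized variant; replacing it explicitly keeps both paths consistent.
    void operator delete(void* p, std::size_t) noexcept {
        std::free(p);
    }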

Passer By
  • Don't forget that the heap sometimes needs to grow. That's a request sent to the kernel. This may be a bottleneck that is outside the scope of the language. – Mario May 18 '19 at 08:42
  • @Mario, nope, normally requests to the operating system happen at a very low pace, precisely to avoid such bottlenecks. It's rare that two threads happen to ask for more memory at the same time, and rarer still that two simultaneous allocation requests result in two different memory-expansion requests to the system. – Luis Colorado May 19 '19 at 10:37
  • Often in programming, "rare" is considered worse than "common", because if the rare situation is one you need to avoid, having it occur only once in a blue moon makes the resulting problems harder to reproduce, and therefore harder to understand and then fix or work around. – Jeremy Friesner May 21 '19 at 03:16
  • @JeremyFriesner You would have to provide a lot more context around your use case to make any substantial statements on the behaviour. For a system only concerned about throughput, "rare" means negligible. From the generic description in your post, we can only state the generic. – Passer By May 21 '19 at 07:53

The answer is yes, but in practice it is usually not a problem. If it is a problem for you, you may try replacing your implementation of malloc with tcmalloc, which reduces, but does not eliminate, possible contention (since there is only one heap that needs to be shared among threads and processes).

TCMalloc assigns each thread a thread-local cache. Small allocations are satisfied from the thread-local cache. Objects are moved from central data structures into a thread-local cache as needed, and periodic garbage collections are used to migrate memory back from a thread-local cache into the central data structures.
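A toy sketch of that idea, assuming a single fixed block size (real allocators such as TCMalloc use size classes, batched transfers, and the periodic migration back to the central structures that this toy omits):

    #include <cstddef>
    #include <cstdlib>
    #include <mutex>

    namespace toy {

    // Toy fixed-size allocator: each thread pops from its own free list and
    // only touches the mutex-protected central list when its cache is empty.
    struct Node { Node* next; };
    constexpr std::size_t kBlockSize = 64;
    static_assert(kBlockSize >= sizeof(Node), "blocks must hold a link");

    std::mutex central_mutex;
    Node* central_list = nullptr;             // shared; a real allocator refills this

    thread_local Node* local_list = nullptr;  // per-thread cache, no lock needed

    void* allocate() {
        if (!local_list) {
            std::lock_guard<std::mutex> lock(central_mutex);
            if (central_list) {               // pull one block from the central list
                local_list = central_list;
                central_list = central_list->next;
                local_list->next = nullptr;
            }
        }
        if (!local_list)
            return std::malloc(kBlockSize);   // fall back to the system heap
        Node* n = local_list;
        local_list = n->next;
        return n;
    }

    void deallocate(void* p) {
        Node* n = static_cast<Node*>(p);      // recycle into the local cache;
        n->next = local_list;                 // a real allocator would migrate
        local_list = n;                       // excess blocks back to the center
    }

    } // namespace toy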

There are also other options, such as custom allocators, specialized containers, or redesigning your application.

NoSenseEtAl

Even though you tried to head off the "it's architecture/system dependent" answer, note that the need to serialize accesses between threads really only arises when a heap grows or shrinks, i.e. when the program needs to expand it or return part of it to the system.

The first answer has to be simply "it's implementation dependent", without any system dependency, because normally libraries obtain large chunks of memory on which to base the heap and administer them internally, which makes the problem effectively independent of the operating system and architecture.

The second answer is that, of course, if you have only one single heap for all the threads, you have a possible bottleneck in case all of the active threads compete for a single chunk of memory. There are several approaches to this: you can maintain a pool of heaps to allow parallelism, and make different threads use different pools for their requests. Note that the largest potential problem is in requesting memory, as that is where the bottleneck occurs; on returning memory there is no such issue, because you can act more like a garbage collector, queuing the returned chunks for a dedicated thread to dispatch and put back in their proper places to preserve the heaps' integrity.

Having multiple heaps even allows you to classify them by priority, by chunk size, etc., so the risk of collision is kept low by the class of problem you are dealing with. This is the case in operating system kernels like *BSD, which use several memory heaps, each somewhat dedicated to the kind of use it is going to receive (there's one for the disk I/O buffers, one for virtual-memory-mapped segments, one for process virtual-memory-space management, etc.)
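As a rough illustration of the pool-of-heaps idea, the sketch below picks an arena by hashing the thread id (the arena count and the malloc placeholder are made up); threads mapped to different arenas never contend on the same lock:

    #include <cstddef>
    #include <cstdlib>
    #include <functional>
    #include <mutex>
    #include <thread>

    constexpr std::size_t kArenas = 8;      // arbitrary pool size

    struct Arena {
        std::mutex mutex;
        // ...per-arena free lists and bookkeeping would live here...
    };

    Arena arenas[kArenas];

    // Map each thread to one arena of the pool by hashing its id.
    Arena& arena_for_this_thread() {
        std::size_t h = std::hash<std::thread::id>{}(std::this_thread::get_id());
        return arenas[h % kArenas];
    }

    void* pooled_alloc(std::size_t size) {
        Arena& a = arena_for_this_thread();
        std::lock_guard<std::mutex> lock(a.mutex);  // only this arena is locked
        return std::malloc(size);                   // placeholder for arena-local heap
    }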

I recommend reading The Design and Implementation of the FreeBSD Operating System, which explains very well the approach used in the kernel of BSD systems. It is general enough that a great percentage of other systems probably follow the same or a very similar approach.

Luis Colorado