Does the typical malloc (for the x86-64 platform and Linux OS) naively lock a mutex at the beginning and release it when done, or does it lock a mutex in a more clever way at a finer level, so that lock contention is reduced for concurrent calls? If it indeed does it the second way, how does it do it?

- What is the context where you have seen that? Any quoted code or reference? – Raulp May 22 '12 at 17:04
- @softly: I am asking, not saying. – pythonic May 22 '12 at 17:05
3 Answers
glibc 2.15 operates multiple allocation arenas. Each arena has its own lock. When a thread needs to allocate memory, malloc() picks an arena, locks it, and allocates memory from it.
The mechanism for choosing an arena is somewhat elaborate and is aimed at reducing lock contention:
```c
/* arena_get() acquires an arena and locks the corresponding mutex.
   First, try the one last locked successfully by this thread. (This
   is the common case and handled with a macro for speed.) Then, loop
   once over the circularly linked list of arenas. If no arena is
   readily available, create a new one. In this latter case, `size'
   is just a hint as to how much memory will be required immediately
   in the new arena. */
```
With this in mind, malloc() basically looks like this (edited for brevity):
```c
mstate ar_ptr;
void *victim;

arena_lookup(ar_ptr);
arena_lock(ar_ptr, bytes);
if (!ar_ptr)
  return 0;

victim = _int_malloc(ar_ptr, bytes);
if (!victim) {
  /* Maybe the failure is due to running out of mmapped areas. */
  if (ar_ptr != &main_arena) {
    (void)mutex_unlock(&ar_ptr->mutex);
    ar_ptr = &main_arena;
    (void)mutex_lock(&ar_ptr->mutex);
    victim = _int_malloc(ar_ptr, bytes);
    (void)mutex_unlock(&ar_ptr->mutex);
  } else {
    /* ... or sbrk() has failed and there is still a chance to mmap() */
    ar_ptr = arena_get2(ar_ptr->next ? ar_ptr : 0, bytes);
    (void)mutex_unlock(&main_arena.mutex);
    if (ar_ptr) {
      victim = _int_malloc(ar_ptr, bytes);
      (void)mutex_unlock(&ar_ptr->mutex);
    }
  }
} else
  (void)mutex_unlock(&ar_ptr->mutex);
return victim;
```
This allocator is called ptmalloc. It is based on earlier work by Doug Lea and is maintained by Wolfram Gloger.
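To make the arena-selection strategy concrete, here is a minimal sketch of the idea described in the quoted comment: try the arena this thread locked last, then trylock once around the circular list, and create a fresh arena only when everything is contended. This is not glibc's actual code; toy_arena_get, arena_t, and list_lock are hypothetical names, and real details (arena limits, safe concurrent list traversal, per-arena heap setup) are elided.

```c
#include <pthread.h>
#include <stdlib.h>

typedef struct arena {
    pthread_mutex_t mutex;
    struct arena   *next;   /* circular singly linked list of arenas */
    /* ... per-arena heap state would live here ... */
} arena_t;

static arena_t main_arena = { PTHREAD_MUTEX_INITIALIZER, &main_arena };
static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;

/* Each thread remembers the arena it used last (the common case). */
static __thread arena_t *thread_arena = &main_arena;

static arena_t *toy_arena_get(void)
{
    arena_t *a = thread_arena;

    /* 1. Fast path: retake the arena this thread locked last time. */
    if (pthread_mutex_trylock(&a->mutex) == 0)
        return a;

    /* 2. Loop once over the circular list, taking any uncontended arena. */
    for (arena_t *p = a->next; p != a; p = p->next)
        if (pthread_mutex_trylock(&p->mutex) == 0)
            return thread_arena = p;

    /* 3. Everything is busy: create a new arena and lock it. */
    arena_t *fresh = malloc(sizeof *fresh);
    if (!fresh)
        return NULL;
    pthread_mutex_init(&fresh->mutex, NULL);
    pthread_mutex_lock(&fresh->mutex);

    pthread_mutex_lock(&list_lock);          /* publish it on the ring */
    fresh->next = main_arena.next;
    main_arena.next = fresh;
    pthread_mutex_unlock(&list_lock);

    return thread_arena = fresh;
}
```

In real glibc the number of arenas is capped (see M_ARENA_MAX in mallopt(3)), so step 3 eventually stops creating arenas and threads start sharing them instead.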

Doug Lea's malloc used coarse locking (or no locking, depending on the configuration settings), where every call to malloc/realloc/free is protected by a global mutex. This is safe but can be inefficient in highly multithreaded environments.
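For contrast, here is a minimal sketch of that coarse-locking scheme, assuming a single-threaded allocator core behind hypothetical core_malloc/core_free routines. Every thread funnels through one global mutex, which is roughly what dlmalloc's USE_LOCKS configuration provides:

```c
#include <pthread.h>
#include <stddef.h>

static pthread_mutex_t malloc_lock = PTHREAD_MUTEX_INITIALIZER;

/* Single-threaded allocator core (assumed to exist elsewhere). */
void *core_malloc(size_t n);
void  core_free(void *p);

void *coarse_malloc(size_t n)
{
    pthread_mutex_lock(&malloc_lock);   /* all threads contend here */
    void *p = core_malloc(n);
    pthread_mutex_unlock(&malloc_lock);
    return p;
}

void coarse_free(void *p)
{
    pthread_mutex_lock(&malloc_lock);
    core_free(p);
    pthread_mutex_unlock(&malloc_lock);
}
```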
ptmalloc2, which is the basis of the default malloc implementation in the GNU C library (libc) used on most Linux systems these days, has a more fine-grained strategy, as described in aix's answer, which allows multiple threads to safely allocate memory concurrently.
nedmalloc is another independent implementation which claims even better multithreaded performance than ptmalloc3 and various other allocators. I don't know how it works, and there doesn't seem to be any obvious documentation, so you'll have to check the source code.

- I'm still undecided on whether nedmalloc is a real engineering feat or SEO spam... :-) – R.. GitHub STOP HELPING ICE May 22 '12 at 18:39
- Also tcmalloc from Google, which uses locks on buckets sized to your request. Better thread performance with less contention, more surplus allocation. – evil otto May 22 '12 at 20:09
- @R..: It does look a little suspect at first glance, but it has source code so you can benchmark it yourself (I have not done so). Doug Lea also says "If you are using malloc in a concurrent program, consider instead using nedmalloc or ptmalloc" in the comments for `dlmalloc.c`. So I think it's probably legit. – Adam Rosenfield May 22 '12 at 21:06
In addition to ptmalloc mentioned by @NPE, there is also tcmalloc, which is offered by Google and, under certain scenarios, has been measured to run slightly faster than ptmalloc (e.g., when performing 10^6 malloc() calls and then freeing them).

tcmalloc uses a global heap and per-thread heaps. Only the global heap is protected by a mutex; each per-thread heap is owned by a single thread and requires no locking. A thread pulls storage from the global heap as needed, and when it frees storage, the memory goes onto its per-thread heap. If things become imbalanced (a per-thread heap grows too large), the thread moves storage back to the global heap.
The source code is available at https://github.com/google/tcmalloc.
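As a rough sketch of that design (not tcmalloc's actual code; cached_alloc, cached_free, BLOCK_SIZE, and CACHE_LIMIT are hypothetical names, and a single size class stands in for tcmalloc's many), the per-thread heap is a thread-local free list, and the mutex is taken only when that list is empty or overgrown:

```c
#include <pthread.h>
#include <stdlib.h>

#define BLOCK_SIZE   64   /* one fixed size class, for brevity */
#define CACHE_LIMIT  64   /* hypothetical "imbalance" threshold */

typedef struct block { struct block *next; } block_t;

static pthread_mutex_t global_lock = PTHREAD_MUTEX_INITIALIZER;
static block_t *global_list;              /* global heap: shared, locked */

static __thread block_t *local_list;      /* per-thread heap: no locking */
static __thread int      local_count;

void *cached_alloc(void)
{
    /* Fast path: pop from the per-thread heap, no mutex involved. */
    if (local_list) {
        block_t *b = local_list;
        local_list = b->next;
        local_count--;
        return b;
    }
    /* Slow path: the only place the mutex is taken. */
    pthread_mutex_lock(&global_lock);
    block_t *b = global_list;
    if (b)
        global_list = b->next;
    pthread_mutex_unlock(&global_lock);
    return b ? (void *)b : malloc(BLOCK_SIZE);
}

void cached_free(void *p)
{
    /* Fast path: push onto the per-thread heap, no mutex involved. */
    block_t *b = p;
    b->next = local_list;
    local_list = b;

    /* Imbalance: hand half the cache back to the global heap. */
    if (++local_count > CACHE_LIMIT) {
        pthread_mutex_lock(&global_lock);
        while (local_count > CACHE_LIMIT / 2) {
            block_t *v = local_list;
            local_list = v->next;
            v->next = global_list;
            global_list = v;
            local_count--;
        }
        pthread_mutex_unlock(&global_lock);
    }
}
```

Real tcmalloc maintains many size classes with central free lists rather than a single list, but the locking pattern is the same: the common path touches only thread-local state and never contends on a mutex.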
