
How should I use new in a multithreaded environment?

Precisely: I have a piece of code that I run with 40 threads. Each thread invokes new a few times. I noticed that performance drops, probably because threads lock in new (significant time is spent in __lll_lock_wait_parallel and __lll_unlock_wait_parallel). What is the best alternative to new / delete I can use?

Jakub M.
  • Is this C or C++? malloc is for C, new for C++. – Luchian Grigore Nov 03 '11 at 08:46
  • C++, thanks. Yes, I use new, not malloc. Just edited question – Jakub M. Nov 03 '11 at 08:47
  • _probably_ due to locks? Tests from even several years ago even indicate that GCC manages no loss in performance for allocation from multiple threads (that is, total time is linear with the allocations, not with the threads). – edA-qa mort-ora-y Nov 03 '11 at 08:57
  • @edA-qa mort-ora-y: thanks, that's a good hint. _Locks_ were my best guess - maybe there is another cause. But there are other `malloc`s (see Will's answer), so there _is_ an issue of multithreaded malloc. – Jakub M. Nov 03 '11 at 09:36
  • 2
    In additon to @Will's answer, here is a good list of [Multithreaded Memory Allocators for C/C++](http://stackoverflow.com/q/147298/425817) – ali_bahoo Nov 03 '11 at 11:29

6 Answers


Even if you are using the `new` operator, it is using `malloc` underneath to do the allocation and deallocation. The focus should be on the allocator, not on the API used to reach it, in these circumstances.

TCMalloc is a malloc created at Google specifically for good performance in a multithreaded environment. It is part of google-perf-tools.

Another malloc you might look at is Hoard. It has much the same aims as TCMalloc.
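Neither allocator requires source changes. As a sketch (library paths and names vary by distribution, so treat these as assumptions), you can link one in at build time or preload it at run time:

```shell
# Link the replacement allocator in at build time
g++ -O2 app.cpp -o app -ltcmalloc

# Or swap it in at run time without relinking
LD_PRELOAD=/usr/lib/libtcmalloc.so ./app
```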

Will

I don't know about "the best", but I would try a few things:

  • Reduce the frequency of allocations / frees (might be hard). Just waste memory (but don't leak) if it improves performance.

  • Roll my own per-thread allocator, always allocating / freeing from the same thread, with mmap for the real memory.

To roll your own primitive allocator:

  • Use mmap to obtain a large chunk of memory from the OS
  • Use a data structure (linked list, tree etc) to keep track of free and used blocks
  • Never free data allocated by another thread

I don't consider this trivial to do but if done right it could improve performance. The hairiest part is by far keeping track of the allocations, preventing fragmentation etc.

A simple implementation is provided in "The C Programming Language", near the end of the book (but it uses brk IIRC).

cnicutar
  • Could you elaborate the second point, about "per-thread allocator"? – Jakub M. Nov 03 '11 at 08:50
  • @downvoter I really love it when you downvote and don't comment. – cnicutar Nov 03 '11 at 09:11
  • I downvoted because I don't think this a good solution to recommend. At the detail level, mmap is for mapping files; if you want to reserve a large address range use malloc to get memory for your custom allocator. Also it doesn't play well with new and delete. Thirdly its a linear allocator. On consideration, bad advice. – Will Nov 03 '11 at 09:44
  • @Will I can agree with the `new` part (my solution is constructor oblivious). But you are aware some `malloc` implementations call `mmap`, right? (I think `jemalloc` does this, and other BSD ones too). **`mmap` can be used to reserve memory**. In any case, I have *no problem whatsoever* with the downvote if it's accompanied by a comment. BTW, what do you mean by me suggesting a "linear allocator"? – cnicutar Nov 03 '11 at 09:48
  • @cnicutar yes I'm a bit of an allocator junkie and have commercial implementations under my belt back at Symbian. That said, I would hesitate to tread the same path these days. Suggesting someone try to make their own allocator instead of using, say, jemalloc or tcmalloc is just bad advice imo. Linear was not a valid criticism. Anonymous mappings in mmap are not POSIX, but sure, everyone likely does them. My concern is ever suggesting anyone make their own malloc; it's not a profitable general-purpose solution to anything. – Will Nov 03 '11 at 09:52
  • @Will We can agree to disagree :-) I myself like to roll my own things (and I must admit I didn't know about `tcmalloc`). – cnicutar Nov 03 '11 at 09:54

I think you should use a memory pool: allocate all the memory you need (if the sizes are fixed) up front when your program starts, and hand out the memory each consumer needs from that first allocation.

sam

I tend to use object pools in servers and other such apps that are characterized by continual and frequent allocation and release of large numbers of a few sets of objects, (in servers - socket, buffer and buffer-collection classes). The pools are queues, created at startup with an appropriate number of instances pushed on, (eg. my server - 24000 sockets, 48000 collections and an array of 7 pools of buffers of varying size/count). Popping an object instance off a queue and pushing it back on is much quicker than new/delete, even if the pool queue has a lock because it is shared across the threads, (the smaller the lock span, the smaller the chance of contention). My pooled-object class, (from which all the sockets etc. are inherited), has a private 'myPool' member, (loaded at startup), and a 'release()' method with no parameters & so any buffer is easily and correctly returned to its own pool. There are issues:

1) Ctor and dtor are not called upon allocate/release & so allocated objects contain all the gunge left over from their last use. This can occasionally be useful, (eg. re-useable socket objects), but generally means that care needs to be taken over, say, the initial state of booleans, value of int's etc.

2) A pool per thread has the greatest performance improvement potential - no locking required - but in systems where the loading on each thread is intermittent, this can waste objects. I never seem to be able to get away with it, mainly because I use pooled objects for inter-thread comms and so release() has to be thread-safe anyway.

3) Elimination of 'false sharing' on shared pools can be awkward - each instance should be initially 'newed' so as to exclusively use up an integer number of cache pages. At least this only has to be done once at startup.

4) If the system is to be resilient upon a pool running out, either more objects need to be allocated to add to the pool when needed, (the pool size is then creeping up), or a producer-consumer queue can be used so that threads block on the pool until objects are released, (P-C queues are slower because of the condvar/semaphore/whatever for waiting threads to block on, also threads that allocate before releasing can deadlock on an empty pool).

5) Monitoring of the pool levels during development is required so that object leakages and double-releases can be detected. Code/data can be added to the objects/pools to detect such errors as they happen but this compromises performance.

Martin James

First, do you really have to `new` that thing? Why not use a local variable or a per-thread heap object?

Second, have a look at thread-local storage (http://en.wikipedia.org/wiki/Thread-local_storage) if your development environment supports it.

Malkocoglu

Since nobody mentioned it, I might also suggest trying Boehm's conservative garbage collector; this means using `new(gc)` instead of `new` and `GC_malloc` instead of `malloc`, and not bothering to free or delete memory objects. A couple of years ago I measured `GC_malloc` against the system `malloc`; it was a bit slower (perhaps 25µs for `GC_malloc` versus 22µs for system `malloc`).

I have no idea of the performance of Boehm's GC in multi-threaded usage (but I do know it can be used in multi-threaded applications).

Boehm's GC has the advantage that you should not care about free-ing your data.

Basile Starynkevitch