
I've been running some experiments with the OpenMP framework and found some odd results that I'm not sure how to explain.

My goal is to create a huge matrix and then fill it with values. I made some parts of my code parallel loops in order to gain performance from my multithreaded environment. I'm running this on a machine with two quad-core Xeon processors, so I can safely run up to 8 concurrent threads on it.

Everything works as expected, but for some reason the for loop that actually allocates the rows of my matrix has an odd performance peak when running with only 3 threads. From there on, adding more threads just makes the loop take longer, with 8 threads actually taking more time than it would need with only one.

This is my parallel loop:

    #include <vector>
    using std::vector;

    int width = 11;
    int height = 39916800;
    int i;               // loop index, declared outside so private(i) applies
    int chunk = 100;     // dynamic-schedule chunk size (actual value not shown in the snippet)
    vector<vector<int> > matrix;
    matrix.resize(height);
    #pragma omp parallel shared(matrix, width, height) private(i) num_threads(3)
    {
      #pragma omp for schedule(dynamic, chunk)
      for (i = 0; i < height; i++) {
        matrix[i].resize(width);   // each row's resize() allocates width ints on the heap
      }
    } /* End of parallel block */

This made me wonder: is there a known performance problem with calling malloc (which I suppose is what the resize method of the vector template class is actually calling) in a multithreaded environment? I found some articles about performance loss when freeing heap space in a multithreaded environment, but nothing specific about allocating new space, as in this case.

Just to give you an example, below is a graph of the time it takes for the loop to finish as a function of the number of threads, for both the allocation loop and a normal loop that just reads data from this huge matrix later on.

[Graph: Parallel region 1, where there is allocation]

[Graph: Parallel region 2, where there is no allocation]

Both times were measured using the gettimeofday function and seem to return very similar and accurate results across different execution instances. So, does anyone have a good explanation?
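
(For reference, the timing harness looks roughly like this; the helper name is illustrative:)

    #include <sys/time.h>
    #include <cstdio>

    // Wall-clock time in seconds, as measured by gettimeofday.
    double wall_seconds() {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec / 1000000.0;
    }

    // Usage around a parallel region:
    //   double t0 = wall_seconds();
    //   /* ... parallel loop ... */
    //   printf("loop took %f s\n", wall_seconds() - t0);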

– Bilthon

3 Answers


You are right that vector::resize() internally calls malloc. Implementation-wise, malloc is fairly complicated, and I can see multiple places where it can lead to contention in a multi-threaded environment.

  1. malloc probably keeps a global data structure in userspace to manage the user's heap address space. This global data structure would need to be protected against concurrent access and modification. Some allocators have optimizations to reduce the number of times this global data structure is accessed... I don't know how far Ubuntu's allocator has come along.

  2. malloc allocates address space. So when you actually begin to touch the allocated memory, you go through a "soft page fault", a page fault that lets the OS kernel allocate the backing RAM for the allocated address space. This can be expensive because of the trip to the kernel, and it requires the kernel to take some global locks to access its own global RAM-resource data structures.

  3. the userspace allocator probably keeps some allocated space from which to hand out new allocations. However, once that space runs out, the allocator needs to go back to the kernel and allocate more address space. This is also expensive: another trip to the kernel, with the kernel taking some global locks to access its global address-space management data structures.

Bottom line: these interactions can be fairly complicated. If you are running into these bottlenecks, I would suggest that you simply "pre-allocate" your memory: allocate it and then touch all of it (all from a single thread) so that you can use that memory later from all your threads without running into lock contention at the user or kernel level. A sketch follows.
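
For illustration, a minimal sketch of this pre-allocation idea, reusing the dimensions from the question (the fill expression is just a placeholder):

    #include <vector>
    using std::vector;

    int main() {
        int width = 11;
        int height = 39916800;
        vector<vector<int> > matrix(height);

        // Phase 1: allocate and touch everything from a single thread.
        // resize() value-initializes the ints, so it also writes each row's
        // pages, forcing the kernel to back them with RAM up front.
        for (int i = 0; i < height; i++)
            matrix[i].resize(width);

        // Phase 2: the worker threads neither call malloc nor take soft
        // page faults; they only write to already-backed memory.
        #pragma omp parallel for num_threads(8)
        for (int i = 0; i < height; i++)
            for (int j = 0; j < width; j++)
                matrix[i][j] = i + j;   // placeholder for the real fill
        return 0;
    }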

– ritesh
  • If you are running into bottlenecks, I would rather suggest the OP use an MT-aware/optimized malloc. Use ugly hacks like pre-allocation only as a last resort, since they are inherently incompatible with things like the STL: how would you get your vector to use this pre-allocated memory above? Basically, you'd have to re-write your application, or venture into dark corners like operator new(). – BeeOnRope Sep 22 '11 at 06:18
  • 1
    Do you have a recommendation for an MT-aware malloc, may be with a way to integrate it with C++ new? – ritesh Sep 22 '11 at 18:47
  • Yeah, the link at the end of my answer below is to another SO question which discusses the pros/cons of various MT allocators (with the general consensus being "they are all good, you should benchmark"). – BeeOnRope Sep 24 '11 at 02:43
  • Integrating them with your project is typically as easy as adding them to the link line on UNIX, or using something like LD_PRELOAD. On Windows it's more involved - Hoard uses "detours", for example, a way of rewriting function entry points at runtime. I can't tell what OS the OP is using. Each library generally has good documentation on how to use it as the default malloc. Once you replace malloc and friends, the replacement typically applies to C++ new as well, since implementations almost invariably use malloc (or friends) internally. – BeeOnRope Sep 24 '11 at 02:44

Memory allocators are definitely a possible contention point for multiple threads.

Fundamentally, the heap is a shared data structure, since it is possible to allocate memory on one thread and de-allocate it on another. Your example is close to that pattern: each worker thread's resize() allocates row storage for vector objects constructed on the main thread, and that memory will eventually be freed by whichever thread destroys the matrix.

Typical implementations of malloc shipped with gcc and other compilers use a shared global lock and work reasonably well across threads as long as memory-allocation pressure is relatively low. Above a certain allocation level, however, threads begin to serialize on the lock, you get excessive context switching and cache thrashing, and performance degrades. Your program is an example of something allocation-heavy, with an allocation in the inner loop of a parallel region.
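
As a rough illustration of that serialization (a hypothetical stress test, not code from the question), the following puts every thread on the allocator's shared state at once; under a global-lock malloc, its runtime tends to grow with the thread count rather than shrink:

    #include <cstdlib>

    int main() {
        // Every iteration allocates and immediately frees, so with a
        // global-lock malloc all 8 threads serialize on the same lock.
        #pragma omp parallel for num_threads(8)
        for (int i = 0; i < 10000000; i++) {
            char* p = (char*) std::malloc(64);
            p[0] = 0;      // touch the block so the pair isn't optimized away
            std::free(p);
        }
        return 0;
    }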

I'm surprised that an OpenMP-compatible compiler doesn't come with a better threaded malloc implementation. They certainly exist - take a look at this question for a list.

– BeeOnRope

Technically, the STL vector uses std::allocator, which eventually calls operator new. new, in turn, calls the libc's malloc (on your Linux system).

This malloc implementation is quite efficient as a general-purpose allocator and is thread-safe; however, it is not scalable (the GNU libc's malloc derives from Doug Lea's dlmalloc). There are numerous allocators and papers that improve upon dlmalloc to provide scalable allocation.

I would suggest that you take a look at Hoard from Dr. Emery Berger, tcmalloc from Google, and the Intel Threading Building Blocks scalable allocator.
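
As one concrete example, here is a sketch of plugging the TBB scalable allocator into the question's nested vectors via the allocator template parameter (this assumes TBB is installed; link with -ltbbmalloc):

    #include <vector>
    #include "tbb/scalable_allocator.h"

    // Both the rows and the outer vector now allocate through TBB's
    // scalable allocator instead of the global libc malloc.
    typedef std::vector<int, tbb::scalable_allocator<int> > Row;
    typedef std::vector<Row, tbb::scalable_allocator<Row> > Matrix;

    int main() {
        Matrix matrix;
        matrix.resize(39916800);   // outer allocation, as in the question
        // A parallel loop calling matrix[i].resize(11) would now hit the
        // scalable allocator on each thread rather than a global lock.
        return 0;
    }

Hoard and tcmalloc can instead be swapped in at link time (or via LD_PRELOAD) without touching the code.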

– ipapadop