
1) I tried searching for how memory is allocated when a program uses threads, but couldn't find the answer. The question What and where are the stack and heap? explains how the stack and heap work when a single program runs. But what happens in a program with threads?

2) Using an OpenMP parallel region creates threads, and the parallel code is executed concurrently in each thread. Does this occupy more memory than the same code executed sequentially?

Rakesh

1 Answer


In general, yes: [user-space] stacks are one per thread, whereas the heap is usually shared by all threads. See for example this Linux question. However, on some operating systems (OSes), on Windows in particular, even a single-threaded app may use more than one heap. Using OpenMP for threading doesn't change these basics, which are mostly dependent on the operating system. So unless you narrow your question to a specific OS, not much more can be said at this level of generality.
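
To make the per-thread-stack / shared-heap split concrete, here is a minimal OpenMP sketch of my own (not from any of the linked sources): every thread prints a different address for its own stack variable, but the same address for the shared heap allocation.

```c
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int *shared_buf;                 /* heap allocation, shared by all threads */

int main(void)
{
    shared_buf = malloc(4 * sizeof *shared_buf);

    #pragma omp parallel num_threads(4)
    {
        int tid = omp_get_thread_num();  /* lives on this thread's own stack */
        printf("thread %d: stack var at %p, heap buffer at %p\n",
               tid, (void *)&tid, (void *)shared_buf);
        /* &tid differs per thread (separate stacks); shared_buf prints the
           same address in every thread (one shared heap). */
    }

    free(shared_buf);
    return 0;
}
```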

Since I'm too lazy to draw this myself, here's the comparative illustration from PThreads Programming by Nichols et al. (1996): [figure: single-threaded vs. multi-threaded process memory layout, each thread with its own stack and registers, sharing the process heap, text, and data]

A somewhat more detailed (and alas potentially a bit more confusing) diagram is found in the free LLNL POSIX Threads Programming tutorial by B. Barney.

And yes, as you correctly suspected, running more threads does consume more stack memory. You can actually exhaust the virtual address space of a process with thread stacks alone if you create enough of them. Implementations of OpenMP provide a stack-size environment variable (OMP_STACKSIZE in the standard since OpenMP 3.0, plus vendor-specific variants) that controls how much stack OpenMP allocates for a thread.
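
To illustrate why that knob matters, here is a hedged sketch: each thread places a large array on its own stack, which can overflow a small default per-thread stack but runs once the OpenMP stack size is raised (e.g. OMP_STACKSIZE=16M in the environment; the default size is implementation- and system-dependent).

```c
#include <omp.h>
#include <stdio.h>

/* Each thread gets its own ~8 MB copy of 'big' on its own stack. With a
   small per-thread stack this overflows; raising the OpenMP stack size
   first (e.g. OMP_STACKSIZE=16M) makes room for it. */
static void work(void)
{
    double big[1000000];
    big[0] = (double)omp_get_thread_num();
    printf("thread %d: stack array fits\n", (int)big[0]);
}

int main(void)
{
    #pragma omp parallel
    work();
    return 0;
}
```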

Regarding Z boson's question/suggestion about thread-local storage (TLS): roughly (i.e. conceptually) speaking, TLS is a per-thread heap. It differs from the per-process heap in the API used to manipulate it, at the very least because each thread needs its own separate pointer to its own TLS, but basically you get a heap-like chunk of the process address space reserved for each thread. TLS is optional; you don't have to use it. OpenMP provides its own abstraction/directive for TLS-like persistent per-thread data, called THREADPRIVATE. An OpenMP implementation doesn't necessarily use the operating system's TLS support for THREADPRIVATE; however, there's a Linux-focused paper which says that such an implementation gave the best performance, at least in that environment.
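
For concreteness, a minimal THREADPRIVATE sketch (per the OpenMP rules, the per-thread values persist across parallel regions only if dynamic thread adjustment is off and the thread count stays the same):

```c
#include <omp.h>
#include <stdio.h>

int counter = 0;
#pragma omp threadprivate(counter)    /* one persistent copy per thread */

int main(void)
{
    omp_set_dynamic(0);   /* keep the thread count stable so values persist */

    #pragma omp parallel num_threads(4)
    counter = omp_get_thread_num();   /* each thread writes its own copy */

    #pragma omp parallel num_threads(4)
    printf("thread %d still sees counter = %d\n",
           omp_get_thread_num(), counter);

    return 0;
}
```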

And here is a subtlety (or why I said "roughly speaking" when I compared TLS to per-thread heaps): suppose you want a per-thread heap, say in order to reduce locking contention on the main heap. You don't actually have to store an entire per-thread heap in each thread's TLS. It suffices to store in each thread's TLS a different head pointer to a heap allocated in the shared per-process space. Identifying and automatically using per-thread heaps in a program (in order to reduce locking contention on the main heap) is a fairly difficult CS problem. Heap allocators which do this automatically are called scalable/parallel[izing] heap allocators or thereabouts. For example, Intel TBB provides one such allocator, and it can be used in your program even if you use nothing else from TBB. Although some people seem to believe Intel's TBB allocator contains black magic, it is in fact not really different from the basic idea described above: TLS points to a thread-local heap, which in turn is made of several doubly-linked lists segregated by block/object size, as the following diagrams from the Intel paper on TBB illustrate: [figure: TBB scalable allocator, per-thread free lists segregated by object size]
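
Here is a toy C11 sketch of that head-pointer idea. It assumes a single fixed block size (real scalable allocators segregate by size, as in the diagrams above), and my_alloc/my_free are made-up names, not any real allocator's API:

```c
#include <stdlib.h>

enum { BLOCK_SIZE = 64 };            /* toy assumption: one block size only */

struct block { struct block *next; };

/* Only the head pointer lives in TLS; the blocks themselves sit in the
   ordinary shared process heap. */
static _Thread_local struct block *free_list;

void *my_alloc(void)
{
    struct block *b = free_list;
    if (b) {                    /* reuse a block this thread freed earlier; */
        free_list = b->next;    /* no lock needed: the list is thread-local */
        return b;
    }
    return malloc(BLOCK_SIZE);  /* otherwise fall back to the shared heap */
}

void my_free(void *p)
{
    struct block *b = p;        /* push onto this thread's own free list */
    b->next = free_list;
    free_list = b;
}
```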

IBM has something rather similar for AIX 7.1, but a bit more complex. You can tell its (default) allocator to use a fixed number of heaps for multi-threaded applications, e.g. MALLOCOPTIONS=multiheap:3. AIX 7.1 also has another option (which can be combined with multiheap), MALLOCOPTIONS=threadcache, which appears somewhat similar to what Intel TBB does, in that it keeps a per-thread cache of deallocated regions from which future allocation requests can be serviced with less global heap contention. Besides those options for the default allocator, AIX 7.1 also has a (non-default) "Watson2" allocator which "uses a thread-specific mechanism that uses a varying number of heap structures, which depend on the behavior of the program. Therefore no configuration options are required." (But you do need to select this allocator explicitly with MALLOCTYPE=Watson2.) Watson2's operation sounds even closer to what the Intel TBB allocator does.

The two examples detailed above (Intel TBB and AIX) are just meant as concrete examples; they shouldn't be understood as holding some exclusive secret sauce. The idea of a per-thread or per-CPU heap cache/arena/magazine is fairly widespread. The BSDCan jemalloc paper cites a 1998 MS Research paper as the first to have systematically evaluated arenas for this purpose. That MS paper in turn cites the ptmalloc web page as "visited on May 11, 1998" and summarizes ptmalloc's workings as follows: "It uses a linked list of subheaps where each subheap has a lock, 128 free lists, and some memory to manage. When a thread needs to allocate a block, it scans the list of subheaps and grabs the first unlocked one, allocates the required block, and returns. If it can't find an unlocked subheap, it creates a new one and adds it to the list. In this way, a thread never waits on a locked subheap."
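
A rough pthreads sketch of the subheap scan described in that quote might look like the following; the subheap's internal bookkeeping is reduced to a plain malloc call (which a real allocator obviously couldn't use), so treat it as an illustration of the locking strategy, not ptmalloc's actual code:

```c
#include <pthread.h>
#include <stdlib.h>

struct subheap {
    pthread_mutex_t  lock;
    struct subheap  *next;
    /* a real subheap would hold its free lists and managed memory here */
};

static struct subheap *subheaps = NULL;                /* global arena list */
static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;

static struct subheap *new_subheap(void)
{
    struct subheap *h = malloc(sizeof *h);
    pthread_mutex_init(&h->lock, NULL);
    pthread_mutex_lock(&list_lock);        /* publish on the global list */
    h->next = subheaps;
    subheaps = h;
    pthread_mutex_unlock(&list_lock);
    return h;
}

void *arena_alloc(size_t sz)
{
    /* Scan the subheap list and grab the first one we can try-lock. */
    for (struct subheap *h = subheaps; h; h = h->next) {
        if (pthread_mutex_trylock(&h->lock) == 0) {
            void *p = malloc(sz);   /* stand-in for the subheap's own heap */
            pthread_mutex_unlock(&h->lock);
            return p;
        }
    }
    /* Every subheap was busy: create a fresh one so we never block. */
    struct subheap *h = new_subheap();
    pthread_mutex_lock(&h->lock);
    void *p = malloc(sz);
    pthread_mutex_unlock(&h->lock);
    return p;
}
```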

Fizz
  • @Respwned Fluff There would be a memory layout for a C program as in . Does each copy of the parallel region code for each thread also take some space within this process text region? – Rakesh Jan 21 '15 at 12:19
  • It depends on the application. For some applications the code executed by threads is the same; for other applications each thread can be very specialized, so there's no actual code sharing between them. But all threads (by definition) have access to the entire process address space, so all could in theory use/share the same code. This code sharing (or non-sharing) issue is orthogonal to stacks, by the way. – Fizz Jan 21 '15 at 12:24
  • I didn't get the phrase "orthogonal to stacks". What does it mean? – Rakesh Jan 21 '15 at 12:33
  • It means that you have a stack per thread regardless of whether the threads run the same code or not. Note that the text segment is shared and visible to all threads, but that doesn't mean that in practice they'll all run in the same region of code. I've added a diagram to the answer, and I'll add a link to another source (with more explanations) shortly. – Fizz Jan 21 '15 at 12:53
  • It is worth mentioning that the memory allocator in GLIBC (ptmalloc2, if I recall correctly) has had support for per-thread arenas for a very long time, and some server distributions, e.g. RHEL, ship with it enabled, while others (mostly desktop distributions) don't, since it increases the memory footprint of applications. ptmalloc2 uses one arena per CPU core and is mostly lock-free. – Hristo Iliev Jan 22 '15 at 09:31
  • What's more intriguing to me is [why lock-free mallocs haven't made much of a practical impact](http://programmers.stackexchange.com/questions/270798/why-dont-we-see-more-widespread-adoption-of-lock-free-dynamic-memory-allocato). – Fizz Jan 22 '15 at 10:20
  • @Hristo Iliev: after a bit of digging it does appear that ptmalloc was the first (or at least among the first few) allocators that used multiple heaps/arenas for improved multi-threaded performance. I've updated the post with what I found. Thanks for bringing it up. My only worry with adding the info that you and Z boson suggested is that it will be overwhelming for the OP (Rakesh)... – Fizz Jan 22 '15 at 16:56