
My application is highly performance critical and requests 3-5 million objects per frame. Initially, to get the ball rolling, I new'd everything just to get the application working and to test my algorithms. The application is multi-threaded.

Once I was happy with the performance, I started to create a memory manager for my objects. The obvious reasons are memory fragmentation and wastage: the application could not continue for more than a few frames before crashing due to memory fragmentation. I have checked for memory leaks and know the application is leak-free.

So I started creating a simple memory manager using TBB's concurrent_queue. The queue stores the maximum set of elements the application is allowed to use. The class requiring new elements pops elements from the queue. According to Intel's documentation, the try_pop method is lock-free. This worked quite well as far as memory consumption goes (there is still memory fragmentation, but not nearly as much as before). The problem I am facing now is that the application's performance has slowed down by approximately 4x according to my own simple profiler (I do not have access to commercial profilers or know of any that will work on a real-time application... any recommendation would be appreciated).
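For reference, my pool is roughly the following shape (a simplified sketch, not my exact code; QueuePool, acquire and release are illustrative names):

#include <cstddef>
#include <vector>
#include <tbb/concurrent_queue.h>

// Illustrative sketch only: a fixed-capacity pool that hands out pointers
// to pre-allocated objects through TBB's lock-free queue.
template<typename T>
class QueuePool
{
    std::vector<T> m_storage;          // all objects allocated up front
    tbb::concurrent_queue<T*> m_free;  // pointers to currently unused objects

public:
    explicit QueuePool(size_t count) : m_storage(count)
    {
        for (size_t i = 0; i < count; ++i)
            m_free.push(&m_storage[i]);
    }

    T* acquire()        // returns 0 when the pool is exhausted
    {
        T* p = 0;
        return m_free.try_pop(p) ? p : 0;
    }

    void release(T* p)  // recycle an element back into the pool
    {
        m_free.push(p);
    }
};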

My question is: is there a scalable, thread-safe memory pool? A must-have feature of the pool is fast recycling of elements and making them available again. If there is none, any performance tips/tricks would be appreciated.

EDIT: I thought I would explain the problem a bit more. I could easily initialize n arrays, where n is the number of threads, and have each thread use objects from its own array. This works perfectly for some cases. In my case, I am recycling the elements as well (potentially every frame) and they could be recycled at any point in the array; i.e. the recycled element may be elementArray[0] or elementArray[10] or elementArray[1000]. I then end up with a fragmented array consisting of elements that are ready to be used and elements that are in use :(

Samaursa
  • Some thread-safe memory allocators achieve high performance by using separate pools for each thread. – jdigital Apr 20 '11 at 22:28
  • as @jdigital + different memory sizes – fazo Apr 20 '11 at 22:32
  • Is your problem "highly performance critical" or "real-time"? Those are two very different things. A real-time application that needs to process 3-5 million objects per second is 100% performant if it can process 5.1 million objects in a second: enough is enough, the deadline is met, and "better-than-realtime" doesn't count. A performance-critical app, on the other hand, benefits from every extra % of performance. The distinction is critical for memory manager design. – MSalters Apr 21 '11 at 09:40
  • While a full-featured scalable allocator may be a bit overkill for you, you may still find in the end that using one is simple and gives you results good enough for your purposes. See also http://stackoverflow.com/questions/2514278/scalable-memory-allocator-experiences – Suma Apr 21 '11 at 09:54
  • @MSalters: Well, it is real-time and, for now, must process 5 million objects per frame in real time. That is my current goal; my final goal is to allow a little more than 16 million, a number based on a grid. Finally, the need for a memory manager is that fragmentation is currently a big problem: the program crashes after just a few frames of processing due to the huge amount of fragmentation. – Samaursa Apr 22 '11 at 02:58
  • Fragmentation depends quite a bit on object sizes. Do you have a rough idea how the object sizes are distributed? Ideally, of course, all objects have the same size, but then you wouldn't have seen fragmentation at all. If you know that the distribution of the size of those 5 million objects is constant over time, your memory manager design can explicitly account for that - don't coalesce blocks you're going to split anyway. – MSalters Apr 22 '11 at 09:13

3 Answers


As said in the comments, rather than getting a thread-safe memory allocator, allocate memory per-thread.

As you implied in your update, you need to manage free/in-use effectively. That is a pretty straightforward problem, given a constant type and no concurrency.

For example (off the top of my head, untested):

#include <cassert>
#include <cstddef>
#include <vector>

template<typename T>
class ThreadStorage
{
    std::vector<T> m_objs;        // fixed block of objects, allocated once
    std::vector<size_t> m_avail;  // indices into m_objs that are free

public:
    explicit ThreadStorage(size_t count) : m_objs(count, T()) {
        m_avail.reserve(count);
        for (size_t i = 0; i < count; ++i) m_avail.push_back(i);
    }

    T* alloc() {
        assert(!m_avail.empty()); // caller must not exceed the pool size
        T* retval = &m_objs[0] + m_avail.back();
        m_avail.pop_back();
        return retval;
    }

    void free(T* p) {
        *p = T(); // Assuming this is enough destruction.
        m_avail.push_back(p - &m_objs[0]);
    }
};

Then, for each thread, have a ThreadStorage instance, and call alloc() and free() as required.

You can add smart pointers to manage calling free() for you, and you can optimise constructor/destructor calling if that's expensive.
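For example, a boost::shared_ptr with a custom deleter can return objects to the pool automatically; make_pooled below is a hypothetical helper, untested like the rest:

#include <boost/bind.hpp>
#include <boost/shared_ptr.hpp>

// Hypothetical helper: the returned shared_ptr calls storage.free(p)
// instead of delete when the last reference goes away.
template<typename T>
boost::shared_ptr<T> make_pooled(ThreadStorage<T>& storage)
{
    return boost::shared_ptr<T>(
        storage.alloc(),
        boost::bind(&ThreadStorage<T>::free, &storage, _1));
}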

You can also look at boost::pool.

Update:

The new requirement of keeping track of objects that have been used, so that they can be processed in a second pass, seems a bit unclear to me. I think you mean that when the primary processing is finished on an object, you need to not release it, but keep a reference to it for second-stage processing. Other objects will just be released back to the pool and not used for second-stage processing.

I assume you want to do this in the same thread.

As a first pass, you could add a method like this to ThreadStorage, and call it when you want to do processing on all unreleased instances of T. No extra bookkeeping required.

// Requires <algorithm> and <boost/function.hpp>.
// Calls f on every object whose index is *not* in m_avail,
// i.e. every object currently in use, in index order.
void do_processing(boost::function<void (T* p)> const& f) {
    std::sort(m_avail.begin(), m_avail.end());

    size_t o = 0;
    for (size_t i = 0; i != m_avail.size(); ++i) {
        if (o < m_avail[i]) {
            // Process the run of in-use objects before the next free index.
            do {
                f(&m_objs[o]);
            } while (++o < m_avail[i]);
            ++o; // skip the free index itself
        } else if (o == m_avail[i]) {
            ++o; // skip a free index with no in-use run before it
        }
    }

    // Process any in-use objects after the last free index.
    for (; o < m_objs.size(); ++o) f(&m_objs[o]);
}

This assumes no other thread is using the ThreadStorage instance, which is reasonable because it is thread-local by design. Again, off the top of my head, untested.
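For illustration, a hypothetical call site (Particle and second_pass are invented names):

struct Particle { float x, y, z; };

void second_pass(Particle* p)
{
    // ... second-stage work on one in-use object ...
}

// Somewhere in the owning thread, at the end of a frame:
void end_of_frame(ThreadStorage<Particle>& storage)
{
    storage.do_processing(&second_pass); // visits every unreleased Particle
}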

janm
  • One problem is that I am using TBB::tasks and it is integrated quite deeply. I am not sure how I will manage the pool per (unique) task. Do you know (by any chance) how I can tackle this? – Samaursa Apr 21 '11 at 00:57
  • I don't use TBB, but it uses operating system threads, which allows you to use thread-local storage. So, I'd start with something like boost::thread_specific_ptr and go from there (a minimal sketch of that wiring appears after these comments). Of course you need to manage lifetime consistently with how you manage your thread lifetime and workload. – janm Apr 21 '11 at 01:16
  • This actually worked quite nicely, but I ran into a snag which I had encountered before as well, but had solved with a `hash_map`. Unfortunately, that solution is too slow. The problem is that I need to keep track of used elements as well (and any point they may become _unused_) which I will iterate through later on. If I use a `hash_map`, as mentioned, the frame time increases 1.5x. With a vector container, the frame time is very close to that with raw arrays, but now whenever I recycle an element, I have to do a costly iteration. Damned if I do, damned if I don't :O – Samaursa Apr 21 '11 at 05:22
  • I will select this as the correct answer because it helped me a lot with coming up with a solution that works (although still not ideal but very close to my goals). The sorting was too slow, so I actually opted to create another array which is sparsely filled with used elements. Every time an element is recycled, since I know the index, I access the sparse array with that index, and set that element to NULL. In the end, I traverse the sparse array in parallel and skip over, quite quickly, the NULL elements. Thanks janm for your awesome solution! – Samaursa Apr 22 '11 at 07:37
  • @Nik-Lz No, I mean just create instance(s) of ThreadStorage in the per-thread code. No need for `thread_local`. Of course allocation/deallocation is necessary; the assumption here is that it all happens on the same thread. This answer is about managing the storage; using a class to help manage calls to `alloc()` and `free()` is obviously good. – janm Sep 19 '18 at 07:53
  • @Nik-Lz You can have different threads executing the same function on different data. That's the point of this approach; threads have different auto variables. `thread_local` is for solving a different type of problem. – janm Sep 19 '18 at 14:52
  • @Nik-Lz Threads acting on the same data require caution regardless of whether or not it is the same function. In your example, if they are acting on different parts of the vector but not changing the vector itself, synchronisation is not required. When the same function is running on different threads, each has its own scope for `auto` variables, so there is no need for `thread_local` inside `myFunc()`. For example, two threads running a function `f() { MemAllocator ma; o = ma.alloc(); .... ; ma.free(o) }` each have a different instance of `ma` and require no special care. – janm Oct 02 '18 at 06:45
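A minimal sketch of the boost::thread_specific_ptr wiring mentioned in the comments above (Particle and POOL_SIZE are placeholder names, untested):

#include <cstddef>
#include <boost/thread/tss.hpp>

struct Particle { float x, y, z; };  // placeholder payload type

const size_t POOL_SIZE = 1000000;    // placeholder capacity

// Each thread lazily creates its own ThreadStorage on first use.
boost::thread_specific_ptr<ThreadStorage<Particle> > g_storage;

ThreadStorage<Particle>& local_storage()
{
    if (g_storage.get() == 0)
        g_storage.reset(new ThreadStorage<Particle>(POOL_SIZE));
    return *g_storage;
}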

Google's TCMalloc:

TCMalloc assigns each thread a thread-local cache. Small allocations are satisfied from the thread-local cache. Objects are moved from central data structures into a thread-local cache as needed, and periodic garbage collections are used to migrate memory back from a thread-local cache into the central data structures.

Performance:

TCMalloc is faster than the glibc 2.3 malloc... ptmalloc2 takes approximately 300 nanoseconds to execute a malloc/free pair on a 2.8 GHz P4 (for small objects). The TCMalloc implementation takes approximately 50 nanoseconds for the same operation pair...

James Crook

You may want to have a look at jemalloc.

CAFxX