
I have a very large array of ~30M objects, approximately 80 bytes apiece – that's ~2.2 GB, for those following along – stored on disk. The actual size of each object varies slightly because each one has a QMap<quint32, QVariant> child.

Unpacking those objects from raw data is expensive, so I've implemented a multithreaded read operation that pulls a few MB from disk sequentially and then passes each raw data block to a thread to be unpacked in parallel via QtConcurrent. My objects are created (via new) on the heap inside the worker threads and then passed back to the main thread for the next step. Upon completion, these objects are deleted on the main thread.

In a single-threaded environment, this deallocation is relatively fast (~4-5 seconds). However, when multithreaded on 4 threads, this deallocation is incredibly slow (~26-36 seconds). Profiling this with Very Sleepy indicates that the slowdown is in MSVCR100's free, so it's the deallocation itself that is slow.

Searching around SO suggests that allocating and deallocating on different threads is safe. What is the source of the slowdown, and what can I do about it?

Edit: Here is some sample code communicating the idea of what's going on. For the sake of troubleshooting, I have completely removed the disk I/O from this example and simply create the objects and then delete them.

class MyObject
{
public:
    MyObject() { /* set defaults... irrelevant here */}
    ~MyObject() {}
    QMap<quint32, QVariant> map;
    //...other members
};

//...

QList<MyObject*> results;

/* set up the mapped lambda functor (QtConcurrent reqs std::function if returning) */
std::function<QList<MyObject*>(quint64 chunksize)>
        importMap = [](quint64 chunksize) -> QList<MyObject*>
{
    QList<MyObject*> objs;
    for(quint64 i = 0; i < chunksize; ++i)
    {
        MyObject* obj = new MyObject();
        obj->map.insert(0, 1);      //ran with and without the map insertions
        obj->map.insert(1, 2);
        objs.append(obj);
    }
    return objs;
}; //end import map lambda

/* set up the reduce lambda functor */
auto importReduce = [&results](bool& /*noreturn*/, const QList<MyObject*>& chunkimported)
{
    results.append(chunkimported);
}; //end import reduce lambda

/* chunk up the data for import */
quint64 totalcount = 31833986;
quint64 chunksize = 500000;
QList<quint64> chunklist;
while(totalcount >= chunksize)
{
    totalcount -= chunksize;
    chunklist.append(chunksize);
}
if(totalcount > 0)
    chunklist.append(totalcount);

/* create the objects concurrently */
QThreadPool::globalInstance()->setMaxThreadCount(1);    //4 for multithreaded run
QElapsedTimer tnew; tnew.start();
QtConcurrent::mappedReduced<bool>(chunklist, importMap, importReduce, QtConcurrent::OrderedReduce | QtConcurrent::SequentialReduce).waitForFinished(); //block so the timer below measures the full run
qDebug("DONE NEW %f", double(tnew.elapsed())/1000.0);

//do stuff with the objects here

/* delete the objects */
QElapsedTimer tdelete; tdelete.start();
qDeleteAll(results);
qDebug("DONE DELETE %f", double(tdelete.elapsed())/1000.0);

Here are the results with and without inserting data into MyObject::map, and with 1 or 4 threads available to QtConcurrent:

  • 1 Thread: tnew = 2.7 seconds; tdelete = 1.1 seconds
  • 4 Threads: tnew = 1.8 seconds; tdelete = 2.7 seconds
  • 1 Thread + QMap: tnew = 8.6 seconds; tdelete = 4.6 seconds
  • 4 Threads + QMap: tnew = 4.0 seconds; tdelete = 48.1 seconds

In both scenarios it takes significantly longer to delete the objects when they were created in parallel on 4 threads vs. serially on 1 thread, and the gap is further exacerbated by inserting into the QMap in parallel.

Phlucious
  • It being safe doesn't mean it'll be fast – the opposite, really. The easiest solution would be to try using a compiler that's not six years old. Otherwise, using one arena allocator per thread seems like it should be the most efficient solution. – Voo Dec 20 '16 at 18:07
  • It could be because with one thread all the allocations are sequential, so the frees are as well. With the multithreaded allocations, they are more intermixed, so `free` needs to do more work to clean up after each deallocation. – 1201ProgramAlarm Dec 20 '16 at 18:10
  • Can you batch up the deallocation until all workers are finished and then perform the deallocation on a separate thread? – MikeMB Dec 20 '16 at 18:14
  • Do I understand you correctly that the deallocation currently happens simultaneously with the creation of new objects in the worker threads? – MikeMB Dec 20 '16 at 18:17
  • @MikeMB The deallocation happens after the creation of objects. Here's the workflow in a nutshell, with each step happening after the previous completed: (1) 1 thread reads raw data chunks from disk (2) 4 threads unpack the data into a vector of objects (3) 1-4 threads do stuff with the vector of objects (4) main thread deallocates everything. – Phlucious Dec 20 '16 at 19:20
  • The standard allocator is thread-safe, so each allocation causes a lock/unlock and can't run simultaneously on different threads. I propose you use custom allocators. – Dmitry Sazonov Dec 20 '16 at 21:48
  • Once I did the following: 1) create another process that does all the necessary work and stores the result in a shared memory block; 2) destroy the worker process for fast deallocation; 3) use the shared memory in the main process. – Dmitry Sazonov Dec 20 '16 at 21:50
  • @Dmitry: apparently the slow part is the deallocation, which doesn't happen in parallel. – MikeMB Dec 20 '16 at 22:55
  • @1201ProgramAlarm You're exactly right. To test this I called `std::random_shuffle(results.begin(), results.end())` before deallocation, and the deletion took a whopping 73 seconds on average, even worse than the multithreaded example. Turn your comment into an answer and I'll accept it. – Phlucious Dec 21 '16 at 22:07

5 Answers


This is mostly speculation, but I presume the OS memory manager is system-wide; after all, it services one pool of virtual memory, so throwing more threads at it will not improve speed, it will just choke it with overhead. Thread safety coupled with concurrent access always comes with a penalty, and the more threads you throw at it, the more penalty you get.

30M allocations is quite a lot regardless of the size of each one, and it also represents a significant overhead in memory consumption. I'd recommend you invest the time to implement memory pools: preallocate monolithic chunks of memory and use placement new to construct objects inside those pools. This will be a tremendous CPU time saver and a significant memory saver as well. It will also improve cache friendliness and cache hit rates by reducing fragmentation.
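
For illustration, a minimal sketch of that idea (my own naming, not the asker's code): one big allocation backs all the objects, placement new constructs them in place, and teardown is destructor calls plus a single operator delete.

#include <cassert>
#include <cstddef>
#include <new>

/* hypothetical fixed-size pool: one monolithic block backs many objects */
template <typename T>
class ObjectPool
{
public:
    explicit ObjectPool(std::size_t capacity)
        : m_storage(static_cast<char*>(::operator new(capacity * sizeof(T)))),
          m_capacity(capacity), m_count(0) {}

    ~ObjectPool()
    {
        for(std::size_t i = 0; i < m_count; ++i)
            ptr(i)->~T();               //run the destructors...
        ::operator delete(m_storage);   //...then free ONE block, not millions
    }

    T* create()                         //placement new into the pool
    {
        assert(m_count < m_capacity);
        return new (ptr(m_count++)) T();
    }

private:
    T* ptr(std::size_t i) { return reinterpret_cast<T*>(m_storage + i * sizeof(T)); }

    char*       m_storage;
    std::size_t m_capacity;
    std::size_t m_count;
};

Usage would be ObjectPool<MyObject> pool(chunksize); MyObject* obj = pool.create();, with the entire chunk released at once when the pool is destroyed. Note that a QMap member would still allocate its nodes from the global heap, so a pool like this only covers the object shells, not the map contents.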

To put it as a metaphor, putting 4 cooks at a single stove won't make cooking 4 times faster; it will make each cook at least 4 times slower, plus the time they waste fighting over the shared resource. That's pretty much what you are seeing in practice.

dtech
  • So the fact that the objects occupy non-sequential memory blocks could be making the heap deallocator "jump around" because they were allocated in a multithreaded environment? I don't understand the heap well enough to follow what, exactly, `placement new` is doing. – Phlucious Dec 20 '16 at 19:23
  • The OS memory manager has nothing to do with anything, because it's only used to allocate fairly large chunks of memory that are then subdivided by the memory allocator proper. The latter is in the runtime of the programming environment you're using - usually the C runtime. – Kuba hasn't forgotten Monica Dec 20 '16 at 19:24
  • @Phlucious - only computations can benefit from multithreading, so using more threads for memory management can only result in a significant performance hit as different threads fight for synchronization. Placement new or custom allocators together with memory pools can reduce the number of allocations significantly, since you place multiple objects into one big chunk of memory. This results in significant savings in both CPU time and memory usage. On top of that it might also be more cache-friendly, since data is less fragmented. – dtech Dec 20 '16 at 22:25
  • The interesting thing is that the slowdown occurs on deallocation, when only a single thread is running anymore. – MikeMB Dec 20 '16 at 22:51
  • @ddriver I edited my post to add some simplistic sample code and performance metrics, if that helps put it into context. Does this look like what you described? – Phlucious Dec 21 '16 at 16:53
  • @Phlucious - I have yet to find the time to investigate the matter, and being quite busy, I am not sure when I will find the time to dig into this. It may be the runtime, it may be QtConcurrent, it may be QMap, some wacky combination, or who knows what. There is definitely something fishy. For the time being I will suggest the obvious - don't use multiple threads ;) Also, and again: memory pools and custom allocators. – dtech Dec 22 '16 at 03:53

(updating comment to answer)

It could be because with one thread all the allocations are sequential, so the frees are as well. With the multithreaded allocations, they are more intermixed, so free needs to do more work to clean up after each deallocation.
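
The effect can be reproduced without any threads at all. Reusing `results` from the question's sample code, randomizing the pointer order before deletion (essentially the test the asker confirmed in the comments) forces free to walk the heap in a scattered pattern:

/* requires <algorithm>; reuses 'results' from the question's sample code */
std::random_shuffle(results.begin(), results.end());

QElapsedTimer tdelete; tdelete.start();
qDeleteAll(results);    //shuffled order mimics the intermixed heap layout
                        //produced by multithreaded allocation
qDebug("DONE DELETE %f", double(tdelete.elapsed())/1000.0);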

1201ProgramAlarm
  • Confirmed. Randomizing the order of the `MyObject` pointers before deallocation resulted in a massive slowdown, even if the objects were created in a single thread. This is exacerbated significantly by multithreaded QMap insertions, because QMap creates its nodes on the heap, further fragmenting the objects, at least on my compiler and Qt 4.8. – Phlucious Dec 22 '16 at 16:26

When multiple threads allocate from a single memory pool, you create a bottleneck during deallocation, because the units being deleted sequentially are nonadjacent in memory.

If you are using fixed-size allocations, you should be able to leverage that into O(1) performance in your allocator/deallocator. A unit allocation system that places a bunch of blocks of the same size into a free list and then pushes/pops them as needed is something you should look into.
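
A minimal sketch of such a unit allocator (illustrative naming on my part; the 80-byte block size comes from the question):

#include <cstddef>

/* hypothetical O(1) unit allocator: fixed-size blocks chained into a
   free list; assumes count >= 1 */
class UnitAllocator
{
    union Block { Block* next; char payload[80]; }; //block >= object size

public:
    explicit UnitAllocator(std::size_t count)
        : m_blocks(new Block[count]), m_free(m_blocks)
    {
        for(std::size_t i = 0; i + 1 < count; ++i)  //chain the free list
            m_blocks[i].next = &m_blocks[i + 1];
        m_blocks[count - 1].next = 0;
    }
    ~UnitAllocator() { delete[] m_blocks; }

    void* alloc()               //O(1): pop the head of the free list
    {
        if(!m_free) return 0;   //pool exhausted; caller must handle this
        Block* b = m_free;
        m_free = b->next;
        return b;
    }

    void free(void* p)          //O(1): push the block back onto the list
    {
        Block* b = static_cast<Block*>(p);
        b->next = m_free;
        m_free = b;
    }

private:
    Block* m_blocks;
    Block* m_free;
};

Objects would be constructed into the returned blocks with placement new and torn down manually (obj->~MyObject(); pool.free(obj);), so the global heap is never touched after the pool is built.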

Michael Dorgan
  • I'm unclear on the meaning of "unit allocation system". Are you talking about managing the new/delete operations with a factory object that pre-allocates the objects and then passes them to the worker threads? – Phlucious Dec 20 '16 at 19:07
  • I'm talking about a memory allocation system that takes a memory pool, stores a bunch of pointers into it, spaced by the requested size, and places those pointers into a stack. When an alloc is requested, pop a pointer and return it. When a free happens, push the pointer back onto the stack. O(1) alloc and free. The cost is that your block size must be equal to or larger than the requested memory size, so you could be introducing internal fragmentation if it is used for anything besides the "80 byte" structures you mention. – Michael Dorgan Dec 21 '16 at 00:18
  • I've used allocators like this for many games over the years and the resulting speed up is startling. – Michael Dorgan Dec 21 '16 at 00:20
  • When you implement this, are you pre-allocating N objects on the main thread and then passing pointers to those objects, or are you allocating a "blank" memory pool with enough space for N objects (N*80 in my case) and using placement new? – Phlucious Dec 21 '16 at 17:00
  • Either could work. In my case, I was integrating this into our main memory allocator (replacing new/delete/malloc/free/etc.) and putting code in there that said "if the size is less than N and there are available unit blocks, use the unit allocator; else use whatever you would normally." In this way, I believe it is closer to your second case, though because I replace the allocators, I do not need to use placement new. On Windows, the placement new or pointer method might be easier though :) – Michael Dorgan Dec 21 '16 at 18:46

Memory allocation and freeing are known to be slow, because access to the heap is serialized. This serialization makes new and free thread-safe, but it also slows things down considerably.

It is common practice to pre-allocate a large block of memory if each piece is a fixed size.

Another way is to use memory-mapped files to bypass heap allocation. Qt has a memory-mapped file facility (QFile::map) that works on all platforms. You can try the approach in How do you serialize a QMap?
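
A minimal sketch of reading through a mapped file with Qt (the file name here is a placeholder):

#include <QFile>

QFile file("data.bin");     //placeholder path
if(file.open(QIODevice::ReadOnly))
{
    //map the whole file into the address space; reading through 'data'
    //requires no per-object heap allocation
    uchar* data = file.map(0, file.size());
    if(data)
    {
        //...parse records directly out of 'data'...
        file.unmap(data);
    }
}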

m. c.

I'd be tempted to allocate a relatively large block of memory for each thread and, within that thread, use it as if it were a stack (or as a circular buffer). This works nicely if you are always putting new objects in at one end and deleting them from the other, or if you can deallocate a group of objects in one step (as happens with a stack when a function call returns). Otherwise you need the new and delete functionality that you get from a heap, which, as you've discovered, can be a major performance bottleneck in some cases.
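
A per-thread bump allocator is the simplest form of that idea (a sketch with my own naming): allocation is just a pointer bump, and the whole region is released in one step when the arena goes away.

#include <cstddef>
#include <new>

/* hypothetical per-thread arena: alignment handling omitted for brevity */
class Arena
{
public:
    explicit Arena(std::size_t bytes)
        : m_begin(static_cast<char*>(::operator new(bytes))),
          m_cur(m_begin), m_end(m_begin + bytes) {}

    ~Arena() { ::operator delete(m_begin); }    //bulk free: one call, O(1)

    void* alloc(std::size_t size)               //just bump a pointer
    {
        if(m_cur + size > m_end) return 0;      //arena exhausted
        void* p = m_cur;
        m_cur += size;
        return p;
    }

private:
    char* m_begin;
    char* m_cur;
    char* m_end;
};

Objects are constructed into the returned memory with placement new; there is no per-object free at all, which is exactly what sidesteps the slow teardown.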

Edit: I do think we are missing the point, though. It doesn't really make sense that your deletes at the end are so slow. If I understand the code correctly, at that point you've only got the main thread running?

  • When I track the program with Very Sleepy, it looks like the QtConcurrent call spawns four threads, which do their work and then go dormant until the parent (main) thread dies. Does it matter that they still exist? – Phlucious Dec 21 '16 at 16:49
  • It shouldn't, unless they were somehow still holding mutexes that are used internally by new and delete. Another possibility is that you've got some sort of heap corruption that is being caused by the usage of QMap, and it's the heap corruption that is causing the deletion to take much longer than it normally would. I'm clutching at straws here though. – Stuart Fisher Dec 22 '16 at 17:57