
I have a very large array of ~30M objects, approximately 80 bytes apiece – that's ~2.2 GB, for those following along – stored on disk. The actual size of each object varies slightly because each one has a QMap<quint32, QVariant> child.

Unpacking those objects from raw data is expensive, so I've implemented a multithreaded read operation that pulls a few MB from disk sequentially and then passes each raw data block to a thread to be unpacked in parallel via QtConcurrent. My objects are created (via new) on the heap inside the worker threads and then passed back to the main thread for the next step. Upon completion, these objects are deleted on the main thread.

In a single-threaded environment, this deallocation is relatively fast (~4-5 seconds). However, when multithreaded on 4 threads, this deallocation is incredibly slow (~26-36 seconds). Profiling this with Very Sleepy indicates that the slowdown is in MSVCR100's free, so it's the deallocation itself that is slow.

Searching around SO suggests that allocating and deallocating on different threads is safe. What is the source of the slowdown, and what can I do about it?

Edit: Here is some sample code communicating the idea of what's going on. For the sake of troubleshooting, I have completely removed the disk I/O from this example and simply create the objects and then delete them.

class MyObject
{
public:
    MyObject() { /* set defaults... irrelevant here */}
    ~MyObject() {}
    QMap<quint32, QVariant> map;
    //...other members
};

//...

QList<MyObject*> results;

/* set up the mapped lambda functor (QtConcurrent reqs std::function if returning) */
std::function<QList<MyObject*>(quint64 chunksize)>
        importMap = [](quint64 chunksize) -> QList<MyObject*>
{
    QList<MyObject*> objs;
    for(quint64 i = 0; i < chunksize; ++i)
    {
        MyObject* obj = new MyObject();
        obj->map.insert(0, 1);      //ran with and without the map insertions
        obj->map.insert(1, 2);
        objs.append(obj);
    }
    return objs;
}; //end import map lambda

/* set up the reduce lambda functor */
auto importReduce = [&results](bool& /*noreturn*/, const QList<MyObject*>& chunkimported)
{
    results.append(chunkimported);
}; //end import reduce lambda

/* chunk up the data for import */
quint64 totalcount = 31833986;
quint64 chunksize = 500000;
QList<quint64> chunklist;
while(totalcount >= chunksize)
{
    totalcount -= chunksize;
    chunklist.append(chunksize);
}
if(totalcount > 0)
    chunklist.append(totalcount);

/* create the objects concurrently */
QThreadPool::globalInstance()->setMaxThreadCount(1);    //4 for multithreaded run
QElapsedTimer tnew; tnew.start();
QtConcurrent::mappedReduced<bool>(chunklist, importMap, importReduce, QtConcurrent::OrderedReduce | QtConcurrent::SequentialReduce).waitForFinished(); //block so the timer below measures the full run
qDebug("DONE NEW %f", double(tnew.elapsed())/1000.0);

//do stuff with the objects here

/* delete the objects */
QElapsedTimer tdelete; tdelete.start();
qDeleteAll(results);
qDebug("DONE DELETE %f", double(tdelete.elapsed())/1000.0);

Here are the results with and without inserting data into MyObject::map, and with 1 or 4 threads available to QtConcurrent:

  • 1 Thread: tnew = 2.7 seconds; tdelete = 1.1 seconds
  • 4 Threads: tnew = 1.8 seconds; tdelete = 2.7 seconds
  • 1 Thread + QMap: tnew = 8.6 seconds; tdelete = 4.6 seconds
  • 4 Threads + QMap: tnew = 4.0 seconds; tdelete = 48.1 seconds

In both scenarios it takes significantly longer to delete the objects when they were created in parallel on 4 threads vs. serially on 1 thread, and the gap is further exacerbated by inserting into the QMap in parallel.

Phlucious
  • It being safe doesn't mean it'll be fast – the opposite, really. The easiest solution would be to try using a compiler that's not six years old. Otherwise, using one arena allocator per thread seems like it should be the most efficient solution. – Voo Dec 20 '16 at 18:07
  • It could be because with one thread all the allocations are sequential, so the frees are as well. With the multithreaded allocations, they are more intermixed, so `free` needs to do more work to clean up after each deallocation. – 1201ProgramAlarm Dec 20 '16 at 18:10
  • Can you batch up the deallocation until all workers are finished and then perform the deallocation on a separate thread? – MikeMB Dec 20 '16 at 18:14
  • Do I understand you correctly that the deallocation currently happens simultaneously with the creation of new objects in the worker threads? – MikeMB Dec 20 '16 at 18:17
  • @MikeMB The deallocation happens after the creation of objects. Here's the workflow in a nutshell, with each step happening after the previous completed: (1) 1 thread reads raw data chunks from disk (2) 4 threads unpack the data into a vector of objects (3) 1-4 threads do stuff with the vector of objects (4) main thread deallocates everything. – Phlucious Dec 20 '16 at 19:20
  • The standard allocator is thread-safe, so each allocation causes a lock/unlock and can't run simultaneously on different threads. I propose you use custom allocators. – Dmitry Sazonov Dec 20 '16 at 21:48
  • Once I did the following: 1) create another process that does all the necessary work and stores the result in a shared memory block; 2) destroy the worker process for fast deallocation; 3) use the shared memory in the main process. – Dmitry Sazonov Dec 20 '16 at 21:50
  • @Dmitry: apparently the slow part is the deallocation, which doesn't happen in parallel. – MikeMB Dec 20 '16 at 22:55
  • @1201ProgramAlarm You're exactly right. To test this I called `std::random_shuffle(results.begin(), results.end())` before deallocation, and the deletion took a whopping 73 seconds on average, even worse than the multithreaded example. Turn your comment into an answer and I'll accept it. – Phlucious Dec 21 '16 at 22:07

5 Answers


This is mostly speculation, but I presume the OS memory manager is system-wide; after all, it services one pool of virtual memory, so throwing more threads at it will not improve speed, it will just choke it with overhead. Thread safety coupled with concurrent access always comes with a penalty, and the more threads you throw at it, the more penalty you get.

30M allocations is quite a lot regardless of the size of each one, and it also represents a significant overhead in memory consumption. I'd recommend you invest the time to implement memory pools: preallocate monolithic chunks of memory and use placement new to construct objects inside those pools. This will be a tremendous CPU time saver and a significant memory saver as well. It will also improve cache friendliness and cache hit rates by reducing fragmentation.
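
For illustration, a minimal sketch of that idea (my own naming, not the asker's code): one big allocation backs all the objects, placement new constructs them in place, and teardown is destructor calls plus a single operator delete.

#include <cassert>
#include <cstddef>
#include <new>

/* hypothetical fixed-size pool: one monolithic block backs many objects */
template <typename T>
class ObjectPool
{
public:
    explicit ObjectPool(std::size_t capacity)
        : m_storage(static_cast<char*>(::operator new(capacity * sizeof(T)))),
          m_capacity(capacity), m_count(0) {}

    ~ObjectPool()
    {
        for(std::size_t i = 0; i < m_count; ++i)
            ptr(i)->~T();               //run the destructors...
        ::operator delete(m_storage);   //...then free ONE block, not millions
    }

    T* create()                         //placement new into the pool
    {
        assert(m_count < m_capacity);
        return new (ptr(m_count++)) T();
    }

private:
    T* ptr(std::size_t i) { return reinterpret_cast<T*>(m_storage + i * sizeof(T)); }

    char*       m_storage;
    std::size_t m_capacity;
    std::size_t m_count;
};

Usage would be ObjectPool<MyObject> pool(chunksize); MyObject* obj = pool.create();, with the entire chunk released at once when the pool is destroyed. Note that a QMap member would still allocate its nodes from the global heap, so a pool like this only covers the object shells, not the map contents.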

To put it as a metaphor, putting 4 cooks at a single stove won't make cooking 4 times faster; it will make each cook at least 4 times slower, plus the time they waste fighting over the shared resource. That's pretty much what you are seeing in practice.

dtech
  • So the fact that the objects occupy non-sequential memory blocks could be making the heap deallocator "jump around" because they were allocated in a multithreaded environment? I don't understand the heap well enough to follow what, exactly, `placement new` is doing. – Phlucious Dec 20 '16 at 19:23
  • The OS memory manager has nothing to do with anything, because it's only used to allocate fairly large chunks of memory that are then subdivided by the memory allocator proper. The latter is in the runtime of the programming environment you're using - usually the C runtime. – Kuba hasn't forgotten Monica Dec 20 '16 at 19:24
  • @Phlucious - only computations can benefit from multithreading, so using more threads for memory management can only result in a significant performance hit as different threads fight for synchronization. Placement new or custom allocators together with memory pools can reduce the number of allocations significantly, since you place multiple objects into one big chunk of memory. This results in significant savings in both CPU time and memory usage. On top of that it might also be more cache-friendly, since data is less fragmented. – dtech Dec 20 '16 at 22:25
  • The interesting thing is that the slowdown occurs on deallocation, when only a single thread is running anymore. – MikeMB Dec 20 '16 at 22:51
  • @ddriver I edited my post to add some simplistic sample code and performance metrics, if that helps put it into context. Does this look like what you described? – Phlucious Dec 21 '16 at 16:53
  • @Phlucious - I have yet to find the time to investigate the matter, and being quite busy, I am not sure when I will find the time to dig into this. It may be the runtime, it may be QtConcurrent, it may be QMap, some wacky combination, or who knows what. There is definitely something fishy. For the time being I will suggest the obvious - don't use multiple threads ;) Also, and again: memory pools and custom allocators. – dtech Dec 22 '16 at 03:53

(updating comment to answer)

It could be because with one thread all the allocations are sequential, so the frees are as well. With the multithreaded allocations, they are more intermixed, so free needs to do more work to clean up after each deallocation.
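
The effect can be reproduced without any threads at all. Reusing `results` from the question's sample code, randomizing the pointer order before deletion (essentially the test the asker confirmed in the comments) forces free to walk the heap in a scattered pattern:

/* requires <algorithm>; reuses 'results' from the question's sample code */
std::random_shuffle(results.begin(), results.end());

QElapsedTimer tdelete; tdelete.start();
qDeleteAll(results);    //shuffled order mimics the intermixed heap layout
                        //produced by multithreaded allocation
qDebug("DONE DELETE %f", double(tdelete.elapsed())/1000.0);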

1201ProgramAlarm
  • Confirmed. Randomizing the order of the `MyObject` pointers before deallocation resulted in a massive slowdown, even if the objects were created in a single thread. This is exacerbated significantly by multithreaded QMap insertions, because QMap creates its nodes on the heap, further fragmenting the objects, at least on my compiler and Qt 4.8. – Phlucious Dec 22 '16 at 16:26

When multiple threads allocate from a single memory pool, you create a bottleneck during deallocation, because the units being deleted sequentially are nonadjacent in memory.

If you are using fixed-size allocations, you should be able to leverage that into O(1) performance in your allocator/deallocator. A unit allocation system that places a bunch of blocks of the same size into a free list and then pushes/pops them as needed is something you should look into.
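
A minimal sketch of such a unit allocator (illustrative naming on my part; the 80-byte block size comes from the question):

#include <cstddef>

/* hypothetical O(1) unit allocator: fixed-size blocks chained into a
   free list; assumes count >= 1 */
class UnitAllocator
{
    union Block { Block* next; char payload[80]; }; //block >= object size

public:
    explicit UnitAllocator(std::size_t count)
        : m_blocks(new Block[count]), m_free(m_blocks)
    {
        for(std::size_t i = 0; i + 1 < count; ++i)  //chain the free list
            m_blocks[i].next = &m_blocks[i + 1];
        m_blocks[count - 1].next = 0;
    }
    ~UnitAllocator() { delete[] m_blocks; }

    void* alloc()               //O(1): pop the head of the free list
    {
        if(!m_free) return 0;   //pool exhausted; caller must handle this
        Block* b = m_free;
        m_free = b->next;
        return b;
    }

    void free(void* p)          //O(1): push the block back onto the list
    {
        Block* b = static_cast<Block*>(p);
        b->next = m_free;
        m_free = b;
    }

private:
    Block* m_blocks;
    Block* m_free;
};

Objects would be constructed into the returned blocks with placement new and torn down manually (obj->~MyObject(); pool.free(obj);), so the global heap is never touched after the pool is built.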

Michael Dorgan
  • I'm unclear on the meaning of "unit allocation system". Are you talking about managing the new/delete operations with a factory object that pre-allocates the objects and then passes them to the worker threads? – Phlucious Dec 20 '16 at 19:07
  • I'm talking about a memory allocation system that takes a memory pool, stores a bunch of pointers into it, spaced by the requested size, and places those pointers into a stack. When an alloc is requested, pop a pointer and return it. When a free happens, push the pointer back onto the stack. O(1) alloc and free. The cost is that your block size must be equal to or larger than the requested memory size, so you could be introducing internal fragmentation if it is used for anything besides the "80 byte" structures you mention. – Michael Dorgan Dec 21 '16 at 00:18
  • I've used allocators like this for many games over the years and the resulting speed up is startling. – Michael Dorgan Dec 21 '16 at 00:20
  • When you implement this, are you pre-allocating N objects on the main thread and then passing pointers to those objects, or are you allocating a "blank" memory pool with enough space for N objects (N*80 in my case) and using placement new? – Phlucious Dec 21 '16 at 17:00
  • Either could work. In my case, I was integrating this into our main memory allocator (replacing new/delete/malloc/free/etc.) and putting code in there that said "if the size is less than N and there are available unit blocks, use the unit allocator; else use whatever you would normally." In this way, I believe it is closer to your second case, though because I replace the allocators, I do not need to use placement new. On Windows, the placement new or pointer method might be easier though :) – Michael Dorgan Dec 21 '16 at 18:46

Memory allocation and freeing are known to be slow, because access to the heap is serialized. This serialization makes new and free thread-safe, but it also slows things down considerably.

It is common practice to pre-allocate a large block of memory if each piece is a fixed size.

Another way is to use memory-mapped files to bypass heap allocation. Qt has a memory-mapped file facility (QFile::map) that works on all platforms. You can try the approach in How do you serialize a QMap?
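
A minimal sketch of reading through a mapped file with Qt (the file name here is a placeholder):

#include <QFile>

QFile file("data.bin");     //placeholder path
if(file.open(QIODevice::ReadOnly))
{
    //map the whole file into the address space; reading through 'data'
    //requires no per-object heap allocation
    uchar* data = file.map(0, file.size());
    if(data)
    {
        //...parse records directly out of 'data'...
        file.unmap(data);
    }
}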

m. c.

I'd be tempted to allocate a relatively large block of memory for each thread and, within that thread, use it as if it were a stack (or as a circular buffer). This works nicely if you are always putting new objects in at one end and deleting them from the other, or if you can deallocate a group of objects in one step (as happens with a stack when a function call returns). Otherwise you need the new and delete functionality that you get from a heap, which, as you've discovered, can be a major performance bottleneck in some cases.
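
A per-thread bump allocator is the simplest form of that idea (a sketch with my own naming): allocation is just a pointer bump, and the whole region is released in one step when the arena goes away.

#include <cstddef>
#include <new>

/* hypothetical per-thread arena: alignment handling omitted for brevity */
class Arena
{
public:
    explicit Arena(std::size_t bytes)
        : m_begin(static_cast<char*>(::operator new(bytes))),
          m_cur(m_begin), m_end(m_begin + bytes) {}

    ~Arena() { ::operator delete(m_begin); }    //bulk free: one call, O(1)

    void* alloc(std::size_t size)               //just bump a pointer
    {
        if(m_cur + size > m_end) return 0;      //arena exhausted
        void* p = m_cur;
        m_cur += size;
        return p;
    }

private:
    char* m_begin;
    char* m_cur;
    char* m_end;
};

Objects are constructed into the returned memory with placement new; there is no per-object free at all, which is exactly what sidesteps the slow teardown.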

Edit: I do think we are missing the point, though. It doesn't really make sense that your deletes at the end are so slow. If I understand the code correctly, at that point you've only got the main thread running?

  • When I track the program with Very Sleepy, it looks like the QtConcurrent call spawns four threads, which do their work and then go dormant until the parent (main) thread dies. Does it matter that they still exist? – Phlucious Dec 21 '16 at 16:49
  • It shouldn't, unless they were somehow still holding mutexes that are used internally by new and delete. Another possibility is that you've got some sort of heap corruption that is being caused by the usage of QMap, and it's the heap corruption that is causing the deletion to take much longer than it normally would. I'm clutching at straws here though. – Stuart Fisher Dec 22 '16 at 17:57