I have a very large array of ~30M objects, approximately 80 bytes apiece (that's ~2.2GB for those following along), stored on disk. The actual size of each object varies a little because each one has a QMap<quint32, QVariant> child.
Unpacking those objects from raw data is expensive, so I've implemented a multithreaded read operation that pulls a few MB from disk sequentially and then passes each raw data block to a thread to be unpacked in parallel via QtConcurrent. My objects are created (via new) on the heap inside the worker threads and then passed back to the main thread for the next step. Upon completion, these objects are deleted on the main thread.
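The pipeline looks roughly like this (a minimal sketch only; the file name, the 4 MB chunk size, and the body of unpackChunk are illustrative placeholders, not my real implementation):

QFile file("objects.dat"); //placeholder path
file.open(QIODevice::ReadOnly);

/* QtConcurrent requires std::function for callables that return a value */
std::function<QList<MyObject*>(const QByteArray&)>
unpackChunk = [](const QByteArray& raw) -> QList<MyObject*>
{
    QList<MyObject*> objs;
    //...parse 'raw' and 'new' one MyObject per record (the expensive part)
    return objs;
};

/* read sequentially on this thread; unpack each block on the pool */
QList<QFuture<QList<MyObject*>>> pending;
while(!file.atEnd())
    pending.append(QtConcurrent::run(unpackChunk, file.read(4 * 1024 * 1024)));

/* the heap-allocated objects come back to the main thread... */
QList<MyObject*> results;
for(int i = 0; i < pending.size(); ++i)
    results.append(pending[i].result()); //result() blocks until that chunk is unpacked

/* ...and are eventually deleted here with qDeleteAll(results) */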
In a single-threaded environment, this deallocation is relatively fast (~4-5 seconds). However, when multithreaded across 4 threads, this deallocation is incredibly slow (~26-36 seconds). Profiling this with Very Sleepy indicates that the slowdown is in MSVCR100 free, so it's the deallocation itself that is slow.
Searching around SO suggests that allocating and deallocating on different threads is safe. What is the source of the slowdown, and what can I do about it?
Edit: here is some sample code communicating the idea of what's going on. For the sake of troubleshooting, I have completely removed the disk IO from this example and simply create the objects and then delete them.
class MyObject
{
public:
    MyObject() { /* set defaults... irrelevant here */ }
    ~MyObject() {}

    QMap<quint32, QVariant> map;
    //...other members
};
//...
QList<MyObject*> results;
/* set up the mapped lambda functor (QtConcurrent requires std::function if the functor returns a value) */
std::function<QList<MyObject*>(quint64 chunksize)>
importMap = [](quint64 chunksize) -> QList<MyObject*>
{
    QList<MyObject*> objs;
    for(quint64 i = 0; i < chunksize; ++i)
    {
        MyObject* obj = new MyObject();
        obj->map.insert(0, 1); //ran with and without the map insertions
        obj->map.insert(1, 2);
        objs.append(obj);
    }
    return objs;
}; //end import map lambda
/* set up the reduce lambda functor */
auto importReduce = [&results](bool& /*noreturn*/, const QList<MyObject*>& chunkimported)
{
    results.append(chunkimported);
}; //end import reduce lambda
/* chunk up the data for import */
quint64 totalcount = 31833986;
quint64 chunksize = 500000;
QList<quint64> chunklist;
while(totalcount >= chunksize)
{
    totalcount -= chunksize;
    chunklist.append(chunksize);
}
if(totalcount > 0)
    chunklist.append(totalcount);
/* create the objects concurrently */
QThreadPool::globalInstance()->setMaxThreadCount(1); //4 for the multithreaded run
QElapsedTimer tnew; tnew.start();
QFuture<bool> import = QtConcurrent::mappedReduced<bool>(chunklist, importMap, importReduce,
                           QtConcurrent::OrderedReduce | QtConcurrent::SequentialReduce);
import.waitForFinished(); //block until the import finishes so the timing (and 'results') is complete
qDebug("DONE NEW %f", double(tnew.elapsed())/1000.0);

//do stuff with the objects here

/* delete the objects */
QElapsedTimer tdelete; tdelete.start();
qDeleteAll(results);
qDebug("DONE DELETE %f", double(tdelete.elapsed())/1000.0);
Here are the results with and without inserting data into MyObject::map, and with 1 or 4 threads available to QtConcurrent:
- 1 Thread: tnew = 2.7 seconds; tdelete = 1.1 seconds
- 4 Threads: tnew = 1.8 seconds; tdelete = 2.7 seconds
- 1 Thread + QMap: tnew = 8.6 seconds; tdelete = 4.6 seconds
- 4 Threads + QMap: tnew = 4.0 seconds; tdelete = 48.1 seconds
In both scenarios, it takes significantly longer to delete the objects when they were created in parallel on 4 threads than in serial on 1 thread, and the effect is exacerbated further by inserting into the QMap in parallel.
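For what it's worth, the obvious counter-experiment would be to hand the deletion back to the pool. This is a sketch only: chunkedResults is a hypothetical variant in which the reduce step keeps the per-chunk lists instead of flattening them into results.

QList<QList<MyObject*>> chunkedResults; //filled by a modified reduce step (hypothetical)

/* delete each chunk on a pool thread (not necessarily the one that allocated it) */
std::function<bool(const QList<MyObject*>&)>
deleteChunk = [](const QList<MyObject*>& chunk) -> bool
{
    qDeleteAll(chunk);
    return true;
};

QElapsedTimer tpooldelete; tpooldelete.start();
QtConcurrent::mapped(chunkedResults, deleteChunk).waitForFinished();
qDebug("DONE POOL DELETE %f", double(tpooldelete.elapsed())/1000.0);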