World,
I am trying to run a C++ application (compiled in VS as an .exe) with multiple threads, using either QThread or OpenMP parallelization. Each thread performs many memory allocations and deallocations for large matrix computations, then solves equation systems built from these matrices with umfpack. When I use too many threads, I lose performance because the threads block each other during these steps. I have read that memory (de)allocation can only be done by one thread at a time (as if it were guarded by a mutex).
What I have tried already:
- reducing large reallocations as best I could
- using different parallelization methods (Qt vs. OpenMP)
- randomly changing the reserved and committed stack/heap size
- making umfpack arrays threadprivate
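As a rough illustration of the first and last items (not the actual project code), a per-thread workspace that is sized once and then reused might look like the sketch below. It uses standard C++ `thread_local` instead of the OpenMP pragma, and the names `SolverWorkspace`, `prepare` and `threadWorkspace` are hypothetical. It only covers the caller's own arrays and does not change any allocations made inside umfpack itself.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-thread scratch space; all names are illustrative only.
struct SolverWorkspace {
    std::vector<int> Ap;     // column pointers (size n + 1)
    std::vector<int> Ai;     // row indices (size nnz)
    std::vector<double> Ax;  // nonzero values (size nnz)
    std::vector<double> x;   // solution vector (size n)
    std::vector<double> b;   // right-hand side (size n)

    // Resize for a system with n unknowns and nnz nonzeros.
    // std::vector::resize never reduces capacity, so once the buffers
    // have grown to the largest problem size, later calls with the same
    // or smaller sizes do not touch the heap.
    void prepare(std::size_t n, std::size_t nnz) {
        assert(n > 0);
        Ap.resize(n + 1);
        Ai.resize(nnz);
        Ax.resize(nnz);
        x.resize(n);
        b.resize(n);
    }
};

// One workspace per thread, created on first use in that thread.
SolverWorkspace &threadWorkspace() {
    thread_local SolverWorkspace ws;
    return ws;
}
```

A worker would call `threadWorkspace().prepare(n, nnz)` at the start of each solve and pass `ws.Ap.data()`, `ws.Ai.data()` and so on to the solver.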
In my setup, I am able to use ~4 threads (each thread uses ~1.5 GB RAM) before performance decreases. Interestingly - and this is something I haven't been able to wrap my head around yet - the performance drops only after a couple of threads have finished and new ones are taking over. Note also that the threads do not depend on each other, there are no other blocking conditions, and each thread runs for roughly the same amount of time (~2 min).
Is there an "easy way" - e.g. setting up heap/stack in a certain way - to solve this issue?
Here are some code snippets:
// Loop to start threads
forever
{
    if (sem.tryAcquire(1)) {
        QThread *t = new QThread();
        connect(t, SIGNAL(started()), aktBer, SLOT(doWork()));
        connect(aktBer, SIGNAL(workFinished()), t, SLOT(quit()));
        connect(t, SIGNAL(finished()), t, SLOT(deleteLater()));
        aktBer->moveToThread(t);
        t->start();
        sleep(1);
    }
    else {
        //... wait for threads to end before starting new ones
        //... eventually break
    }
    qApp->processEvents();
}
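For comparison, the same "at most N workers at a time" launch pattern can be sketched in standard C++ without Qt. This is only an illustration under assumptions: `Slots` and `runBounded` are hypothetical names, the semaphore is hand-built from a mutex and a condition variable, and the jobs stand in for the real `doWork()`. Unlike the loop above, the launcher blocks until a slot is free instead of polling.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Minimal counting semaphore built from a mutex and a condition variable.
class Slots {
public:
    explicit Slots(int count) : free_(count) { assert(count > 0); }
    void acquire() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return free_ > 0; });
        --free_;
    }
    void release() {
        {
            std::lock_guard<std::mutex> lock(m_);
            ++free_;
        }
        cv_.notify_one();
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    int free_;
};

// Runs every job with at most maxParallel of them running at once.
// Returns the highest number of jobs observed running simultaneously.
int runBounded(const std::vector<std::function<void()>> &jobs, int maxParallel) {
    Slots slots(maxParallel);
    std::atomic<int> active{0};
    std::atomic<int> peak{0};
    std::vector<std::thread> threads;
    threads.reserve(jobs.size());
    for (const auto &job : jobs) {
        slots.acquire();  // blocks until a slot is free
        threads.emplace_back([&slots, &active, &peak, &job] {
            int now = ++active;
            int seen = peak.load();
            // Record a new peak if this thread raised the active count.
            while (now > seen && !peak.compare_exchange_weak(seen, now)) {}
            job();
            --active;
            slots.release();
        });
    }
    for (auto &t : threads) t.join();
    return peak.load();
}
```

Calling `runBounded(jobs, 4)` would correspond to the ~4 concurrent threads mentioned above.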
void doWork() {
    // Do initial matrix stuff...

    // Initializing array pointers for umfpack-lib
    // (umfpack_di_* takes int index arrays and double value arrays)
    static int *Ap=0;
    static int *Ai=0;
    static double *Ax=0;
    static double *x=0;
    static double *b=0;

    // Thread-private static variables
    #pragma omp threadprivate(Ap, Ai, Acol, Arow)

    // Solving -> this is the part where the threads block each other.
    // Note that there are other functions with matrix operations, which
    // also (de)allocate a lot.
    status = umfpack_di_solve (UMFPACK_A, Ap,Ai,Ax,x,b, /*...*/);

    emit(workFinished());
}