World,

I am trying to run a C++ application (compiled in VS as an .exe) with multiple threads, using QThread or OpenMP parallelization. Each thread does multiple allocations/deallocations of memory to perform large matrix computations before solving equation systems built from these matrices with umfpack. Now, when I use too many threads, I lose performance because the threads block each other while doing this. I have read that memory (de)allocation is possible for only one thread at a time (like a mutex condition).

What I have tried already:

  • decreasing large reallocations as best I could
  • using different parallelization methods (Qt vs. OpenMP)
  • randomly changing the reserved and committed stack/heap size
  • making umfpack arrays threadprivate

In my setup, I am able to use ~4 threads (each thread uses ~1.5 GB RAM) before performance decreases. Interestingly, and something I haven't been able to wrap my head around yet, the performance drops only after a couple of threads have finished and new ones are taking over. Note also that the threads do not depend on each other, there are no other blocking conditions, and each thread runs for roughly the same amount of time (~2 min).

Is there an "easy way" - e.g. setting up heap/stack in a certain way - to solve this issue?

Here are some code snippets:

// Loop to start threads

forever
{
    if (sem.tryAcquire(1)) {
        QThread *t = new QThread();
        connect(t, SIGNAL(started()), aktBer, SLOT(doWork()));
        connect(aktBer, SIGNAL(workFinished()), t, SLOT(quit()));
        connect(t, SIGNAL(finished()), t, SLOT(deleteLater()));
        aktBer->moveToThread(t);
        t->start();
        sleep(1);
    }
    else {
        //... wait for threads to end before starting new ones
        //... eventually break
    }
    qApp->processEvents();
}

void doWork() {
    // Do initial matrix stuff...

    // Array pointers for the umfpack library. The umfpack_di_* routines
    // expect int index arrays (Ap, Ai) and double value arrays (Ax, x, b).
    static int *Ap = 0;
    static int *Ai = 0;
    static double *Ax = 0;
    static double *x = 0;
    static double *b = 0;

    // Private static variables per thread
    #pragma omp threadprivate(Ap, Ai, Ax, x, b)

    // Solving -> this is the part where the threads block each other. Note that
    // there are other functions with matrix operations, which also
    // (de-)allocate a lot.
    status = umfpack_di_solve(UMFPACK_A, Ap, Ai, Ax, x, b, /*...*/);

    emit workFinished();
}
  • You could try to preallocate into pools, or switch to a different allocator that doesn't serialize all allocations and deallocations. See https://stackoverflow.com/q/147298/103167 – Ben Voigt Mar 17 '22 at 17:27
  • Thank you. Would it be sufficient to use a new allocator to instantiate the thread objects, or would I have to replace all "new" statements in my code? – Thaddäus Kreisig Mar 17 '22 at 17:38
  • A good allocator will have an option to replace the system allocator (in C++ it is named `::operator new()`) so you don't have to rewrite code. Based on your statement that the contention happens in the matrix operations, simply changing allocation of the Thread object would not be enough. – Ben Voigt Mar 17 '22 at 17:45
  • For example Hoard says ["No source code changes necessary"](http://hoard.org) – Ben Voigt Mar 17 '22 at 17:46
  • Reminder - there is a third choice - static. You can just reserve a honking big array in static data – pm100 Mar 17 '22 at 17:47
  • @pm100: No, not when "each thread uses 1.5 GB RAM" – Ben Voigt Mar 17 '22 at 17:47
  • @BenVoigt - yesterday I was answering a question where OP wanted a ~100 gb 2d array, a few 1.5 gb aint nothing :-) – pm100 Mar 17 '22 at 17:59
  • @pm100: Even so, it's too large for static allocation. Now, static memory layout in a single object that becomes a single dynamic allocation, yes. But a multi-GB `.bss` segment is a really bad idea. – Ben Voigt Mar 17 '22 at 18:43
  • Hi @BenVoigt, I tried mimalloc now, but the congestion remains during allocation of a vector. Is it possible to use mimalloc for that, too? – Thaddäus Kreisig Apr 19 '22 at 15:06
  • @ThaddäusKreisig: One of the template parameters for `vector` is the `allocator` object to use. If you are trying out an allocator that doesn't replace the system allocator but has to be called explicitly, that allocator parameter is how you configure `std::vector` to do so. – Ben Voigt Apr 19 '22 at 15:25
  • Ah ok. I tried to override the system allocator, but was wondering whether that went wrong, so I wanted to ensure that I am using the mimalloc allocator for particular vectors. Thanks for your response. – Thaddäus Kreisig Apr 19 '22 at 15:41
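
To illustrate the comment about the `allocator` template parameter of `std::vector`, here is a minimal sketch. `MallocAllocator`, `Matrix` and `sum_of_ones` are hypothetical names made up for this example, and `std::malloc`/`std::free` only stand in for the calls of whatever allocator library is actually used. As far as I know, mimalloc already ships such an adapter as `mi_stl_allocator<T>` in `mimalloc.h`, so with mimalloc you would probably use that instead of writing your own.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>
#include <vector>

// Hypothetical minimal allocator, for illustration only.
// std::malloc/std::free are placeholders for the real allocator's calls.
template <class T>
struct MallocAllocator {
    using value_type = T;

    MallocAllocator() noexcept = default;
    template <class U>
    MallocAllocator(const MallocAllocator<U>&) noexcept {}

    T* allocate(std::size_t n) {
        // Guard against overflow of n * sizeof(T).
        if (n > static_cast<std::size_t>(-1) / sizeof(T))
            throw std::bad_array_new_length();
        // Request at least one byte so a null result always means failure.
        void* p = std::malloc(n ? n * sizeof(T) : 1);
        if (!p) throw std::bad_alloc();
        return static_cast<T*>(p);
    }

    void deallocate(T* p, std::size_t) noexcept { std::free(p); }
};

// The allocator is stateless, so all instances compare equal.
template <class T, class U>
bool operator==(const MallocAllocator<T>&, const MallocAllocator<U>&) noexcept { return true; }
template <class T, class U>
bool operator!=(const MallocAllocator<T>&, const MallocAllocator<U>&) noexcept { return false; }

// A vector type whose storage goes through the explicit allocator.
using Matrix = std::vector<double, MallocAllocator<double>>;

// Small usage example: fill a vector with ones and sum it up.
double sum_of_ones(std::size_t n) {
    Matrix m(n, 1.0);
    assert(m.size() == n);
    double s = 0.0;
    for (double v : m) s += v;
    return s;
}
```

Only containers declared with the custom type (here `Matrix`) use the explicit allocator. Everything else, including allocations made inside third-party libraries, still goes through the default one.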

1 Answer

For those who are interested in my solution: I included another allocator in my app (as @Ben Voigt suggested). In my case, I chose mimalloc, as it seems to get regular maintenance (even by Microsoft itself) and can be included pretty easily. See here: Mimalloc
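
For readers wondering what "including another allocator" means mechanically, here is a minimal sketch of replacing the global `operator new`/`operator delete`, which is the hook such allocators use so that existing `new` statements and standard containers need no source changes. This is not my actual setup: `g_new_calls` and `new_calls_for_vector` are hypothetical names, and `std::malloc`/`std::free` only stand in for the allocator's own functions. With mimalloc, as far as I know, the same effect is achieved by including its `mimalloc-new-delete.h` header in exactly one source file, or by the dynamic override described in its README.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>
#include <vector>

// Counter used only to show that the replacement functions are really called.
static std::atomic<std::size_t> g_new_calls{0};

// Replacement for the global allocation function. A real allocator library
// would route this to its own thread-friendly malloc instead of std::malloc.
void* operator new(std::size_t size) {
    g_new_calls.fetch_add(1, std::memory_order_relaxed);
    if (void* p = std::malloc(size ? size : 1)) return p;
    throw std::bad_alloc();
}

// Matching deallocation functions (unsized and sized forms).
void operator delete(void* p) noexcept { std::free(p); }
void operator delete(void* p, std::size_t) noexcept { std::free(p); }

// Usage example: an ordinary std::vector, without any custom allocator
// parameter, now allocates through the replacement above.
std::size_t new_calls_for_vector(std::size_t n) {
    const std::size_t before = g_new_calls.load();
    std::vector<double> v(n, 0.0);
    assert(v.size() == n);
    return g_new_calls.load() - before;
}
```

A counter like this is also a cheap way to check whether an override is active at all, which is worth doing when the contention does not go away after switching allocators.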

  • While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes. - [From Review](/review/late-answers/32132390) – Rohit Gupta Jul 06 '22 at 03:40