openmp: recursive task slower than sequential even with depth limit

Question

I have a very big binary tree on which I want to do certain expensive calculations. I want to parallelize these methods using openmp and its task pragmas.

As a test, I have parallelized the freeTree function and I have created a big example test tree. In order to prevent spawning many tasks, I have limited the tasks creation to the top two levels of the tree. So effectively only 4 tasks are created.

Below is the minimal working example:

#include <chrono>
#include <iostream>

class Node {
public:
    int data;
    Node* l;
    Node* r;

    Node(Node* left, Node* right) : l(left), r(right) {}
};

Node* createRandomTree(int depth) {
    if (depth == 0)
        return new Node(NULL, NULL);

    return new Node(createRandomTree(depth - 1), createRandomTree(depth - 1));
}

void freeTree(Node* tree) {
    if (tree == NULL) return;

    freeTree(tree->l);
    freeTree(tree->r);

    delete tree;
}

void freeTreePar(Node* tree, int n = 0) {
    if (tree == NULL) return;

    Node *l = tree->l, *r = tree->r;

    if (n < 2) {
        #pragma omp task
        freeTreePar(l, n + 1);
        #pragma omp task
        freeTreePar(r, n + 1);
    } else {
        freeTree(tree->l);
        freeTree(tree->r);
    }

    // taskwait is not necessary
    delete tree;
}

int main(int argc, char const *argv[])
{
    std::chrono::time_point<std::chrono::system_clock> start, end;

    Node* tree = createRandomTree(22);
    start = std::chrono::system_clock::now();

    #pragma omp parallel shared(tree)
    {
        #pragma omp single nowait
        freeTreePar(tree);
    }
    end = std::chrono::system_clock::now();

    std::chrono::duration<double> elapsed_seconds = end-start;
    std::time_t end_time = std::chrono::system_clock::to_time_t(end);

    std::cout << "finished computation at " << std::ctime(&end_time)
              << "elapsed time: " << elapsed_seconds.count() << "s\n";

    return 0;
}

When I run this code, it takes about 0.38 seconds to free the tree. However, if I just call freeTree(root) directly, it only takes 0.2 seconds. So even though the tree is very big (2^22 elements) and even though in this particular test case the tasks are of equal size, I cannot get a performance increase using parallel methods.

Am I doing something wrong? Do you know how I can improve this code? Thanks!

Axel Borja · Accepted Answer · 2014-01-31T14:32:56.690

4

Some tasks are not really parallelizable, because some resources are only accessible by one thread at a time (Thread Safety). And this is the case of dynamic memory.

malloc/free is now mostly thread safe.

=> A lock will be performed around each malloc/free.

So, You cannot easily improve this kind of code.

edited Jan 31 '14 at 14:32

answered Jan 31 '14 at 14:27

Axel Borja

3,718
7
36
50

There are specially designed (nearly) lock-free memory allocators that can be used as drop-in replacements for the one provided by (g)libc. Some of them are listed in [this question](http://stackoverflow.com/questions/147298/multithreaded-memory-allocators-for-c-c). – Hristo Iliev Feb 01 '14 at 10:46

openmp: recursive task slower than sequential even with depth limit

1 Answers1