I've been working on a C++ code for quantum chemistry, atomic and molecular tasks, which involves a lot of work with arrays (1D, 2D, 3D, etc.), and I have an entire class called array
to handle this. Of course, from the very beginning the most fundamental member functions are those that dynamically allocate memory for these arrays, resize them, or delete them. One of them contains a loop like this:
// allocate the array of row pointers
data = new double **[row]();

// split the per-row allocations across the threads of the enclosing parallel region
#pragma omp for schedule(static) nowait
for(unsigned int i = 0; i < row; ++i)
{
    data[i] = new double *[column]();
}
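For context, a simplified, self-contained sketch of what this allocation routine boils down to would be the following (depth is just a placeholder name for the innermost dimension; in the real class row, column, etc. are member variables and the code is spread across member functions):

double ***allocate3d(unsigned int row, unsigned int column, unsigned int depth)
{
    double ***data = new double **[row]();   // row pointers, zero-initialised

    #pragma omp parallel
    {
        #pragma omp for schedule(static) nowait
        for(unsigned int i = 0; i < row; ++i)
        {
            data[i] = new double *[column]();         // column pointers for row i
            for(unsigned int j = 0; j < column; ++j)
                data[i][j] = new double[depth]();     // innermost doubles, zeroed
        }
    }   // implicit barrier here, at the end of the parallel region

    return data;
}

In this sketch, each iteration of the parallelized outer loop ends up doing column + 1 calls to new.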
Now what I am doing is speeding up these routines with OpenMP. For most of the routines I've been using the schedule(static) and nowait
clauses to divide my loops into chunks of roughly step/threads iterations, since those chunks take almost the same time to be handled by their threads.
But for loops like the one above, with several calls to the new
operator, I have the (bad) feeling that the chunks of these loops do not take the same time to execute on their threads, in the sense that I should consider applying schedule(dynamic, chunk_size)
instead.
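Concretely, I mean changing the loop above to something like this (the chunk size of 4 is just an arbitrary example value that I would still have to tune):

#pragma omp for schedule(dynamic, 4) nowait   // rows are handed out to threads 4 at a time, on demand
for(unsigned int i = 0; i < row; ++i)
{
    data[i] = new double *[column]();
}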
Do you agree? Dynamic allocation is not such a simple task and can be expensive, so chunks of dynamic allocations could differ in their execution times.
Actually, I am also not sure whether I am making some kind of mistake concerning heap fragmentation or things like that. Advice would be welcome.
PS: I am using the nowait
clause to try to minimize the bottleneck of the implicit barriers at the end of the worksharing loops.
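Schematically, the pattern I have in mind is two worksharing loops back to back in the same parallel region, with nowait on the first one (again, depth is just a placeholder):

#pragma omp parallel
{
    // first worksharing loop: allocate the column pointers of each row
    #pragma omp for schedule(static) nowait
    for(unsigned int i = 0; i < row; ++i)
        data[i] = new double *[column]();

    // because of nowait, threads do not wait for each other here; with the
    // same static schedule each thread is assigned the same rows again, so it
    // only dereferences data[i] entries it has already allocated itself
    #pragma omp for schedule(static)
    for(unsigned int i = 0; i < row; ++i)
        for(unsigned int j = 0; j < column; ++j)
            data[i][j] = new double[depth]();
}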