Multi-Threading Loop Through 4D Array C++

Question

I have many functions that loop over data stored in a 4D array. We use OpenMP to parallelize these loops where it makes the most sense and where threads do not overwrite each other's data.

For instance I have the following snippet of code:

void MaximumCombiner::createComplexData(double**** input, const int nX, const int nY,
                                        const int nZ, const int nA, std::complex<double>**** output,
                                        const double* beamData)
{
    for(int iX = 0; iX < nX; ++iX)
    {
        for(int iY = 0 ; iY < nY ; ++iY)
        {
            std::complex<double> complexArg(0.0, (beamData[iY] * M_PI / 180.0));
            std::complex<double> complexExp = std::exp(complexArg);

            for(int iZ = 0; iZ < nZ; ++iZ)
            {
                for(int iA = 0 ; iA < nA ; ++iA)
                {
                    output[iX][iY][iZ][iA] = input[iX][iY][iZ][iA] * complexExp;
                }
            }
        }
    }
}

Originally, I thought I should add a #pragma omp parallel for before each for-loop, but now I am wondering whether I am spending more time on the overhead of creating and destroying threads than on actual work. I have also tried placing a #pragma omp parallel above the first for-loop with a #pragma omp for on one of the inner loops, but I am not sure that is best either. What should I look for when deciding where to place my OpenMP directives?

Profile the various options and see what's empirically best. Making assumptions about the overhead of thread creation without hard numbers is only going to lead to pain. — random passerby, May 10 '16 at 22:32
What is your recommendation on how to profile the options? Just use Valgrind? Without the threading and see which chunks take the longest? — Todd, May 10 '16 at 22:56
If you have valgrind, you'll have gprof, the gnu profiler. If you want to profile, I'd start there. I'd also take a look at what that `double**** input` is doing to your CPU's ability to cache. Cachegrind may be helpful here. Profile first, of course, but you may want to make a wrapper class around a 1D array of `nX * nY * nZ * nA` that presents itself to its users as a 4D array. At the very least, you only need to pass around one object instead of one mega pointer and four sizes. — user4581301, May 11 '16 at 00:02
As usual, **don't guess, measure** is the recommended approach. Usually the primary question for a profiling tool is: *Where is time spent*. You want to know how the time for this function changes with versions. This is more like a benchmarking question. Still, you can use most performance analysis tools (see [q1](https://stackoverflow.com/questions/375913/what-can-i-use-to-profile-c-code-in-linux) (not the most topvoted), [q2](https://stackoverflow.com/questions/26663/whats-your-favorite-profiling-tool-for-c)). You can also use a small benchmarking functionality with `omp_get_wtime`. — Zulan, May 11 '16 at 07:07
Just put `#pragma omp parallel for` right before the outermost loop and see how it works. You could also try `#pragma omp parallel for collapse(2)` (or move `complexArg` and `complexExp` inside the innermost loop, your compiler should be smart enough to not recalculate them every iteration, and use `collapse(4)`) but I doubt it will make a difference. In any case what you are doing is very likely memory bandwidth bound so it won't scale with the number of cores anyway. — Z boson, May 12 '16 at 07:45
Silly question maybe, but how are you using multi-dimensional array notation with a pointer? There is nothing that says your multi-dimensional array is one contiguous block of memory. And how is the compiler supposed to know what the dimensions are? — Z boson, May 12 '16 at 07:51
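The flat-storage and `collapse(2)` suggestions from the comments can be sketched together as follows. This is only an illustration under stated assumptions: `Grid4D` and `createComplexDataFlat` are hypothetical names that do not appear in the question, the layout is assumed to be row-major with `iA` varying fastest, and `kPi` stands in for `M_PI`. Whether it is faster than the pointer-to-pointer version still has to be measured.

```cpp
#include <complex>
#include <cstddef>
#include <vector>

// Stand-in for M_PI, which is a common extension rather than standard C++.
constexpr double kPi = 3.14159265358979323846;

// Hypothetical wrapper: one contiguous 1D buffer presented as a 4D array.
template <typename T>
class Grid4D {
public:
    Grid4D(int nX, int nY, int nZ, int nA)
        : nX_(nX), nY_(nY), nZ_(nZ), nA_(nA),
          data_(static_cast<std::size_t>(nX) * nY * nZ * nA) {}

    T& operator()(int x, int y, int z, int a) { return data_[index(x, y, z, a)]; }
    const T& operator()(int x, int y, int z, int a) const { return data_[index(x, y, z, a)]; }

    int nX() const { return nX_; }
    int nY() const { return nY_; }
    int nZ() const { return nZ_; }
    int nA() const { return nA_; }

private:
    // Row-major index: the last dimension (a) is contiguous in memory.
    std::size_t index(int x, int y, int z, int a) const {
        return ((static_cast<std::size_t>(x) * nY_ + y) * nZ_ + z) * nA_ + a;
    }

    int nX_, nY_, nZ_, nA_;
    std::vector<T> data_;
};

// Hypothetical flat-storage variant of the question's function.
// Assumes output has the same dimensions as input.
void createComplexDataFlat(const Grid4D<double>& input,
                           Grid4D<std::complex<double>>& output,
                           const double* beamData)
{
    // Hoist the bounds so the collapsed loops have loop-invariant limits.
    const int nX = input.nX(), nY = input.nY(), nZ = input.nZ(), nA = input.nA();

    // collapse(2) merges the iX and iY loops into one iteration space of
    // nX * nY items, which helps when nX alone is small relative to the
    // number of threads.
    #pragma omp parallel for collapse(2)
    for (int iX = 0; iX < nX; ++iX) {
        for (int iY = 0; iY < nY; ++iY) {
            const std::complex<double> complexExp =
                std::exp(std::complex<double>(0.0, beamData[iY] * kPi / 180.0));
            for (int iZ = 0; iZ < nZ; ++iZ) {
                for (int iA = 0; iA < nA; ++iA) {
                    output(iX, iY, iZ, iA) = input(iX, iY, iZ, iA) * complexExp;
                }
            }
        }
    }
}
```

A caller would construct both grids with the same four sizes and pass them by reference, so only two objects travel through the interface instead of two quadruple pointers and four sizes.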

kangshiyin · Answer 1 · 2016-05-12T18:00:02.670

Generally, the straightforward way is to put a single #pragma omp parallel for on the outermost for-loop. The parallel region is then entered only once per call, which gives you the smallest thread creation/deletion overhead.
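A minimal sketch of a single directive on the outermost loop, using the combined parallel for form so the loop is actually executed by a team of threads. It is written as a free function named createComplexDataParallel (a hypothetical name, chosen only to keep the example self-contained), and kPi is a local stand-in for M_PI. The loop body is the same as in the question. Built without OpenMP support (for example, without -fopenmp on GCC), the pragma is ignored and the function runs serially.

```cpp
#include <complex>

// Stand-in for M_PI, which is a common extension rather than standard C++.
constexpr double kPi = 3.14159265358979323846;

// Hypothetical free-function version of MaximumCombiner::createComplexData.
void createComplexDataParallel(double**** input, const int nX, const int nY,
                               const int nZ, const int nA,
                               std::complex<double>**** output,
                               const double* beamData)
{
    // One parallel region for the whole call: iterations of the iX loop are
    // divided among the threads. Variables declared inside the loop body are
    // private to each thread, and each iX writes a disjoint part of output.
    #pragma omp parallel for
    for (int iX = 0; iX < nX; ++iX)
    {
        for (int iY = 0; iY < nY; ++iY)
        {
            const std::complex<double> complexArg(0.0, beamData[iY] * kPi / 180.0);
            const std::complex<double> complexExp = std::exp(complexArg);

            for (int iZ = 0; iZ < nZ; ++iZ)
            {
                for (int iA = 0; iA < nA; ++iA)
                {
                    output[iX][iY][iZ][iA] = input[iX][iY][iZ][iA] * complexExp;
                }
            }
        }
    }
}
```

If nX is small compared with the number of threads, some threads will sit idle with this placement. That is one of the things to check when measuring.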

However, this overhead may not be the bottleneck, so you still need to profile your code to find the best placement, as suggested in the comments.
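For a quick comparison between pragma placements, a small timing helper built on omp_get_wtime (mentioned in the comments) may be enough before reaching for a full profiler. The sketch below is an assumption-laden illustration: wallSeconds and bestOfSeconds are hypothetical helper names, and the std::chrono fallback is there only so the code also builds without OpenMP.

```cpp
#include <chrono>
#include <limits>
#ifdef _OPENMP
#include <omp.h>
#endif

// Wall-clock time in seconds: omp_get_wtime when built with OpenMP,
// otherwise std::chrono::steady_clock.
inline double wallSeconds()
{
#ifdef _OPENMP
    return omp_get_wtime();
#else
    using clock = std::chrono::steady_clock;
    return std::chrono::duration<double>(clock::now().time_since_epoch()).count();
#endif
}

// Runs work() `repeats` times and returns the fastest run in seconds.
// Taking the minimum reduces noise from other processes and from the
// one-time cost of starting the thread pool on the first parallel region.
template <typename Work>
double bestOfSeconds(Work&& work, int repeats = 5)
{
    double best = std::numeric_limits<double>::infinity();
    for (int r = 0; r < repeats; ++r)
    {
        const double start = wallSeconds();
        work();
        const double elapsed = wallSeconds() - start;
        if (elapsed < best)
        {
            best = elapsed;
        }
    }
    return best;
}
```

One way to use it is to wrap each variant of the function in a lambda, for example bestOfSeconds([&] { combiner.createComplexData(input, nX, nY, nZ, nA, output, beamData); }), and compare the results for realistic array sizes and thread counts. Here combiner stands for whatever MaximumCombiner instance you already have.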

edited May 12 '16 at 18:00

answered May 11 '16 at 19:27
