
I have many functions that loop over data stored in a 4D array. We use OpenMP to parallelize these loops where it makes the most sense and where iterations do not overwrite each other's data.

For instance I have the following snippet of code:

void MaximumCombiner::createComplexData(double**** input, const int nX, const int nY,
                                        const int nZ, const int nA, std::complex<double>**** output,
                                        const double* beamData)
{
    for(int iX = 0; iX < nX; ++iX)
    {
        for(int iY = 0 ; iY < nY ; ++iY)
        {
            std::complex<double> complexArg(0.0, (beamData[iY] * M_PI / 180.0));
            std::complex<double> complexExp = std::exp(complexArg);

            for(int iZ = 0; iZ < nZ; ++iZ)
            {
                for(int iA = 0 ; iA < nA ; ++iA)
                {
                    output[iX][iY][iZ][iA] = input[iX][iY][iZ][iA] * complexExp;
                }
            }
        }
    }
}

Originally, I thought I should add a #pragma omp parallel for before each for-loop, but now I am wondering whether I am spending more time on the overhead of creating and destroying threads than on actual work. I have also tried putting #pragma omp parallel above the first for-loop and a #pragma omp for on one of the inner loops, but I am not sure that is best either. What should I look for when deciding where to place my OpenMP directives?

Todd
  • 1
    Woo! 4 star programming! – user4581301 May 10 '16 at 21:05
  • 2
    Profile the various options and see what's empirically best. Making assumptions about the overhead of thread creation without hard numbers is only going to lead to pain. – random passerby May 10 '16 at 22:32
  • What is your recommendation on how to profile the options? Just use Valgrind? Without the threading and see which chunks take the longest? – Todd May 10 '16 at 22:56
  • 1
    If you have valgrind, you'll have gprof, the gnu profiler. If you want to profile, I'd start there. I'd also take a look at what that `double**** input` is doing to your CPU's ability to cache. Cachegrind may be helpful here. Profile first, of course, but you may want to make a wrapper class around a 1D array of `nX * nY * nZ * nA` that presents itself to its users as a 4D array. At the very least, you only need to pass around one object instead of one mega pointer and four sizes. – user4581301 May 11 '16 at 00:02
  • As usual, **don't guess, measure** is the recommended approach. Usually the primary question for a profiling tool is: *Where is time spent*. You want to know how the time for this function changes with versions. This is more like a benchmarking question. Still, you can use most performance analysis tools (see [q1](https://stackoverflow.com/questions/375913/what-can-i-use-to-profile-c-code-in-linux) (not the most topvoted), [q2](https://stackoverflow.com/questions/26663/whats-your-favorite-profiling-tool-for-c)). You can also use a small benchmarking functionality with `omp_get_wtime`. – Zulan May 11 '16 at 07:07
  • Just put `#pragma omp parallel for` right before the outermost loop and see how it works. You could also try `#pragma omp parallel for collapse(2)` (or move `complexArg` and `complexExp` inside the innermost loop, your compiler should be smart enough to not recalculate them every iteration, and use `collapse(4)`) but I doubt it will make a difference. In any case what you are doing is very likely memory bandwidth bound so it won't scale with the number of cores anyway. – Z boson May 12 '16 at 07:45
  • Silly question maybe, but how are you using multi-dimensional array notation with pointers? There is nothing that says your multi-dimensional array is one contiguous block of memory. And how is the compiler supposed to know what the dimensions are? – Z boson May 12 '16 at 07:51

1 Answer


Generally, the straightforward way is to put a single #pragma omp parallel for on the outermost for-loop. This creates only one parallel region per call, which keeps the thread creation/deletion overhead to a minimum.
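As a minimal sketch of that placement: the function below is a hypothetical free-function variant of createComplexData (the name createComplexDataFlat is made up here). It assumes flat, row-major std::vector buffers of size nX * nY * nZ * nA instead of the double**** from the question, purely so the example is self-contained. The only OpenMP directive is on the outermost loop.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Hypothetical variant of createComplexData working on flat, row-major
// buffers (an assumption; the question uses double**** and
// std::complex<double>****).
void createComplexDataFlat(const std::vector<double>& input,
                           int nX, int nY, int nZ, int nA,
                           std::vector<std::complex<double>>& output,
                           const std::vector<double>& beamData)
{
    const double pi = std::acos(-1.0);
    const std::size_t inner = static_cast<std::size_t>(nZ) * nA;
    const std::size_t total = static_cast<std::size_t>(nX) * nY * inner;
    assert(input.size() == total);
    assert(beamData.size() == static_cast<std::size_t>(nY));
    output.resize(total);

    // One parallel region, work-shared over the outermost loop only.
    // If OpenMP is not enabled, the pragma is ignored and the loop runs serially.
    #pragma omp parallel for
    for (int iX = 0; iX < nX; ++iX)
    {
        for (int iY = 0; iY < nY; ++iY)
        {
            // Phase factor depends only on iY, so compute it once per (iX, iY).
            const std::complex<double> complexExp =
                std::exp(std::complex<double>(0.0, beamData[iY] * pi / 180.0));
            const std::size_t base =
                (static_cast<std::size_t>(iX) * nY + iY) * inner;
            for (std::size_t k = 0; k < inner; ++k)
            {
                output[base + k] = input[base + k] * complexExp;
            }
        }
    }
}
```

Each value of iX writes to a disjoint slice of output, so no synchronization is needed inside the loop. With GCC or Clang, the directive takes effect only when compiling with -fopenmp.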

However, this overhead may not be the bottleneck, so you still need to profile your code to find out which placement works best, as suggested in the comments.
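For a quick comparison between pragma placements, a small timing helper is often enough before reaching for a full profiler. The sketch below is only an illustration: bestWallTime is a hypothetical name, and it uses omp_get_wtime (mentioned in the comments) when compiled with OpenMP, falling back to std::chrono otherwise.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper: runs `work` `repeats` times and returns the smallest
// wall-clock time in seconds. Taking the minimum reduces noise from other
// processes and from first-call effects such as thread-pool start-up.
double bestWallTime(const std::function<void()>& work, int repeats)
{
    assert(repeats > 0);
    double best = -1.0;
    for (int r = 0; r < repeats; ++r)
    {
#ifdef _OPENMP
        const double start = omp_get_wtime();
        work();
        const double elapsed = omp_get_wtime() - start;
#else
        const auto start = std::chrono::steady_clock::now();
        work();
        const double elapsed = std::chrono::duration<double>(
            std::chrono::steady_clock::now() - start).count();
#endif
        if (best < 0.0 || elapsed < best)
        {
            best = elapsed;
        }
    }
    return best;
}
```

Wrapping each variant of the function in a lambda and timing it with realistic array sizes gives numbers that can be compared directly, for example the serial version against the outermost-loop version at several thread counts.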

kangshiyin