Code of non-overlapping spheres not efficient enough

Question

I am writting a code for eliminating or making zero those spheres defined by coordinates (xcentro, ycentro, zcentro) whish overlap somewhere. The radius of each sphere is in vector r. However, the code is not efficient enough and I need some help to optimize it. The code is the following:

vector<double> xcentro, ycentro, zcentro, r;

 #pragma omp parallel for num_threads(4)
    for(int i=0; i<r.size()-1; i++)
    {
        for(int j=0; j<r.size()-1; j++)
        {
            //d.insert(d.begin() + i, sqrt(pow(xcentro[i] - xcentro[j], 2) + pow(ycentro[i] - ycentro[j], 2) + pow(zcentro[i] - zcentro[j], 2)));
            d[i] = sqrt(pow(xcentro[i] - xcentro[j], 2) + pow(ycentro[i] - ycentro[j], 2) + pow(zcentro[i] - zcentro[j], 2));
            if ((d[i] < (r[i] + r[j])) && (r[i] >= r[j])&&(i!=j))
            {
                //hacer 0 la esfera j-esima
                r[j] = 0.0;
                xcentro[j] = 0.0;
                ycentro[j] = 0.0;
                zcentro[j] = 0.0;
                cout << "a" << endl;
            }
            else if ((d[i] < (r[i] + r[j])) && (r[i] < r[j]))
            {
                //hacer 0 la esfera i-esima
                r[i] = 0.0;
                xcentro[i] = 0.0;
                ycentro[i] = 0.0;
                zcentro[i] = 0.0;
                cout << "b" << endl;
            }
            else
            {
                cout << "c" << endl;
            }
            cout << "i: " << i << "j: " << j << endl;
        }
    }

any help? And how can I delete then the elements whose radius is 0?

try to avoid `pow` and `sqrt`. Both can be expensive as hell. Often you can multiply away both functions. In a `sqrt`ed value used in a comparison, you square the other number. With `pow` and an integer power, you multiply. In this case I you may be stuck with the `sqrt` because you're storing the result, but the `pow`s can all be replaced. — user4581301, Aug 03 '21 at 17:09
@user4581301 yes, I can't get rid of the sqrt, but I can get rid of pows. However, I did the change and it didn't mean any significant change in computional time. — Elena, Aug 03 '21 at 17:14
Compilers are smart. Probably recognized what you were doing and already replaced the function calls. If your timed runs include the printing, `cout << "a" << endl;` and friends, replace the of the `endl`s with `'\n'`s. No need to constantly flush. I th implementation might do it anyway, but one less potential timesink to worry about — user4581301, Aug 03 '21 at 17:18
Without knowing your data it is quite difficult to recommend. For example it could be possible to eliminate some checks based on how far they are by each coordinate. For example if dx or dy or dz is already bigger than sum of radiuses it is not necessary to check further. So you can create indexes based on coordinates and only check spheres that close enough. — Slava, Aug 03 '21 at 17:23
IMHO, you should use constant temporaries to hold frequently used variables or results of expressions. For example, `r[i]` and `r[j]` are frequently used, so place them into constant temporaries. The compiler may do this at higher optimization levels. The objective is to get the compiler to use registers for the temporaries. — Thomas Matthews, Aug 03 '21 at 17:23
Instead of using parallel arrays, use a vector of structs. E.g. `struct Point3d { int x, y, z;}; std::vector database;` This will provide a higher probability that `x[i], y[i]` and `z[i]` are in the same cache line. Reloading the cache is a waste of time. — Thomas Matthews, Aug 03 '21 at 17:26
You may want to use a single `if`, `(d[i] < (r[i] + r[j])` and then inside the `if`, use additional `if`s for the `&&` expression. Thus you only evaluate the expression `(d[i] < (r[i] + r[j])` once. — Thomas Matthews, Aug 03 '21 at 17:29
You could use a `string` variable or a `char` variable to contain the evaluation kind, e.g. 'a', 'b', or 'c'. Then above the line that prints `i` and `j`, print the evaluation kind variable. This may not increase execution efficiency, but will probably decrease your code size. — Thomas Matthews, Aug 03 '21 at 17:33
@user4581301 you are telling me to change endl; by '\n' ?? That really changes the efficience? I'm surprised :o — Elena, Aug 03 '21 at 17:33
Interestingly, you may want to place 0.0 into a variable. On some architectures the compiler would always access 0.0 from the constants area in the code, for each assignment. Using a variable for the 0.0, the compiler would load the constant into a register, then assign the memory locations using the variable (which is a lot faster). This may be applied with higher optimization settings, but you'll have to look at the assembly language to confirm. I got this from viewing the ARM assembly language generated by the IAR compiler in debug mode. — Thomas Matthews, Aug 03 '21 at 17:37
To gain more efficiency, you should redesign your algorithm. Redesign the algorithm in the perspective of *data flow* and not *execution* flow. — Thomas Matthews, Aug 03 '21 at 17:39
If you take the absolute value of `r[i]` and `r[j]` before `if ((d[i] < (r[i] + r[j])) ....` your single comparison and reduce to `if (i != j && d[i] < (r[i] + r[j]))`. You only care whether the sum of the radii are greater than the distance between the spheres in determining whether the spheres intersect. — David C. Rankin, Aug 03 '21 at 17:48
Have you considered to start the inner loop at `j = i + 1` to avoid considering each couple twice? — Bob__, Aug 03 '21 at 17:54
[“std::endl” vs “\n”](https://stackoverflow.com/questions/213907/stdendl-vs-n). TL;DR version: `endl` is a newline AND a stream flush. Flushing a stream [can get expensive](https://stackoverflow.com/a/14107357/4581301). — user4581301, Aug 03 '21 at 18:07
@user4581301 any help with openmp? If I delete that line nothing seems to change on computational time — Elena, Aug 03 '21 at 20:01
Too little familiarity with OpenMP to offer help. Make sure your compiler supports it and if supported, make sure you have it turned on. — user4581301, Aug 03 '21 at 20:18
@ElenaFernandez You have asked several questions recently, but you have never accepted nor upvoted any answers. Please do not forget to do so. — Laci, Aug 04 '21 at 04:50
@Slava I can't calculate dx dy and dz, It's beyond the limits with the data I have — Elena, Aug 04 '21 at 09:03
@ThomasMatthews I disagree with using an array of structs instead of multiple vectors. This would make vectorization impossible. — paleonix, Aug 04 '21 at 11:34
@PaulG. Vectorization is still possible. Each struct instance is a vector of memory (in terms of SIMD instructions). So, a swap becomes memory copies of a vectors. Versus parallel arrays of vectors in which the 3 elements are in random places in memory and SIMD instructions cannot be used (they can be used, but will only involve one element). — Thomas Matthews, Aug 04 '21 at 15:23
@ThomasMatthews I mean "impossible" is hyperbole, but to cite [Wikipedia](en.wikipedia.org/wiki/AoS_and_SoA): "SIMD ISAs are usually designed for homogeneous data [...].". Take for example the distance calculation: Every step (-, *, +, sqrt) can be parallelized with SoA. With AoS you might be able to parallelize the - and * and maybe the +, but definitely not the sqrt. Also: Shouldn't the hardware prefetcher take care of the regular access even in 4 different locations? In the end one has to benchmark both. I just wanted to protest the notion that AoS is trivially better. — paleonix, Aug 04 '21 at 16:21
@ThomasMatthews Also your second sentence is certainly wrong. This is not how SIMD is applied in most cases. — paleonix, Aug 04 '21 at 16:30

Ext3h · Answer 1 · 2021-08-03T20:42:11.213

Still the original O(n²) complexity, but cleaned up and some of the most trivial shortcuts implemented:

#include <vector>

struct Sphere
{
    double x, y, z, r;

    inline constexpr bool overlaps_with(const Sphere& other) const noexcept
    {
        const auto dx = other.x - x;
        const auto dy = other.y - y;
        const auto dz = other.z - z;
        const auto r_sum = other.r + r;
        if(dx > r_sum || dy > r_sum || dz > r_sum) return false;
        const auto d_squared = dx*dx + dy*dy + dz*dz;
        const auto r_squared = r_sum*r_sum;
        return d_squared <= r_squared;
    }
};

// const correctness in the signature is essential!
// otherwise you will disable lots of compiler optimizations around the access to spheres
void cull(const std::vector<Sphere> &spheres, std::vector<bool> &culled)
{
    for(int i=0; i < spheres.size(); ++i)
    {
        const auto& sphere = spheres[i];

        bool any_culled = false;
        // only compare to any other sphere *once*
        int j = i - 1;
        // try to find anything which culls self first
        for(; j > 0; --j)
        {
            // for doing so, probe against everything
            if(spheres[j].overlaps_with(sphere))
            {
                culled[i] = true;
                culled[j] = true;
                any_culled = true;
                break;
            }
        }
        // if the current sphere got culled, still need to check if it culls something else in return
        if(any_culled)
        {
            for(; j > 0; --j)
            {
                // this can be short-circuited now to exclude already culled spheres
                if(!culled[j] && spheres[j].overlaps_with(sphere))
                {
                    culled[j] = true;
                }
            }
        }
    }
}

As for compiler options, -ffast-math should be enabled, in order to allow some essential reordering of floating point operations. And really, really consider twice if you actually need double precision.

ny help with openmp? If I delete that line nothing seems to change on computational time — Elena, Aug 03 '21 at 20:02
@ElenaFernandez If you add OpenMP you will have to deal with synchronization instead. E.g. substitute `bool` by `std::atomic_bool`, which reduces the performance gains from the algorithmic optimizations. — Ext3h, Aug 03 '21 at 20:16
sorry but I don't understand... I have never used OpenMP and I need to use it now to make my code "faster" — Elena, Aug 03 '21 at 20:18
@ElenaFernandez Before you even think about using OpenMP, check if you can pick a better algorithm for your task. Even just avoiding `sqrt` saves more than you can make up with OpenMP. And if you use OpenMP, thread synchronisation is something you have to known how to do first. — Ext3h, Aug 03 '21 at 20:23

score 0 · Answer 2 · answered Aug 03 '21 at 18:37

0

Taking into account some of the answers, the final code would be:


    #pragma omp parallel for num_threads(4)
    for (i = 0; i < r.size() - 1; i++)
    {
        for (j = i+1; j < r.size() - 1; j++)
        {
            //d.insert(d.begin() + i, sqrt(pow(xcentro[i] - xcentro[j], 2) + pow(ycentro[i] - ycentro[j], 2) + pow(zcentro[i] - zcentro[j], 2)));
            d[i] = sqrt((xcentro[i] - xcentro[j]) * (xcentro[i] - xcentro[j]) + (ycentro[i] - ycentro[j]) * (ycentro[i] - ycentro[j]) + (zcentro[i] - zcentro[j]) * (zcentro[i] - zcentro[j]));
            if ((d[i] < (r[i] + r[j])) && (r[i] >= r[j]) && (i != j))
            {
                if ((r[i] >= r[j]) && (i != j))
                {
                    //hacer 0 la esfera j-esima
                    r[j] = 0.0;
                    xcentro[j] = 0.0;
                    ycentro[j] = 0.0;
                    zcentro[j] = 0.0;
                }
                if (r[i] < r[j])
                {
                    //hacer 0 la esfera i-esima
                    r[i] = 0.0;
                    xcentro[i] = 0.0;
                    ycentro[i] = 0.0;
                    zcentro[i] = 0.0;
                }
            }
            else
            {
            }
            //cout << "i: " << i << "j: " << j << "\n";
        }
    }

An important mark is that the couts took tooooooooo much time. Why didn't anyone let me know about it? ;) Now I can run like 10.000 coordinates in 2 minutes. Thanks for the help.

answered Aug 03 '21 at 18:37

Elena

35
5

1

In general, any time you go out of the CPU, reading or writing, it hurts. [Some destinations hurt more than others.](https://blog.codinghorror.com/the-infinite-space-between-words/) – user4581301 Aug 03 '21 at 18:44
What compiler and compiler options and what kind of hardware do you use? – Laci Aug 03 '21 at 18:52
Please remove the `num_threads(4)`. It is extremely rare that you need to hard-wire the number of threads. Doing it is a statement that "I will throw this code away before I change my machine and no-one else will ever use this code." – Jim Cownie Aug 04 '21 at 08:07
The configuration of conditionals in this code makes no sense at all. As your inner loop starts at `i + 1`, you don't have to look for `i != j` at all. Also the outer conditional shouldn't look for `r[i] >= r[j]`, as the second inner conditional will never be reached right now. Also you are still ignoring the last element of the vector. – paleonix Aug 04 '21 at 11:48
Due to the inner loop starting at `i + 1`, the amount of work is different for different `i`. To avoid work imbalance between the threads, you should specify dynamic scheduling in the pragma: `#pragma omp parallel for schedule(dynamic)`. – paleonix Aug 04 '21 at 11:54
@PaulG. I don't understand why the condition r[i] >= r[j] would make that the inner conditional will never be reached – Elena Aug 05 '21 at 07:34
@ElenaFernandez If `r[i] >= r[j]` (outer conditional) then `r[i] < r[j]`(second inner conditional) can never be true. Even when the first inner conditional is entered and `r[j]` is set to zero, `r[i]` would have to be negative to get into the second inner conditional. – paleonix Aug 05 '21 at 09:07

Code of non-overlapping spheres not efficient enough

2 Answers2