
I'm having some trouble optimising my algorithm:

I have a disk (centered at 0, with radius 1) filled with triangles (not necessarily of the same area/edge length). There can be a huge number of triangles (say, from 1k to 300k).

My goal is to find, as quickly as possible, which triangle a point belongs to. The operation has to be repeated a large number of times (around 10k).

For now the algorithm I'm using is: I compute the barycentric coordinates of the point in each triangle. If the first coefficient is between 0 and 1, I continue; if it's not, I stop. Then I compute the second coefficient with the same idea, then the third, and I do this for every triangle.

I can't think of a way to use the fact that I'm working on a disk (and that I have a Euclidean distance to help me "target" the right triangles directly). If I try to compute the distance from my point to every triangle "center":

1) it's already more operations than what I'm doing when brute-forcing with barycentric coordinates;

2) I would have to sort a vector containing the Euclidean distances from all triangles to my point;

3) I have absolutely no guarantee that the closest triangle to my point is the right one.

I feel like I'm missing something, and that I could pre-compute something to help me spot the right "area" before starting the brute-force part.

The algorithm is already parallelised (using OpenMP): I'm calling the function below from a parallel for loop.

bool Triangle2D::is_in_triangle(Vector2d Point) {
    double denominator = (Tri2D(1, 1) - Tri2D(2, 1)) * (Tri2D(0, 0) - Tri2D(2, 0))
                       + (Tri2D(2, 0) - Tri2D(1, 0)) * (Tri2D(0, 1) - Tri2D(2, 1));
    // Computing the first coefficient
    double a = ((Tri2D(1, 1) - Tri2D(2, 1)) * (Point(0) - Tri2D(2, 0))
              + (Tri2D(2, 0) - Tri2D(1, 0)) * (Point(1) - Tri2D(2, 1))) / denominator;
    if (a < 0 || a > 1) {
        return false;
    }
    // Computing the second coefficient
    double b = ((Tri2D(2, 1) - Tri2D(0, 1)) * (Point(0) - Tri2D(2, 0))
              + (Tri2D(0, 0) - Tri2D(2, 0)) * (Point(1) - Tri2D(2, 1))) / denominator;
    if (b < 0 || b > 1) {
        return false;
    }
    // Computing the third coefficient (the three sum to 1)
    double c = 1 - a - b;
    if (c < 0 || c > 1) {
        return false;
    }
    return true;
}

The next step is probably to look at GPU parallelisation, but first I need to make sure that the idea behind the code is good enough.

For now it takes approximately 2 min 30 s for 75k triangles and 10k points, but this isn't fast enough.


Edit:
Triangle2D uses an Eigen matrix to store the coordinates.

  • How are the triangles sorted? Do the triangles know their neighbors? – Jarod42 Jul 08 '19 at 14:59
  • Have you looked at quad trees (or octrees if you move to 3 dimensions)? Even a simple grid to divide the number of checks would probably improve the speed a lot here (create a 2D grid, and store all your triangles in their corresponding cells on that grid). This way you only need to check against the triangles in the grid cell (or cells, if a point happens to be on an edge). – xEric_xD Jul 08 '19 at 15:00
  • @Jarod42 For now, the triangles aren't sorted, and modulo some class editing, they could know their neighbors. – P_etitdoigt Jul 08 '19 at 15:01
  • @xEric_xD If you cut the disk into 4 parts, how would you deal with triangles belonging to two regions? Or worse, for those in the middle, 4 regions... – P_etitdoigt Jul 08 '19 at 15:02
  • https://en.wikipedia.org/wiki/Spatial_database#Spatial_index – Renat Jul 08 '19 at 15:03
  • @P_etitdoigt Add them to all 4; if the triangles don't move, this will only be done once and the performance impact will be minimal. If they do move and you need to update the grid every frame, this will still be much more efficient than brute-forcing the points against all triangles. – xEric_xD Jul 08 '19 at 15:04
  • With known neighbors, you might navigate from triangle to triangle using the target point's direction. – Jarod42 Jul 08 '19 at 15:05
  • Maybe it's just me, but I can't help thinking a diagram would help. – Mark Setchell Jul 08 '19 at 15:07
  • @Renat I didn't know about that, thanks a lot, I'll have a look at it :) I don't really want to use octrees (since that basically just "moves" the same problem), but there might be some other cool stuff on the Spatial indexing wiki that I could take advantage of. – P_etitdoigt Jul 08 '19 at 15:10
  • @Jarod42 This is a good idea that would work with a small number of triangles, but finding the neighbours might be a very complex (calculation-wise) task... I was thinking about pre-computing the polar coordinates of the barycenter of each triangle, computing the polar coordinates of the point, and subtracting them (taking a weighted sum of angle + modulus if needed), then looking for the minimum(s?), but that would imply sorting the array, and I'm not even sure that's faster... – P_etitdoigt Jul 08 '19 at 15:42
  • That's why I asked what information you have from your navigation mesh. Mostly, you would only check triangles along a radius of the circle, but that assumes you have neighbor information in your triangles (or similar). – Jarod42 Jul 08 '19 at 15:59
  • Well, actually there might be a very smart way to do it. I'll try this tomorrow: move the disk so that it's centered at (1,1), precompute the squared Euclidean distance of each vertex, compute the squared Euclidean distance of the point, and subtract them (extracting the closest vertex V, which will be unique thanks to the translation), then only try the triangles that V belongs to, OR the triangles that share a vertex with the triangles given by V. – P_etitdoigt Jul 08 '19 at 16:34
  • Are there gaps or overlaps in the cover? – n. m. could be an AI Jul 08 '19 at 16:45
  • @n.m. nope there aren't – P_etitdoigt Jul 08 '19 at 16:48
  • I recommend a quad tree – n. m. could be an AI Jul 08 '19 at 17:07
  • @P_etitdoigt Would you mind explicitly stating a quantitatively expressed target run-time for the said **10k** points run against **300k** triangles on a somehow deterministically specified code-execution infrastructure? In other words, what time would you a priori consider fast enough for having processed the said scope of points and triangles on some given amount of hardware resources - be it an HPC-centric one or a private cluster with a given amount of CPU/cores/RAM/GPU/GPU-less/FPGA/NUMA distributed-computing nodes? Thx. ***Having no target, any road may lead one forward.*** – user3666197 Jul 09 '19 at 00:38
  • @user3666197 A target runtime for 10k points and 300k triangles would be 2-3 min on a computer with an i5 8500 (3 GHz, 6-core), an NVIDIA Quadro P400 (I have to try GPU computing, not even sure if it's worth it), and 8 GB RAM. Giving quadtrees a try atm :) – P_etitdoigt Jul 09 '19 at 08:08
  • Can we turn the question around? That is, instead of asking "which triangle...", ask "which point, if any, is in _this_ triangle"? – Rick James Aug 17 '19 at 20:51
  • Is it OK to split any long, thin, triangles in two? (I'm envisioning making better use of "bounding boxes" and R-trees.) – Rick James Aug 17 '19 at 20:52
  • @Rick James The question can't be turned around, since the points are not given in advance. You are free to split the triangles as much as you want. – P_etitdoigt Aug 18 '19 at 21:26
  • Can I assume any preprocessing I do of the triangles is "free"? Or must I amortize that cost over an estimated 10K lookups? – Rick James Aug 18 '19 at 22:17
  • You can assume it's free. I doubt it'll be very costly anyway. – P_etitdoigt Aug 19 '19 at 07:51

4 Answers


All long-bearded HPC professionals, kindly permit a somewhat scholastically elaborated approach here. In my honest opinion it may become interesting for those Community Members who feel a bit more junior than you professionally feel yourselves, and who may want a deeper look into performance-motivated code design, performance tweaking, and the other parallel-code risks and benefits that you already know so well and so deeply from your own hard-core HPC-computing experience. Thank you.

a) ALGORITHM (as-is) can get a ~2X speedup as low-hanging fruit, with more surprises yet to come

b) OTHER ALGORITHM may get a ~40-80X speedup boost thanks to the geometry

c) TIPS FOR THE BEST PARALLEL CODE + ULTIMATE PERFORMANCE

GOAL : A target runtime for 10k points in 300k triangles would be 2-3 min on a computer with an i5 8500 (3 GHz, 6-core) and an NVIDIA Quadro P400 (have to try GPU computing, not even sure if it's worth it)


While this may seem like a long journey, the problem is a nice one and deserves a closer look, so please bear with me during this flow of utmost-performance-motivated thinking.

a) ALGORITHM (as-is) ANALYSIS:

The as-is use of the barycentric coordinate system is a nice trick. A straight implementation of it costs a bit more than about (20 FLOPs + 16 MEM/REG-I/O-ops) in the best case and slightly above (30 FLOPs + 30 MEM/REG-I/O-ops) otherwise.

There are a few polishing touches that may reduce these execution costs, by preventing some expensive and even unimportant operations from ever taking place:

--------------------------------------------------------------------------------------
double denominator = ( ( Tri2D( 1, 1 )
                       - Tri2D( 2, 1 ) // -------------------------- 2x MEM + OP-1.SUB
                         ) * ( Tri2D( 0, 0 ) //---------------------        + OP-3.MUL
                             - Tri2D( 2, 0 ) //--------------------- 2x MEM + OP-2.SUB
                               ) + ( Tri2D( 2, 0 ) //---------------        + OP-7.ADD
                                   - Tri2D( 1, 0 ) //--------------- 2x MEM + OP-4.SUB
                                     ) * ( Tri2D( 0, 1 ) //---------        + OP-6.MUL
                                         - Tri2D( 2, 1 ) //--------- 2x MEM + OP-5.SUB
                                           )
                       );
// Computing the first coefficient ------------------------------------------------------------------------------------------------------
double           a = ( ( Tri2D( 1, 1 )
                       - Tri2D( 2, 1 )  //-------------------------- 2x MEM + OP-8.SUB
                         ) * ( Point(0)   //------------------------        + OP-A.MUL
                             - Tri2D( 2, 0 ) //--------------------- 2x MEM + OP-9.SUB
                               ) + ( Tri2D( 2, 0 ) //---------------        + OP-E.ADD
                                   - Tri2D( 1, 0 ) //--------------- 2x MEM + OP-B.SUB
                                     ) * ( Point(1) //--------------        + OP-D.MUL
                                          - Tri2D( 2, 1 ) //--------- 2x MEM + OP-C.SUB
                                           )
                       ) / denominator; //-------------------------- 1x REG + OP-F.DIV //----------- MAY DEFER THE MOST EXPENSIVE DIVISION UNTIL a third coeff is indeed first needed, if ever------------[3]
if (a < 0 || a>1) { // ----------------------------------------------------------------------------- a < 0 ~~    ( sign( a ) * sign( denominator ) ) < 0
    return(false); // ------------------------------------------------------------------------------ a > 1 ~~  ||        a >         denominator
}
// Computing the second coefficient
double           b = ( ( Tri2D( 2, 1 ) - Tri2D( 0, 1 ) ) //--------- 2x MEM + OP-16.SUB
                     * (      Point(0) - Tri2D( 2, 0 ) ) //--------- 2x MEM + OP-17.SUB + OP-18.MUL
                     + ( Tri2D( 0, 0 ) - Tri2D( 2, 0 ) ) //--------- 2x MEM + OP-19.SUB + OP-22.ADD
                     * (      Point(1) - Tri2D( 2, 1 ) ) //--------- 2x MEM + OP-20.SUB + OP-21.MUL
                       ) / denominator; //-------------------------- 1x REG + OP-23.DIV //---------- MAY DEFER THE MOST EXPENSIVE DIVISION UNTIL a third coeff is indeed first needed, if ever -----------[3]
if (b < 0 || b>1) { // ----------------------------------------------------------------------------- b < 0 ~~   ( sign( b ) * sign( denominator ) ) < 0
    return(false); // ------------------------------------------------------------------------------ b > 1 ~~ ||        b >         denominator
}
// Computing the third coefficient
double c = 1 - a - b; // ------------------------------------------- 2x REG + OP-24.SUB + OP-25.SUB
//         1 - ( enumer_of_a + enumer_of_b ) / denominator; //-------------------------------------- MAY DEFER THE MOST EXPENSIVE DIVISION UNTIL HERE, IFF INDEED FIRST NEEDED <---HERE <----------[3]
  • repeated sub-expressions that appear in the original may get explicitly factored out by manual assignment/re-use; yet there is a chance that a good optimising compiler will evict these on its own under the -O3 optimisation flag.

  • do not hesitate to profile even this lowest-hanging fruit, to polish the most expensive parts.



//------------------------------------------------------------------
double Tri2D_11_sub_21 = ( Tri2D( 1, 1 )
                         - Tri2D( 2, 1 )
                           ), //====================================================== 2x MEM + OP-a.SUB (REG re-used 2x)
       Tri2D_20_sub_10 = ( Tri2D( 2, 0 )
                         - Tri2D( 1, 0 )
                           ), //====================================================== 2x MEM + OP-b.SUB (REG re-used 2x)
       Tri2D_00_sub_20 = ( Tri2D( 0, 0 )
                         - Tri2D( 2, 0 )
                           ); //====================================================== 2x MEM + OP-c.SUB (REG re-used 1~2x)
//-----------------------
double denominator = ( ( /*
                         Tri2D( 1, 1 )
                       - Tri2D( 2, 1 ) // -------------------------- 2x MEM + OP-1.SUB (avoided by re-use) */
                         Tri2D_11_sub_21 //=========================================== 1x REG + OP-d.MUL
                         ) * ( /*
                               Tri2D( 0, 0 ) //---------------------        + OP-3.MUL
                             - Tri2D( 2, 0 ) //--------------------- 2x MEM + OP-2.SUB (avoided by re-use) */
                               Tri2D_00_sub_20 //===================================== 1x REG + OP-f.ADD
                               ) + ( /*
                                     Tri2D( 2, 0 ) //---------------        + OP-7.ADD
                                   - Tri2D( 1, 0 ) //--------------- 2x MEM + OP-4.SUB (avoided by re-use) */
                                     Tri2D_20_sub_10 //=============================== 1x REG + OP-e.MUL
                                     ) * ( Tri2D( 0, 1 ) //---------        + OP-6.MUL
                                         - Tri2D( 2, 1 ) //--------- 2x MEM + OP-5.SUB
                                           )
                       );
//\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/
//
// Computing the first coefficient ------------------------------------------------------------------------------------------------------
//
double enumer_of_a = ( ( /*
                         Tri2D( 1, 1 )
                       - Tri2D( 2, 1 )  //-------------------------- 2x MEM + OP-8.SUB (avoided by re-use) */
                         Tri2D_11_sub_21 //=========================================== 1x REG + OP-g.MUL
                         ) * ( Point(0)   //------------------------------------------        + OP-i.MUL
                             - Tri2D( 2, 0 ) //--------------------------------------- 2x MEM + OP-h.SUB
                               ) + ( /*
                                     Tri2D( 2, 0 ) //---------------        + OP-E.ADD
                                   - Tri2D( 1, 0 ) //--------------- 2x MEM + OP-B.SUB (avoided by re-use) */
                                     Tri2D_20_sub_10 //=============================== 1x REG + OP-l.ADD
                                     ) * ( Point(1) //--------------------------------        + OP-k.MUL
                                          - Tri2D( 2, 1 ) //--------------------------- 2x MEM + OP-j.SUB
                                           )
                       );/*denominator; *///------------------------ 1x REG + OP-F.DIV (avoided by DEFERRAL THE MOST EXPENSIVE DIVISION UNTIL a third coeff is indeed first needed, if ever-----------[3]

//~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~J.I.T./non-MISRA-C-RET-->
// TEST CONDITIONS FOR A CHEAPEST EVER J.I.T./non-MISRA-C-RET-->
//
if (  enumer_of_a > denominator                          // in a > 1, THE SIZE DECIDES, the a / denominator > 1, iff enumer_of_a > denominator         a rather expensive .FDIV is avoided at all
   || enumer_of_a * denominator < 0 ) return(false);     // in a < 0, THE SIGN DECIDES, not the VALUE matters, so will use a cheaper .FMUL, instead of a rather expensive .FDIV ~~ ( sign( a ) * sign( denominator ) ) < 0

//\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/
//
// Computing the second coefficient
//
double enumer_of_b = ( ( Tri2D( 2, 1 ) - Tri2D( 0, 1 ) ) //---------------------------------------- 2x MEM + OP-m.SUB
                     * (      Point(0) - Tri2D( 2, 0 ) ) //---------------------------------------- 2x MEM + OP-n.SUB + OP-o.MUL
                     + ( /*
                         Tri2D( 0, 0 ) - Tri2D( 2, 0 )   //--------- 2x MEM + OP-19.SUB + OP-22.ADD (avoided by re-use) */
                         Tri2D_00_sub_20 //======================================================== 1x REG + OP-p.ADD
                         )
                     * (      Point(1) - Tri2D( 2, 1 ) ) //---------------------------------------- 2x MEM + OP-r.SUB + OP-q.MUL
                       );/*denominator; *///------------------------ 1x REG + OP-23.DIV (avoided by DEFERRAL THE MOST EXPENSIVE DIVISION UNTIL a third coeff is indeed first needed, if ever-----------[3]

//~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~J.I.T./non-MISRA-C-RET-->
// TEST CONDITIONS FOR A 2nd CHEAPEST J.I.T./non-MISRA-C-RET-->
//
if (  enumer_of_b > denominator                          // in b > 1, THE SIZE DECIDES, the b / denominator > 1, iff enumer_of_b > denominator         a rather expensive .FDIV is avoided at all
   || enumer_of_b * denominator < 0 ) return(false);     // in b < 0, THE SIGN DECIDES, not the VALUE matters, so will use a cheaper .FMUL, instead of a rather expensive .FDIV ~~ ( sign( b ) * sign( denominator ) ) < 0

//\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/
//
// Computing the third coefficient
//
double c = 1 - ( ( enumer_of_a
                 + enumer_of_b
                   )
                 / denominator
                 ); // --------------------------------------------- 3x REG + OP-s.ADD + OP-t.FDIV + OP-u.SUB <---- THE MOST EXPENSIVE .FDIV IS EXECUTED BUT HERE, IFF INDEED FIRST NEEDED <---HERE <------------[3]

//~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~J.I.T./non-MISRA-C-RET-->
// TEST CONDITIONS FOR PRE-FINAL RET J.I.T./non-MISRA-C-RET-->
//
if (   c < 0 || c > 1 ) return( false );

return( true ); //~~~~~~~~~~~~~~~~~~~ "the-last-resort" RET-->
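
Collected without the annotations, a compact sketch of the same deferred-division test might read as below. It assumes the question's layout (row = vertex, column = x/y); note that the `e > d` size tests match the original semantics only when the denominator is positive, i.e. counter-clockwise vertex order, which is worth verifying against your data:

#include <Eigen/Dense>

// Compact form of the deferred-division test above (a sketch).
// NOTE: the `> d` size tests assume d > 0 (counter-clockwise vertices);
// for the general case compare ( e - d ) * d > 0 instead.
bool is_in_triangle_fast(const Eigen::Matrix<double, 3, 2>& tri,
                         const Eigen::Vector2d& P)
{
    const double x0 = tri(0,0), y0 = tri(0,1),
                 x1 = tri(1,0), y1 = tri(1,1),
                 x2 = tri(2,0), y2 = tri(2,1);

    const double d  = (y1 - y2) * (x0 - x2) + (x2 - x1) * (y0 - y2);

    // numerator of a; the expensive division is deferred
    const double ea = (y1 - y2) * (P(0) - x2) + (x2 - x1) * (P(1) - y2);
    if (ea > d || ea * d < 0) return false;          // a > 1 or a < 0

    // numerator of b; still no division
    const double eb = (y2 - y0) * (P(0) - x2) + (x0 - x2) * (P(1) - y2);
    if (eb > d || eb * d < 0) return false;          // b > 1 or b < 0

    // only here is the single division paid, iff actually needed
    const double c = 1.0 - (ea + eb) / d;
    return c >= 0.0 && c <= 1.0;
}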

b) OTHER APPROACH TO THE ALGORITHM:

Let's review another approach, which seems both faster and way cheaper, due to less data and fewer instructions, and which also promises high density once smart use of vectorised AVX2 (or better, AVX-512) vector instructions gets harnessed per core: an ANIMATED, fully INTERACTIVE explanation, together with an analytical re-formulation of the problem, is here.

A triple test of point-to-line distance (each line being ax + by + c = 0) comes at the cheap cost of ~2 FMA3 instructions plus a sign test per line (and even better with vector 4-in-1-compact AVX2 / 8-in-1 AVX-512 VFMADDs).

It might seem possible to "fast"-decide whether a point makes sense to test against a respective triangle at all, e.g. by statically pre-"framing" each triangle in polar (R, Theta) coordinate space with a precomputed tuple of ( R_min, R_max, Theta_min, Theta_max ) and "fast"-discriminating each point that does not fit inside such a polar segment. Yet the costs of doing this (the data-access-pattern costs plus the costs of these "fast" instructions) grow beyond any instruction paths potentially "saved" by not testing a point found outside a polar segment. Having achieved a performance range of 24 point-in-triangle tests per ~9 CPU instructions across 6 CPU cores @ 3.0+ GHz, the polar-segment "pre-testing" suddenly becomes prohibitively expensive, not to speak of the second-order negative effect of worse cache-hit/cache-miss ratios, given that more data has to be stored and read for the "fast" pre-test (~ +16B per triangle "framing" polar-segment tuple, +8B per point, with the worst impact on the cache hit/miss ratio).

That is clearly not a good direction for any further move, as the performance would decrease, not increase, and increasing it is our global strategy here.

The Intel i5 8500 CPU can use only AVX2, so the even more compact processing of 8 triangles per CPU clock tick per core (AVX-512) is left as a path to roughly 2X higher performance, if ever needed.

TRIPLE-"point-above-line"-TEST per POINT per TRIANGLE:
---------------------------------------------------------------------------------
PRE-COMPUTE STATIC per TRIANGLE CONSTANTS:
                     LINE_1: C0__L1, C1__L1, C2__L1, bool_L1DistanceMustBePOSITIVE
                     LINE_2: C0__L2, C1__L2, C2__L2, bool_L2DistanceMustBePOSITIVE
                     LINE_3: C0__L3, C1__L3, C2__L3, bool_L3DistanceMustBePOSITIVE

TEST per TRIANGLE per POINT (Px,Py) - best executed in an AVX-vectorised fashion
                     LINE_1_______________________________________________________________
                     C0__L1 ( == pre-calc'd CONST = c1 / sqrt( a1^2 + b1^2 ) ) //                       
                Px * C1__L1 ( == pre-calc'd CONST = a1 / sqrt( a1^2 + b1^2 ) ) // OP-1.FMA REG-Px,C1__L1,C0__L1
                Py * C2__L1 ( == pre-calc'd CONST = b1 / sqrt( a1^2 + b1^2 ) ) // OP-2.FMA REG-Py,C2__L1, +
           .GT./.LT. 0                                                         // OP-3.SIG
                     LINE_2_______________________________________________________________
                     C0__L2 ( == pre-calc'd CONST = c2 / sqrt( a2^2 + b2^2 ) ) //                       
                Px * C1__L2 ( == pre-calc'd CONST = a2 / sqrt( a2^2 + b2^2 ) ) // OP-4.FMA REG-Px,C1__L2,C0__L2
                Py * C2__L2 ( == pre-calc'd CONST = b2 / sqrt( a2^2 + b2^2 ) ) // OP-5.FMA REG-Py,C2__L2, +
           .GT./.LT. 0                                                         // OP-6.SIG
                     LINE_3_______________________________________________________________
                     C0__L3 ( == pre-calc'd CONST = c3 / sqrt( a3^2 + b3^2 ) ) //                       
                Px * C1__L3 ( == pre-calc'd CONST = a3 / sqrt( a3^2 + b3^2 ) ) // OP-7.FMA REG-Px,C1__L3,C0__L3
                Py * C2__L3 ( == pre-calc'd CONST = b3 / sqrt( a3^2 + b3^2 ) ) // OP-8.FMA REG-Py,C2__L3, +
           .GT./.LT. 0                                                         // OP-9.SIG

( using AVX-2 intrinsics or inlined assembler will deliver highest performance due to COMPACT 4-in-1 VFMADDs )
                                             ____________________________________________
                                            |  __________________________________________triangle A: C1__L1 
                                            | |  ________________________________________triangle B: C1__L1
                                            | | |  ______________________________________triangle C: C1__L1
                                             | | | | |  ____________________________triangle D: C1__L1
                                            | | | | |
                                            | | | | |      ______________________________
                                            | | | | |     |  ____________________________triangle A: Px
                                            | | | | |     | |  __________________________triangle B: Px
                                            | | | | |     | | |  ________________________triangle C: Px
                                             | | | | |     | | | |  ______________________triangle D: Px
                                            | | | | |     | | | | |
                                            |1|2|3|4|     | | | | |
                                            | | | | |     |1|2|3|4|      ________________
                                            | | | | |     | | | | |     |  ______________triangle A: C0__L1
                                            | | | | |     | | | | |     | |  ____________triangle B: C0__L1
                                            | | | | |     | | | | |     | | |  __________triangle C: C0__L1
                                             | | | | |     | | | | |     | | | |  ________triangle D: C0__L1
                                            | | | | |     | | | | |     | | | | |
                                            |1|2|3|4|     | | | | |     | | | | |
                                            | | | | |     |1|2|3|4|     | | | | |
                                            | | | | |     | | | | |     |1|2|3|4|
  (__m256d)    __builtin_ia32_vfmaddpd256 ( (__v4df )__A, (__v4df )__B, (__v4df )__C ) // per CPU-core @ 3.0 GHz ( for actual uops durations check Agner or Intel CPU documentation )
  can
  perform 4-( point-in-triangle ) PARALLEL-test in just about ~ 9 ASSEMBLY INSTRUCTIONS / per CPU-core @ 3.0 GHz
         24-( point-in-triangle ) PARALLEL-test in just about ~ 9 ASSEMBLY INSTRUCTIONS / per CPU

  using AVX-512 empowered CPU, can use 8-in-1 VFMADDs
  could
  perform 8-( point-in-triangle ) PARALLEL-test in just about ~ 9 ASSEMBLY INSTRUCTIONS / per CPU-core @ 3.0 GHz
         48-( point-in-triangle ) PARALLEL-test in just about ~ 9 ASSEMBLY INSTRUCTIONS / per CPU 
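
For illustration only, a minimal AVX2/FMA3 intrinsics sketch of the 4-in-1 test above might look as follows. The packed layout and names are assumptions, not code from this answer, and the lines are assumed pre-oriented during pre-computation so that a non-negative signed distance always means "inside" (folding the bool_LxDistanceMustBePOSITIVE flags into the signs of the constants):

#include <immintrin.h>   // AVX2 + FMA3: compile with -mavx2 -mfma

// Test one point against four triangles at once (hypothetical layout:
// C0[line], C1[line], C2[line] each pack the pre-computed constants of
// the same line of four triangles, side by side).
int four_point_in_triangle_tests(double px, double py,
                                 const double C0[3][4],
                                 const double C1[3][4],
                                 const double C2[3][4])
{
    const __m256d Px = _mm256_set1_pd(px);
    const __m256d Py = _mm256_set1_pd(py);
    int inside = 0xF;                        // bit i = point still inside tri i
    for (int line = 0; line < 3; ++line) {
        __m256d d = _mm256_loadu_pd(C0[line]);                  //  c / |n|
        d = _mm256_fmadd_pd(Px, _mm256_loadu_pd(C1[line]), d);  // + Px * a / |n|
        d = _mm256_fmadd_pd(Py, _mm256_loadu_pd(C2[line]), d);  // + Py * b / |n|
        // keep only the lanes where the signed distance is >= 0
        inside &= _mm256_movemask_pd(
                      _mm256_cmp_pd(d, _mm256_setzero_pd(), _CMP_GE_OQ));
    }
    return inside;   // a set bit marks the (at most one) containing triangle
}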

c) TIPS FOR THE BEST PARALLEL CODE + ULTIMATE PERFORMANCE:

Step -1: GPU / CUDA costs v/s benefits

If your PhD mentor, professor, boss, or project manager indeed insists that you develop a GPU-computing C++/CUDA code solution for this very problem, the best next step is to ask for a GPU card better suited to such a task than the one you posted.

Your indicated card, the Quadro P400 GPU, has just 2 SMX units (48 KB L1 cache each) and is not quite fit for seriously meant parallel CUDA computing: it has about 30X fewer processing devices to actually do any SIMD-thread-block computing, not to mention its small memory and tiny L1/L2 on-SMX caches. So, after all the CUDA-related design and optimisation efforts, there will be no more than a single (!) pair of GPU-SMX warp32-wide thread-block executions in the SIMD threads, so there is no big circus to be expected from this GP107-based device, while there are 30+X better equipped COTS devices for indeed highly performing parallel processing.

Step 0: pre-compute and pre-arrange data ( MAXIMISE cache-line COHERENCY ):

Here, it makes sense to optimise the macro-level looping of the algorithm, so that you "benefit" most from cache hits (i.e. the best re-use of "fast", already pre-fetched data).

So, one may test whether it is cache-wise faster to rather distribute the work over N concurrent workers that process disjoint working blocks of triangles, each looping over the smallest memory area (all the points, ~ 10k * 2 * 8 B ~ 160 kB for double coordinates) before moving on to the next triangle in the working block. Ensuring row-first array alignment in memory is vital (FORTRAN folks can tell a lot about the costs/benefits of just this trick for HPC-fast, compact/vectorised vector algebra and matrix processing).
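
A minimal sketch of that loop ordering, under the stated assumptions (TriangleConstants, point_in_triangle and the result layout are hypothetical names, not code from the question; compile with -fopenmp):

#include <cstddef>
#include <vector>

struct Pt { double x, y; };                       // a query point
struct TriangleConstants { double c[9]; };        // pre-computed per triangle
bool point_in_triangle(const TriangleConstants&, const Pt&); // any test above

// Each worker owns a disjoint block of triangles and streams the small
// point array through the cache, so the per-triangle constants stay hot
// in registers / L1 for all ~10k points.
void classify_points(const std::vector<TriangleConstants>& tri_const,
                     const std::vector<Pt>& points,
                     std::vector<long>& result)   // owning-triangle index
{
    #pragma omp parallel for schedule(static)
    for (long t = 0; t < (long)tri_const.size(); ++t) {
        const TriangleConstants& tc = tri_const[t];  // fetched once per tri
        for (std::size_t p = 0; p < points.size(); ++p)
            if (point_in_triangle(tc, points[p]))
                result[p] = t;   // safe without atomics: triangles do not
                                 // overlap, so exactly one t matches each p
    }
}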

The benefit?

The cached coefficients will be re-used ~10k times (at a cost of ~0.5-1.0 [ns] each, instead of re-fetching costs of ~100-300 [ns] if they had to be re-read from RAM). The difference, about ~200X-600X, is worth the effort of aligning the workflow to the data-access patterns and cache-line resources.

Results are atomic: any point will belong to one and only one (non-overlapping) triangle.

This simplifies the a priori non-colliding and relatively sparse writes into a result vector, where any detected point-inside-triangle can freely report the index of the triangle found.

Using this result vector to potentially avoid re-testing points that have already been matched (index is non-negative) is not very efficient, as the costs of re-reading such an indication and of re-arranging the compact alignment of the 4-in-1 or 8-in-1 point-in-triangle tests would grow beyond the potential "savings" of not re-testing an already mapped point.

So the 10k points over a block of 300k triangles may yield a workflow of:
a split of some 300k / 6 cores
~ 50k triangles / 1 core
~ 50k * 10 k point-in-triangle tests per core
~ 500M point-in-triangle tests per core @ 3.0+ GHz
~ 125M AVX-2 vector-4-in-1-compact test-executions per-core
~ 125M tests x 10 uops-instructions @ 3.0 GHz ...
which is 1.25G uops @ 3.0+ GHz ... second(s)? Yes, there is a pretty strong motivation to go down this road, towards this ultimate performance, and to direct the further work this way.

SO, HERE WE ARE:

The principally achievable HPC target is in the range of a few seconds for 300+k triangles and 10+k points on just a 6-core, AVX2-capable i5 8500 @ 3.0+ GHz.

Worth an effort, isn't it?

user3666197
  • My knowledge in terms of cache optimisation and stuff like vectorised AVX2 or AVX-512 vector instructions is literally non-existent, so first things first, huge thanks for bringing that to me. The hardest part of programming is not understanding things, it's knowing they exist... That being said, are you sure the sign-test function you designed fits the problem well? If a point belongs to 2 "overlapping" half-planes, or to 0, the sign would be 1 in either case, right? Anyway, I get the idea; fixing that shouldn't be too hard. – P_etitdoigt Jul 09 '19 at 18:36
  • @P_etitdoigt Always welcome. The point-to-line distance is signed - see the URL-ed animated tools with a complete analytical explanation. Due to the minimisation of the Distance( P(x,y), 0 = ax + by + c ) calculations, we cannot use directed vectors, like in b-rep { 2D | 3D-solid } modelling, yet the < or > signs have to be in the plane, and bool_LxDistanceMustBePOSITIVE just mediates this choice. Singularities are a special case (and IEEE-number precision will be a principal source of unavoidable imprecision and resulting uncertainty). The last of more "observers" of a point-in-triangle wins. – user3666197 Jul 09 '19 at 20:09
  • @P_etitdoigt For the signs, read **Equation 5** in the **Animated Analysis** (at https://www.desmos.com/calculator/adido1achw) and you will see the product (with all its side effects), as the Desmos tool provides no way of expressing logical operators. Simply said, all cases of Distance( P(x,y), L{1|2|3} ) have to be positive (using bool_LxDistanceMustBePOSITIVE to match the easy constants usage). Fewer constants means faster processing, and bool flags are cheap in the end :o) – user3666197 Jul 09 '19 at 20:19
  • Will try that tomorrow & compare with the rtree! Note: I'm not that good at English, since it's not my mother tongue (sorry for stupid mistakes è_é), nor that good at programming (just a math student tinkering). – P_etitdoigt Jul 09 '19 at 20:33

I think you could make an array of boundaries for each triangle: its top, bottom, right and left extremes. Then compare your point to these boundaries, and only if it falls within them, check whether it really is within the triangle. That way, the 99.9% case doesn't involve double floating-point multiplies and a number of additions - just comparisons. The computationally expensive operations are only undertaken if the point is within the rectilinear extremes of the triangle.
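
A minimal sketch of that pre-test, assuming the question's 3x2 vertex layout (names are mine):

#include <Eigen/Dense>

// One axis-aligned bounding box per triangle, computed once up front.
struct BBox { double xmin, xmax, ymin, ymax; };

// `tri` rows are vertices, columns x/y, as in the question's Tri2D.
BBox bbox_of(const Eigen::Matrix<double, 3, 2>& tri) {
    BBox b;
    b.xmin = tri.col(0).minCoeff();  b.xmax = tri.col(0).maxCoeff();
    b.ymin = tri.col(1).minCoeff();  b.ymax = tri.col(1).maxCoeff();
    return b;
}

// The cheap rejection: four comparisons, no multiplies.
inline bool maybe_inside(const BBox& b, const Eigen::Vector2d& p) {
    return p(0) >= b.xmin && p(0) <= b.xmax &&
           p(1) >= b.ymin && p(1) <= b.ymax;
}

Only when maybe_inside(...) returns true would you pay for the full barycentric test from the question.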

This could be sped up further by, e.g., sorting the triangles on their topmost extreme and using a binary search: find the last triangle whose topmost extreme is still below your point, then check only the triangles after it in the sorted order. That way, you'd only have to check just over half as many. If there were an upper bound on the height of the triangles' extremes, you could check far fewer still. Note that this latter strategy will make your source code more complex, so it is a case of deciding how much effort you wish to put into the optimization for how much result. The first part seems like it would be fairly easy and would help a lot; the sorted list is more effort for only almost halving your operations. I'd see whether the first strategy is enough for you first.

CodeLurker
  • "Then, compare your point to these boundaries." But wouldnt that be as complex as what i'm doing ? In my program computing the first coefficient kills like 99% of the triangles in a very few operations ... – P_etitdoigt Jul 08 '19 at 15:21
  • No, because comparing to the boundaries is computationally cheaper than calculating the first coefficient. While you'll have more compares than calculations of the first coefficient, a simple compare is less than half as computationally expensive as that big long formula. – CodeLurker Jul 08 '19 at 16:13

Quad trees are nice if you have to do lots of single queries, but if you've got 10k to do all at once, there is an algorithm purpose-built for this: the sweep line. I'm getting query times of less than 5 seconds on similar data in a single thread.

We're going to take a vertical line and sweep it from left to right across your 2D plane, keeping track of which triangles that line is intersecting at any point in time. So, whenever your line crosses one of your query points as it sweeps, you only ever have to check the triangles your vertical line is currently overlapping. No other triangle can contain that point.

I kind of hate the name "sweepline" in this case because it gives the mental image of smoothly traversing the plane, which it doesn't do. It just jumps to the next position of interest in order from left to right.

Depending on the ratio of triangles to queries, you could also put the overlapping triangles in a segment tree or interval tree by their y-coordinates, but this was actually slower for my data (or my implementation was bad; also possible). Definitely worth a try though, especially if you can put off balancing the tree until it's needed.

The setup time is just 3 sorts:

  • Get a vector of pointers to your triangles sorted by the minimum x-value of the tri
  • Get a vector of pointers to your triangles sorted by the maximum x-value of the tri
  • Get a vector of pointers to your points sorted by their x-value

Then we sweep (a C++ sketch follows the list):

  • Initialize the "current positions" for each of your 3 sorted lists to 0
  • Look for the lowest x-value at the current position in your 3 sorted lists. This will be the next x-position of your sweepline
    • If it's the leftmost point of a triangle, then add that triangle to your "overlapping" vector
    • If it's a query point, then check that point against all the currently overlapping tris
    • If it's the rightmost point of a triangle, then remove it from the overlapping vector.
  • Increment the current position for that list
  • Loop until you're out of points, or out of tris.
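
A single-threaded sketch of this sweep (hypothetical names throughout; contains() stands in for any exact point-in-triangle test, such as the barycentric one from the question):

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Triangle i spans [tri_xmin[i], tri_xmax[i]] on x; point j sits at pt_x[j].
bool contains(std::size_t tri, std::size_t pt);   // exact test (assumed)

void sweep_line(const std::vector<double>& tri_xmin,
                const std::vector<double>& tri_xmax,
                const std::vector<double>& pt_x,
                std::vector<int>& owner)          // result: -1 = none found
{
    auto sorted_idx = [](const std::vector<double>& v) {
        std::vector<std::size_t> idx(v.size());
        for (std::size_t i = 0; i < idx.size(); ++i) idx[i] = i;
        std::sort(idx.begin(), idx.end(),
                  [&v](std::size_t a, std::size_t b) { return v[a] < v[b]; });
        return idx;
    };
    const auto by_min = sorted_idx(tri_xmin);     // sort 1: by leftmost x
    const auto by_max = sorted_idx(tri_xmax);     // sort 2: by rightmost x
    const auto by_pt  = sorted_idx(pt_x);         // sort 3: points by x

    std::vector<std::size_t> active;              // tris crossing the line
    std::size_t a = 0, b = 0, p = 0;
    while (p < by_pt.size()) {                    // done when points are done
        const double xa = a < by_min.size() ? tri_xmin[by_min[a]] : HUGE_VAL;
        const double xb = b < by_max.size() ? tri_xmax[by_max[b]] : HUGE_VAL;
        const double xp = pt_x[by_pt[p]];
        if (xa <= xp && xa <= xb) {               // next event: a tri enters
            active.push_back(by_min[a++]);
        } else if (xp <= xb) {                    // next event: a query point
            owner[by_pt[p]] = -1;
            for (std::size_t t : active)
                if (contains(t, by_pt[p])) { owner[by_pt[p]] = (int)t; break; }
            ++p;
        } else {                                  // next event: a tri leaves
            active.erase(std::find(active.begin(), active.end(), by_max[b++]));
        }
    }
}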

Why is this better than a quadtree? Because we're always keeping track of the tris currently intersecting our sweep line. This means we're re-using data between queries, and quadtrees don't really do that. Quadtrees will have much better single-query performance, but this kind of bulk lookup is what the sweep line is made for.

Tyler Fox

Using Boost's quadratic rtree to find the closest triangles did the job perfectly. The algorithm now runs in less than a minute for 75k triangles and 100k points (10 times more!).

I ended up building the tree by putting every triangle in a bounding box, then looking for the k nearest neighbours of each point and testing the corresponding triangles :) I was expecting way more trouble going into a new domain like spatial databases, but Boost really is an insane library.
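
A minimal sketch of this approach with Boost.Geometry's rtree (names are mine; triangle_box and is_in_triangle stand in for the bounding-box builder and the exact barycentric test from the question, and k = 8 is just a guess to tune to the mesh):

#include <cstddef>
#include <iterator>
#include <utility>
#include <vector>
#include <boost/geometry.hpp>
#include <boost/geometry/index/rtree.hpp>

namespace bg  = boost::geometry;
namespace bgi = boost::geometry::index;

using BPoint = bg::model::point<double, 2, bg::cs::cartesian>;
using Box    = bg::model::box<BPoint>;
using Value  = std::pair<Box, std::size_t>;       // box + triangle index
using RTree  = bgi::rtree<Value, bgi::quadratic<16>>;

Box  triangle_box(std::size_t i);                 // AABB of triangle i (assumed)
bool is_in_triangle(std::size_t i, const BPoint& p);  // exact test (assumed)

RTree build_tree(std::size_t n_triangles) {
    std::vector<Value> values;
    values.reserve(n_triangles);
    for (std::size_t i = 0; i < n_triangles; ++i)
        values.emplace_back(triangle_box(i), i);
    return RTree(values.begin(), values.end());   // bulk (packing) load
}

// k nearest boxes first, exact re-test second; -1 if no candidate matches.
int find_triangle(const RTree& tree, const BPoint& p, unsigned k = 8) {
    std::vector<Value> hits;
    tree.query(bgi::nearest(p, k), std::back_inserter(hits));
    for (const Value& v : hits)
        if (is_in_triangle(v.second, p))
            return (int)v.second;
    return -1;
}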

  • That seems to follow the outlines of my answer: check the bounding boxes of the triangles, and use a sorted list to find them. Of course, I didn't mention (or know of) the spatial database in Boost. – CodeLurker Jul 10 '19 at 01:38
  • @CodeLurker It does (and all credit to you & those who recommended quadtrees). I just want to test the approach given by user3666197 before I close this topic :) – P_etitdoigt Jul 10 '19 at 07:17