
I am using OpenCV to compute the distances from a given point to each point in a vector of points.

e.g.

#include <opencv2/core.hpp>
#include <vector>

std::vector<cv::Point2f> pointPositions;
cv::Point2f center;

for (const auto& pt : pointPositions) {
    auto d = cv::norm(center - pt);  // Euclidean distance from center to pt
}

What are some more efficient, alternative ways to do this for a "sufficiently large" data set? What would be a suitable multi-threaded approach to computing the distances that scales well across multiple processing cores (close to a 4x improvement when using 4 cores instead of 1)?

P.S. I can avoid computing the square root and just keep the squared distance (distance^2).
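
To make the intent concrete, here is a rough, unbenchmarked sketch of the kind of thing I am after: squared distances computed in parallel with OpenCV's own cv::parallel_for_. The helper name squaredDistances is made up for this example, and the lambda overload of cv::parallel_for_ assumes OpenCV 3.2 or newer.

#include <opencv2/core.hpp>
#include <vector>

// Sketch only: squared Euclidean distance from `center` to every point,
// computed in parallel over sub-ranges of the input.
std::vector<float> squaredDistances(const std::vector<cv::Point2f>& pts,
                                    const cv::Point2f& center)
{
    std::vector<float> d2(pts.size());
    cv::parallel_for_(cv::Range(0, static_cast<int>(pts.size())),
                      [&](const cv::Range& r)
    {
        for (int i = r.start; i < r.end; ++i)
        {
            const float dx = pts[i].x - center.x;
            const float dy = pts[i].y - center.y;
            d2[i] = dx * dx + dy * dy;   // keep distance^2, skip the sqrt
        }
    });
    return d2;
}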

MarcoM
  • What's wrong with a for loop? Have you profiled and seen that it's somehow too slow? – Miki Mar 01 '16 at 16:43
  • Don't think there is a faster way to do this than to iterate over the whole vector and compute the distances. Unless you group the points first if they share a coordinate, but that may cause unnecessary overhead if no coordinates are shared. – Nacho Mar 01 '16 at 16:44
  • It appears that a simple for loop is too slow for me. Maybe some optimized OpenCV function can be used (for instance, one leveraging parallelization) – MarcoM Mar 01 '16 at 16:53
  • Then please provide a [mcve], the typical `pointPositions` size, the execution time you have, and the execution time you need – Miki Mar 01 '16 at 17:06
  • do you need all distances, or only the n nearest neighbors, or something else? – Micka Mar 01 '16 at 20:04
  • if you need all distances, try multi-threading and vectorization (SSE instructions) and/or IPP. – Micka Mar 01 '16 at 20:05
  • After some testing, using a random set of 500 million points: 2.621 secs. for computing distances using the simple method ...{ d[i] = sqrt( (a.x - b.x) * (a.x - b.x) + (a.y - b.y) * (a.y - b.y) ); }, 3.403 secs. using cv::norm ...{ d[i] = cv::norm(a - pts[i]); }, and 0.734 secs. using the simple method with OpenMP (seems too good to be true, will investigate further). Will report back after some more benchmarking and SSE/AVX optimisations. Do you need a specific guaranteed precision? If the OpenMP/SIMD computation is not fast enough (unlikely), a look-up of a precomputed table may be an option. – Saharsh Bishnoi Mar 01 '16 at 20:49 (a rough sketch of such an OpenMP loop appears after these comments)
  • WIP: http://pastebin.com/FTvvUGA3 – Saharsh Bishnoi Mar 01 '16 at 20:57
  • @SaharshBishnoi Turn this into an answer for better visualization. Well done! – Berriel Mar 01 '16 at 22:35
  • BTW, I'm still convinced that this question is ill posed, since OP probably has at most a few hundred points, and execution time will be negligible – Miki Mar 01 '16 at 22:39
  • Agreed, OP's question is lacking the expected/required execution time for a given data set (i.e. an MCVE, as it's easy to get carried away if no expectations are set). Worth investigating a bit more thoroughly before answering, as I often have this 'problem' (images and point clouds) and a faster method than a simple for loop (which may or may not be auto-vectorised by your favourite compiler) might be worth investigating. Currently can't figure out why precision is worse than -1e6 even when using /fp:precise. Hoping it's a bug in the example code. – Saharsh Bishnoi Mar 01 '16 at 23:53
  • @OP Due to CPU prefetching, optimisations for hundreds and even thousands of points are not usually worth it. Only when you have millions of points or more do SIMD and parallel optimisations (including GP-GPU) become suitable. – Saharsh Bishnoi Mar 02 '16 at 00:06
  • @miki, please note I'm not just watching but trying to figure out why my code appears to be slow (1 million points in 0.5 seconds). I didn't provide a requirement for the execution time because I wanted to treat this as a general problem: I was expecting to use a general, vector-based way to perform an operation between all the elements of a vector and a scalar. Apparently this is not the case, as also confirmed in other questions posted here: http://stackoverflow.com/questions/8077982/fastest-way-to-compute-distance-squared?rq=1 – MarcoM Mar 02 '16 at 07:06
  • Update WIP: http://pastebin.com/dYTcHtGP Currently achieving ~7 GFlops out of a theoretical maximum of ~70 GFlops (see http://stackoverflow.com/questions/8389648). Without AVX optimizations I would expect a minimum of (70/2 * 0.3 [wiggle room]) = 10.5 GFlops. I suspect trying to keep all cores fed with valid cache lines is going to be tricky due to the asymmetry of the problem. – Saharsh Bishnoi Mar 08 '16 at 19:18
  • For example, if we assume that a single distance calculation between 2 points can be carried out in 6 floating point operations, then for a CPU with 4 flops per cycle we should be able to achieve roughly (6 / (4*2)) 75% efficiency using AVX instructions instead of a measly 10%. I suspect memory overhead plays a big part in both scenarios here. Currently, using OpenMP seems to be the most efficient method so far. The next step is going to be writing AVX optimizations to perform the set of 6 float ops on vectors for "each stage" of the calculation (a simplified SSE sketch of the same idea appears after these comments). – Saharsh Bishnoi Mar 08 '16 at 19:18
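
As a reference for the OpenMP timing mentioned in the comments above, a minimal sketch of such a parallel squared-distance loop might look like the following. This is not the benchmark code from the pastebin links; the name squaredDistancesOmp is illustrative, and it assumes a compiler with OpenMP enabled (e.g. -fopenmp or /openmp).

#include <opencv2/core.hpp>
#include <cstddef>
#include <vector>

// Straightforward OpenMP parallel loop over all points; the plain
// arithmetic on .x/.y also gives the compiler a chance to auto-vectorise
// the inner loop.
std::vector<float> squaredDistancesOmp(const std::vector<cv::Point2f>& pts,
                                       const cv::Point2f& center)
{
    std::vector<float> d2(pts.size());
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(pts.size());
    #pragma omp parallel for
    for (std::ptrdiff_t i = 0; i < n; ++i)
    {
        const float dx = pts[i].x - center.x;
        const float dy = pts[i].y - center.y;
        d2[i] = dx * dx + dy * dy;
    }
    return d2;
}

A loop like this is embarrassingly parallel, so scaling towards 4x on 4 cores is limited mainly by memory bandwidth, which is consistent with the comment that memory overhead plays a big part.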

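A simplified SIMD sketch of the idea discussed in the last two comments, using SSE3 rather than AVX to keep it short (compile with SSE3 enabled, e.g. -msse3). It relies on std::vector<cv::Point2f> storing the coordinates as an interleaved x,y,x,y,... float array; the function name and structure are illustrative and not taken from the pastebin code.

#include <opencv2/core.hpp>
#include <pmmintrin.h>   // SSE3, for _mm_hadd_ps
#include <cstddef>
#include <vector>

// Squared distances, processing 4 points per iteration.
std::vector<float> squaredDistancesSse(const std::vector<cv::Point2f>& pts,
                                       const cv::Point2f& center)
{
    std::vector<float> d2(pts.size());
    const std::size_t n = pts.size();
    // Register holds {center.x, center.y, center.x, center.y}, matching the
    // interleaved point layout in memory.
    const __m128 c = _mm_set_ps(center.y, center.x, center.y, center.x);
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)
    {
        __m128 p01 = _mm_loadu_ps(&pts[i].x);         // x0 y0 x1 y1
        __m128 p23 = _mm_loadu_ps(&pts[i + 2].x);     // x2 y2 x3 y3
        __m128 d01 = _mm_sub_ps(p01, c);
        __m128 d23 = _mm_sub_ps(p23, c);
        d01 = _mm_mul_ps(d01, d01);                   // dx^2, dy^2 pairs
        d23 = _mm_mul_ps(d23, d23);
        _mm_storeu_ps(&d2[i], _mm_hadd_ps(d01, d23)); // dx^2 + dy^2 per point
    }
    for (; i < n; ++i)                                // scalar tail
    {
        const float dx = pts[i].x - center.x;
        const float dy = pts[i].y - center.y;
        d2[i] = dx * dx + dy * dy;
    }
    return d2;
}
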
0 Answers