
I have a really bad efficiency problem. I need to find the furthest distance between points in a set, and my first "brute force" algorithm takes almost 80 seconds; I need it to happen within 1 second. The worst-case scenario is moving the calculations into background processes and multithreading them, but even then it needs to be faster. So here's my first ever Stack Overflow question.

The data I have is 39,000 sets of coordinates; each set contains about 200 (x, y) coordinates, and I'm looking for the furthest distance within each set.

The data points are represented by x and y, and I'm computing the distance between them using `Math.Sqrt(deltaX * deltaX + deltaY * deltaY)`.

The datapoints can be in any order.

My brute force attempt looks like this

public static double getAbsoluteMax(IEnumerable<DataPoint> dataPoints)
{
    double maxDistance = 0;

    foreach (DataPoint dp1 in dataPoints)
    {
        foreach (DataPoint dp2 in dataPoints)
        {
            double deltaX = dp1.x - dp2.x;
            double deltaY = dp1.y - dp2.y;
            double distance = Math.Sqrt(deltaX * deltaX + deltaY * deltaY);
            if (distance > maxDistance)
            {
                maxDistance = distance;
            }
        }
    }
    return maxDistance;
}

I am calling this function with 200 values each time, 39,000 times.

My first thought was Perl's Memoize module, which caches the result of any function call and looks it up when the function is called again with the same parameters. Maybe creating a lookup table of computed distances could help?

Maybe I could move the calculations to MATLAB or something similar?

The application is .NET 4.5 and the calculations are in a .NET 4.5 DLL.

vitaut
Alex
  • One problem is that you're calculating the distance for each pair of points twice. You ought to be able to halve your execution time by fixing that. Also, devise a way to avoid comparing any point with itself. – phoog Sep 23 '15 at 18:18
  • I realize I'm comparing each point twice, but if I split them up into 2 "bunches", I could miss the case where the furthest distance is between a point in the first bunch and one in the second. – Alex Sep 23 '15 at 18:19
  • 6
    Also you don't need to call `Math.Sqrt` in the loop, you can do it for the `maxDistance` only. – Dmitry Sep 23 '15 at 18:19
  • @Alex The solution is not to split them into bunches, as you note. What kind of collection are they in? If they're in arrays or lists, you can index them explicitly to avoid the duplication. – phoog Sep 23 '15 at 18:20
  • Thank you, Dmitry, can't believe I missed that one :) – Alex Sep 23 '15 at 18:21
  • @phoog they come from a List that gets split into IEnumerables – Alex Sep 23 '15 at 18:23
  • @Dmitry, moving sqrt got me down to 30 seconds from 80 =D – Alex Sep 23 '15 at 18:25
  • @obchardon not sure I understand, do you have a link or something I could google describing what kind of 'classical' algorithm I missed in school? – Alex Sep 23 '15 at 18:29
  • @Alex: You tagged this question with `matlab`, is this intended? Where is the relation to matlab? – Daniel Sep 23 '15 at 18:31
  • @Daniel Just wondered that myself then I read the 2nd last line. "Maybe I could move the calculations to **matlab** or something similar ?" – IKavanagh Sep 23 '15 at 18:32
  • Yeah, since it's mostly calculations, I wondered if putting them in a matlab functions would help, but it has been pointed out that it wouldn't do much good – Alex Sep 23 '15 at 18:34
  • 2
    maybe something like http://www.seas.gwu.edu/~simhaweb/alg/lectures/module1/module1.html – John Boker Sep 23 '15 at 19:07
  • Where do these 39,000 sets of points come from? There might be something you could exploit in the data to make your algo faster. – Jeff Walker Code Ranger Sep 23 '15 at 19:43
  • @Alex since they're in lists you can do, roughly, `for (int i = 0; i < list.Count - 1; i++) for (int j = i + 1; j < list.Count; j++) { LoopBody(list[i], list[j]); }` The convex hull bit is probably faster, though, since this is still quadratic time; it's just halved. – phoog Sep 23 '15 at 19:48
  • Welcome to SO. Multithreading is not a worst-case scenario; it is actually a fairly easy thing to do. Can you clarify how many iterations were in that 80 sec benchmark? – Brett Caswell Sep 24 '15 at 15:54

5 Answers


Find the convex hull in O(n log n); it should drastically reduce your set. Then find the hull's diameter, which takes about 2n steps with rotating calipers. Basic computational geometry should give you a lot of performance.
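
To make the convex-hull idea concrete, here is a rough C# sketch (not drop-in code: it assumes a `DataPoint` type with public `x`/`y` doubles, as in the question) of Andrew's monotone chain, one standard O(n log n) hull algorithm, followed by a brute-force pass over the hull vertices only:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class HullDiameter
{
    // Cross product of (b - a) x (c - a); positive for a counter-clockwise turn.
    static double Cross(DataPoint a, DataPoint b, DataPoint c)
    {
        return (b.x - a.x) * (c.y - a.y) - (b.y - a.y) * (c.x - a.x);
    }

    // Andrew's monotone chain, O(n log n): returns hull vertices in CCW order.
    public static List<DataPoint> ConvexHull(List<DataPoint> points)
    {
        var p = points.OrderBy(q => q.x).ThenBy(q => q.y).ToList();
        if (p.Count <= 2) return p;

        var hull = new List<DataPoint>();
        foreach (var pt in p)                   // lower hull
        {
            while (hull.Count >= 2 && Cross(hull[hull.Count - 2], hull[hull.Count - 1], pt) <= 0)
                hull.RemoveAt(hull.Count - 1);
            hull.Add(pt);
        }
        int lowerCount = hull.Count + 1;
        for (int i = p.Count - 2; i >= 0; i--)  // upper hull
        {
            while (hull.Count >= lowerCount && Cross(hull[hull.Count - 2], hull[hull.Count - 1], p[i]) <= 0)
                hull.RemoveAt(hull.Count - 1);
            hull.Add(p[i]);
        }
        hull.RemoveAt(hull.Count - 1);          // last vertex repeats the first
        return hull;
    }

    // The diameter of the set equals the diameter of its hull,
    // so the pairwise check runs over hull vertices only.
    public static double GetAbsoluteMax(List<DataPoint> points)
    {
        var h = ConvexHull(points);
        double maxSq = 0;
        for (int i = 0; i < h.Count - 1; i++)
            for (int j = i + 1; j < h.Count; j++)
            {
                double dx = h[i].x - h[j].x, dy = h[i].y - h[j].y;
                double dSq = dx * dx + dy * dy;
                if (dSq > maxSq) maxSq = dSq;
            }
        return Math.Sqrt(maxSq);                // one sqrt per set
    }
}
```

For ~200 random points the hull typically has only a couple of dozen vertices, so the final O(h²) pass is cheap; rotating calipers would cut it to O(h) but is easier to get wrong.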

Additionally, the function-call overhead of 39,000 calls is expensive, and all those short-lived data sets kill the garbage collector (78,000 enumerators is a lot of allocation). Things to try:

  • Create a reusable array of, say, 200 values instead of allocating new collections each time.
  • Use a `for` loop with an int index instead of `foreach`.
  • Drop the `Math.Sqrt` and return the squared value; it saves a few million square roots.
  • Maybe you can use ints only, no doubles.
  • Use multiple threads/cores, SSE instructions, or offload to the GPU.

Compile with a release build and code optimizations enabled.

For example, this runs in about 2.5 seconds:

Random r = new Random(100);
double[] x = new double[200];
double[] y = new double[200];
double maxD = 0;
Stopwatch stopwatch = Stopwatch.StartNew();
for (int i = 0; i < 39000; i++)
{
    for (int j = 0; j < 200; j++)
    {
        x[j] = r.Next();
        y[j] = r.Next();
    }
    for (int j = 0; j < 200; j++)
    {
        for (int k = j + 1; k < 200; k++)
        {
            double dx = x[j] - x[k];
            dx = dx * dx;
            double dy = y[j] - y[k];
            dy = dy * dy;
            double d = dx + dy;
            // this is slow (80 secs):
            //double d = Math.Pow(x[j] - x[k], 2) + Math.Pow(y[j] - y[k], 2);
            if (maxD < d) maxD = d; // maxD holds the squared distance; take one sqrt at the end if needed
        }
    }
}
Console.WriteLine($"{stopwatch.ElapsedMilliseconds}");

The Math.Pow call (from http://referencesource.microsoft.com/#mscorlib/system/math.cs):

  [System.Security.SecuritySafeCritical]  // auto-generated
  [ResourceExposure(ResourceScope.None)]
  [MethodImplAttribute(MethodImplOptions.InternalCall)]
  public static extern double Pow(double x, double y);
Wouter
  • You need the hull to reduce the number of calculations. (Your comment is basically repeating the question, i.e. finding the diameter of all point sets) – Wouter Sep 23 '15 at 18:52
  • 3
    Don't use `DateTime.Now` to calculate execution time. Use `StopWatch` instead. http://stackoverflow.com/questions/28637/is-datetime-now-the-best-way-to-measure-a-functions-performance – juharr Sep 23 '15 at 19:29
  • I don't know why you're making the assertion to use `int`, the `DataPoint` class properties are `double` https://msdn.microsoft.com/en-us/library/system.windows.forms.datavisualization.charting.datapoint(v=vs.110).aspx .. also, why are you including the generation of 400 elements * 39000 iterations in the stopwatch time? – Brett Caswell Sep 24 '15 at 13:59
  • also, I've noticed this with pretty much all the samples so far, but why bother with the declaration, assignment, and usage of `dx` and `dy`.. in particular, dx * dx is simply dx^2.. `Math.Pow(x[j] - x[k], 2) + Math.Pow(y[j] - y[k], 2)`.. – Brett Caswell Sep 24 '15 at 14:43
  • i didn't check the datapoint type but double is 64 bit and int 32 bit so could be faster... measuring the generation could be left out, but then all samples are the same... about Math.Pow: it is a lot of function overhead (it degrades the performance to about 80 secs) – Wouter Sep 24 '15 at 15:00
  • I'm sure that's incorrect.. It should improve performance.. the 80 secs you're seeing is probably because you are now correctly using `double`, which means you're now correctly doing double-precision arithmetic (as the question intended), which is quite a bit slower. – Brett Caswell Sep 24 '15 at 15:10
  • 1
    i already changed the coordinates to doubles. math.pow has no floating overload – Wouter Sep 24 '15 at 15:20
  • it seems the only way we can improve speed on this kind of arithmetic is to vectorize it and send it to the GPU.. which makes me curious in regards to MATLAB now.. – Brett Caswell Sep 24 '15 at 15:23
  • Matlab isn't really that much faster... someone else tested it already... function calls are expensive... I would expect the compiler to inline Math.Pow but it doesn't – Wouter Sep 24 '15 at 15:27
  • oh, right.. I see.. his sample was just a single iteration of 200.. 0.01s.. (I thought that was ms) ... that's quite a bit slower than I thought.. – Brett Caswell Sep 24 '15 at 15:31

For any point set: if two points a and b lie at a distance larger than or equal to that of every other pair of points, they are part of the convex hull.

Knowing this, you can calculate the convex hull first, then apply your approach only to the points on the convex hull.

An example implementation in MATLAB:

Example data:

points=rand(2000,2);

Simple solution just using pdist

distance_matrix=squareform(pdist(points));
[distance,index]=max(distance_matrix(:));
[a,b]=ind2sub(size(distance_matrix),index);
fprintf('the points with index %d and %d have the largest distance\n',a,b)

"intelligent" way using convex hull

c=convhull(points(:,1),points(:,2));
%don't need the duplicated last/first entry which loops
c=c(1:end-1);
distance_matrix=squareform(pdist(points(c,:)));
[distance,index]=max(distance_matrix(:));
[a,b]=ind2sub(size(distance_matrix),index);
fprintf('the points with index %d and %d have the largest distance\n',c(a),c(b))

Both solutions run in less than 0.01s for 200 random points.

Daniel
public static double GetAbsoluteMax(List<DataPoint> list) {
    double max = 0;
    // calc dist for each distinct pair Dist(P_1, P_2) == Dist(P_2, P_1)
    for(var i=0; i<list.Count-1; i++) {
        for(var j=i+1; j<list.Count; j++) {
            var dX = list[i].X - list[j].X;
            var dY = list[i].Y - list[j].Y;
            // don't calculate the Square Root yet
            var dist = dX * dX + dY * dY;
            if(dist > max) {
                max = dist;
            }
        }
    }
    return Math.Sqrt(max);
}

ideone http://ideone.com/KSfm0X

The best you can do is avoid repeating distance calculations and put off the square root until the end.

EDIT Here's a bonus multithreaded/parallel version (I used ThreadPool so it would run on IDEONE.COM, but using tasks would probably be more appropriate if your target is .NET 4.5).

using System;
using System.Linq;
using System.Collections.Generic;
using System.Threading;
public class Test
{
    public class DataPoint {
        public double X {get; set;}
        public double Y {get; set;}
    }
    public static void Main()
    {
        int count = 39;
        // generate our points
        List<DataPoint>[] sets = new List<DataPoint>[count];
        double[] result = new double[count];
        for(var i=0; i<sets.Length; i++) { 
            sets[i] = GetRandomDataPoints(200); 
        }
        // run our calculations async
        int running = count;
        using(ManualResetEvent resetEvent = new ManualResetEvent(false)){
            for(int i=0; i<count; i++) {
                int idx = i;
                ThreadPool.QueueUserWorkItem(
                    new WaitCallback(o => {
                        result[idx] = GetAbsoluteMax(sets[idx]);
                        if (Interlocked.Decrement(ref running) == 0) {
                            resetEvent.Set();
                        }
                    }),
                    null
                );
            }
            resetEvent.WaitOne();
        }
        foreach(var max in result) {
            Console.WriteLine(max);
        }
    }
    public static double GetAbsoluteMax(List<DataPoint> list) {
        double max = 0;
        // calc dist for each distinct pair Dist(P_1, P_2) == Dist(P_2, P_1)
        for(var i=0; i<list.Count-1; i++) {
            for(var j=i+1; j<list.Count; j++) {
                var dX = list[i].X - list[j].X;
                var dY = list[i].Y - list[j].Y;
                // don't calculate the Square Root yet
                var dist = dX * dX + dY * dY;
                if(dist > max) {
                    max = dist;
                }
            }
        }
        return Math.Sqrt(max);
    }
    public static List<DataPoint> GetRandomDataPoints(int size) {
        var result = new List<DataPoint>();
        var rnd = new Random();
        for(var i=0; i<size; i++) {
            result.Add(new DataPoint() { 
                X=rnd.NextDouble()*100, 
                Y=rnd.NextDouble()*100 
                });
        }
        return result;
    }
}
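
Since the target is .NET 4.5, the same fan-out could also be sketched with `Parallel.For` from `System.Threading.Tasks`, which handles the partitioning and the waiting for you (a sketch only, reusing the `sets` and `GetAbsoluteMax` names from the sample above):

```csharp
using System.Threading.Tasks;

// Replaces the ThreadPool/ManualResetEvent plumbing in Main:
// Parallel.For partitions the indices across cores and blocks until all are done.
double[] result = new double[sets.Length];
Parallel.For(0, sets.Length, i =>
{
    // Each index is written by exactly one iteration, so no locking is needed.
    result[i] = GetAbsoluteMax(sets[i]);
});
```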
Louis Ricci

Here's how you can make sure you don't compare any pair of data points twice while working directly with the IEnumerable (no indexing required). It also saves the `Math.Sqrt` for the end.

public static double getAbsoluteMax(IEnumerable<DataPoint> dataPoints)
{
    double maxDistanceSquared = 0;

    var previousPoints = new List<DataPoint> { dataPoints.FirstOrDefault() };

    foreach (DataPoint dp1 in dataPoints.Skip(1))
    {
        foreach (DataPoint dp2 in previousPoints)
        {
            double deltaX = dp1.x - dp2.x;
            double deltaY = dp1.y - dp2.y;
            double distanceSquared = (deltaX * deltaX) + (deltaY * deltaY);
            if (distanceSquared > maxDistanceSquared)
            {
                maxDistanceSquared = distanceSquared;
            }
        }

        previousPoints.Add(dp1);
    }

    return Math.Sqrt(maxDistanceSquared);
}

Basically, it puts the first point into a list and then iterates over the remaining points. For each point it iterates over the list and does the distance check, then adds that point to the list. So each point is only compared against the points that came before it.

juharr
  • Thank you, I am actually down to 15 seconds, I will mark you as the answer unless someone can improve on this =) I started to study LINQ, that's why I tagged linq.. I thought there was a LINQ solution to this.. – Alex Sep 23 '15 at 18:49
  • It sounds like using the convex hull algorithm to reduce the number of points to compare will help. – juharr Sep 23 '15 at 18:58
  • Dear downvoter, care to explain? Granted, this might not get to the 1 second the OP asked for. My main point was to at least show how to avoid the duplicated distance calculations when dealing with an `IEnumerable`. – juharr Sep 23 '15 at 19:09
  • @juharr the `Skip(1)` bit looked suspicious to me; I had to look at it a little more carefully to see that it does in fact work. Perhaps the downvote was the result of an over-hasty (and faulty) evaluation of your algorithm. – phoog Sep 23 '15 at 19:44
  • @phoog I think someone down voted all the answers that show how to avoid calculating the distance of two points twice at the same time. Or at least that's how it looked at the time. – juharr Sep 23 '15 at 19:49

It's still brute force, but I got it down from 12 seconds to 6 by:

1) Pre-processing the list into separate x & y arrays

2) Caching the point in the outer loop

    public static double GetAbsoluteMax(double[] xs, double[] ys)
    {
        double maxDistanceSqd = 0;

        int nPts = xs.Length;

        for ( int i =0; i < nPts ; ++i )
        {
            double x1 = xs[i];
            double y1 = ys[i];

            for (int j = 0; j < nPts; ++j)
            {
                double deltaX = x1 - xs[j];
                double deltaY = y1 - ys[j];
                double distsqd = (deltaX * deltaX + deltaY * deltaY);
                if (distsqd > maxDistanceSqd)
                {
                    maxDistanceSqd = distsqd;
                }
            }
        }
        return Math.Sqrt(maxDistanceSqd);
    }
Chaz