
I initially implemented the Hoey-Shamos algorithm, but it is too complex for future maintainability (I have no say in this), and it wasn't reporting correctly, so an optimized brute-force algorithm is what I'm going to use.

My question is: How can I optimize this code to be usable?

As it stands, my code contains a nested for loop that iterates over the same list twice.

EDIT: Turned lines into a HashSet and used two foreach loops... shaved about 45 seconds off scanning 10,000 records. It's still not enough.

foreach (Line2D g in lines)
{
    foreach (Line2D h in lines)
    {
        if (g.intersectsLine(h))
        {
            return false;
        }
    }
}

If I force my "intersectsLine()" method to return false (for testing purposes) it still takes 8 seconds to scan 10,000 records (I have 700,000 records). This is too long, so I need to optimize this piece of code.

I tried removing lines from the List after each had been compared to all the other lines, but there's an accuracy issue (no idea why) and the speed increase is barely noticeable.

Here is my intersectsLine method. I found an alternate approach here, but it looks like it'd be slower with all the method calls and whatnot. Calculating the slope doesn't seem like it'd take much computing (correct me if I'm wrong?)

public bool intersectsLine(Line2D comparedLine)
{
    //tweakLine(comparedLine);
    if (this.Equals(comparedLine) ||
        P2.Equals(comparedLine.P1) ||
        P1.Equals(comparedLine.P2))
    {
        return false;
    }

    double firstLineSlopeX = X2 - X1;
    double firstLineSlopeY = Y2 - Y1;

    double secondLineSlopeX = comparedLine.X2 - comparedLine.X1;
    double secondLineSlopeY = comparedLine.Y2 - comparedLine.Y1;

    double s = (-firstLineSlopeY * (X1 - comparedLine.X1) + firstLineSlopeX * (Y1 - comparedLine.Y1)) /
               (-secondLineSlopeX * firstLineSlopeY + firstLineSlopeX * secondLineSlopeY);
    double t = (secondLineSlopeX * (Y1 - comparedLine.Y1) - secondLineSlopeY * (X1 - comparedLine.X1)) /
               (-secondLineSlopeX * firstLineSlopeY + firstLineSlopeX * secondLineSlopeY);

    if (s >= 0 && s <= 1 && t >= 0 && t <= 1)
    {
        //console.WriteLine("s = {0}, t = {1}", s, t);
        //console.WriteLine("this: " + this);
        //console.WriteLine("other: " + comparedLine);
        return true;
    }

    return false; // No collision
}
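For comparison, the "alternate approach" style of test can be written without any divisions, using only cross products. This is a sketch with illustrative names (`Pt`, `SegTest`), not the question's `Line2D` API; note it only reports *proper* crossings, so collinear overlaps and segments that merely touch at an endpoint return false:

```csharp
using System;

// Sketch only: a division-free variant of the parametric test above.
struct Pt { public double X, Y; public Pt(double x, double y) { X = x; Y = y; } }

static class SegTest
{
    // Cross product of (b - a) and (c - a); its sign gives the turn direction.
    static double Cross(Pt a, Pt b, Pt c) =>
        (b.X - a.X) * (c.Y - a.Y) - (b.Y - a.Y) * (c.X - a.X);

    // True if segments p1-p2 and p3-p4 properly cross each other.
    public static bool SegmentsIntersect(Pt p1, Pt p2, Pt p3, Pt p4)
    {
        double d1 = Cross(p3, p4, p1);   // which side of p3-p4 is p1 on?
        double d2 = Cross(p3, p4, p2);   // ... and p2?
        double d3 = Cross(p1, p2, p3);   // which side of p1-p2 is p3 on?
        double d4 = Cross(p1, p2, p4);   // ... and p4?
        // A proper crossing needs strictly opposite signs in both pairs.
        return ((d1 > 0 && d2 < 0) || (d1 < 0 && d2 > 0)) &&
               ((d3 > 0 && d4 < 0) || (d3 < 0 && d4 > 0));
    }
}
```

Because adjacent polygon edges share a vertex, one of the cross products is zero there, so this version skips neighbours without the explicit `Equals` checks at the top of `intersectsLine`.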

EDIT: The major bottleneck is the big polygons! I excluded polygons with more than 100 lines, and it ran all 700,000+ polygons in 5:10, which is in the acceptable range! Surely there's a way to see whether a polygon is worth calculating before running this code? I have access to the XMin, XMax, YMin, and YMax properties if that helps?
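On that XMin/XMax/YMin/YMax idea: a cheap axis-aligned bounding-box test can reject a pair of segments (or polygons) before the exact intersection math runs, since segments whose boxes don't overlap can't possibly cross. A sketch; the method name and raw-coordinate signature are illustrative, with the question's `Line2D` you'd feed in each segment's X1/Y1/X2/Y2:

```csharp
using System;

static class BoxTest
{
    // Sketch: quick bounding-box rejection before the exact test.
    // Returns true when the two boxes (a and b) overlap or touch.
    public static bool BoxesOverlap(double ax1, double ay1, double ax2, double ay2,
                                    double bx1, double by1, double bx2, double by2)
    {
        return Math.Min(ax1, ax2) <= Math.Max(bx1, bx2) &&
               Math.Max(ax1, ax2) >= Math.Min(bx1, bx2) &&
               Math.Min(ay1, ay2) <= Math.Max(by1, by2) &&
               Math.Max(ay1, ay2) >= Math.Min(by1, by2);
    }
}
```

If this returns false, `intersectsLine` can be skipped for that pair; it's eight comparisons against the divisions in the exact test.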

Ran another test, limiting polygons to under 1000 lines each. It took a little over an hour.

I removed all limiting of polygons, and it's been running for 36 hours now... still no results.

A couple ideas I have:

-When I generate my lines HashSet, keep another HashSet/List of candidates for intersection, adding lines to it only if there's a chance they intersect. But how do I eliminate/add possibilities? If only three lines could possibly intersect with the others, I'd be comparing 4,000 lines against 3 instead of against 4,000. This alone could make a HUGE difference.

-If the same point occurs twice, other than the first and last point, skip the nested for loop entirely (the polygon already self-intersects).
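That second idea can be sketched as a single O(n) pass with a HashSet over the vertex list. Value tuples stand in for the question's point type here; the real code would need `Point2D` to implement `Equals`/`GetHashCode`:

```csharp
using System.Collections.Generic;

static class RingCheck
{
    // Sketch: detect a repeated vertex in O(n) with a HashSet.
    // The last vertex is skipped because a closed ring repeats the
    // first point at the end; any other repeat means the ring
    // touches itself, so the O(n^2) scan can be skipped entirely.
    public static bool HasRepeatedVertex(IList<(double X, double Y)> ring)
    {
        var seen = new HashSet<(double X, double Y)>();
        for (int i = 0; i < ring.Count - 1; i++)   // skip the closing vertex
            if (!seen.Add(ring[i]))                // Add returns false on duplicate
                return true;
        return false;
    }
}
```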

Edit:


Information about the polygons: 700,000 total

There are over four thousand polygons with 1,000 or more points

There are 2 polygons with 70,000 or more points


I think it's possible to get this down to fifteen or so minutes with a bit of creativity.

RBarryYoung
Evan Parsons
  • As this is a programing site, there are some issues that relate to what data type you plan to use: FP or integer? – chux - Reinstate Monica Aug 22 '13 at 18:45
  • Not quite sure what you mean? I have a function within the object that returns true/false. The function will return false if the polygon has self-intersecting lines. – Evan Parsons Aug 22 '13 at 19:01
  • Are the points of your polygons an integer or floating-point type? – chux - Reinstate Monica Aug 22 '13 at 19:54
  • They are doubles and have several decimal places. It's ArcGIS data. – Evan Parsons Aug 23 '13 at 11:51
  • 1
    Not a complete answer, but you could use QuadTrees (http://en.wikipedia.org/wiki/Quadtree). This nice article http://blogs.msdn.com/b/kaelr/archive/2009/05/21/priorityquadtree.aspx has a cool `PriorityQuadtree` implementation. – Simon Mourier Sep 30 '13 at 20:15
  • Is there any particular part of that that could help me? From what I read, it looks like that's for drawing shapes on a screen. @SimonMourier – Evan Parsons Oct 01 '13 at 12:33
  • Quadtrees are mostly used to detect intersections. Check the source for methods such as `HasItemsIntersecting`, `GetItemsIntersecting`, `GetItemsInside`, etc. – Simon Mourier Oct 01 '13 at 13:08
  • I assume I'd have to make some modifications, I'll do some reading, thanks. – Evan Parsons Oct 01 '13 at 13:18
  • 1
    Can you supply a sample/test dataset? – RBarryYoung Oct 01 '13 at 20:02
  • Let me see if I understand the constraints correctly here. You cannot use the best algorithm (Hoey-Shamos) because it's too complicated. You want to use Brute-Force, but it's O(n^2) and realistically, no amount of mere code optimization is going to make 490,000,000,000 comparisons tractable. So what you really want is an algorithm+implementation that is faster than brute-force but simple enough to pass off as an "*Optimized Brute-Force*". Is that about right? – RBarryYoung Oct 01 '13 at 20:07
  • Yes, you are 100% correct. I have no control over the requirements unfortunately. – Evan Parsons Oct 02 '13 at 11:33
  • Have you considered using well-known, highly optimized implementations (IsSimple function) like the GDAL lib (has good, fast SWIG C# bindings) or even PostGIS (without storage, just using the SQL/MM Spatial engine)? – SalientBrain Oct 02 '13 at 13:44
  • Not sure about PostGIS since none of my processing is doing any SQL. Which one of those mentioned has an isSimple function? I don't see anything on the GDal Lib website. – Evan Parsons Oct 02 '13 at 14:22
  • Well I have something that I think might work, but I need some sample data that I can test against so that I can make sure that it works correctly. – RBarryYoung Oct 03 '13 at 01:07
  • I can share output for this dummy data I created in ArcMap, but that's about it unfortunately :( http://sharetext.org/b0uL – Evan Parsons Oct 03 '13 at 12:00
  • OK, those are some strangely shaped polygons. If the bigger real polygons look like these, I am not sure that my optimization will help much. What do these polygons represent and why does it matter if they cross themselves? – RBarryYoung Oct 03 '13 at 15:34
  • The code at the top is what I use to separate the rings. (It's ArcGIS data). The polygons are important. I'm scanning property data. I'm about to update my post with some more info. – Evan Parsons Oct 03 '13 at 15:54
  • Well, for one thing, you appear to be doing every possible comparison twice, and are comparing each line to itself as well. Cleaning those problems up should cut the runtime roughly in half. – Jon Oct 03 '13 at 16:39
  • OK, good. Property plots are sufficiently well-behaved that my idea should perform well enough. – RBarryYoung Oct 03 '13 at 21:27

3 Answers


Your current brute-force algorithm is O(n^2). For just your two 70,000-line polygons, that's on the order of 10 billion operations, to say nothing of the 700,000 other polygons. Obviously, no amount of mere code optimization is going to be enough, so you need some kind of algorithmic optimization that can lower that O(n^2) without being unduly complicated.

The O(n^2) comes from the nested loops in the brute-force algorithm, each bounded by n, making it O(n*n). The simplest way to improve this is to reduce the inner loop so that it is not bound by or dependent on n. So what we need is some way to order or re-order the inner list of lines so that only part of the full list needs to be scanned against each outer line.

The approach that I am taking takes advantage of the fact that if two line segments intersect, then the ranges of their X values must overlap. Mind you, this doesn't mean that they do intersect, but if their X ranges don't overlap, then they cannot be intersecting, so there's no need to check them against each other. (This is true of the Y value ranges also, but you can only leverage one dimension at a time.)

The advantage of this approach is that these X ranges can be used to order the endpoints of the lines which can in turn be used as the starting and stopping points for which lines to check against for intersection.

So specifically what we do is to define a custom class (endpointEntry) that represents the High or Low X values of the line's two points. These endpoints are all put into the same List structure and then sorted based on their X values.

Then we implement an outer loop where we scan the entire list just as in the brute-force algorithm. However, our inner loop is considerably different. Instead of re-scanning the entire list for lines to check for intersection, we start scanning down the sorted endpoint list from the high-X endpoint of the outer loop's line and stop when we pass below that same line's low-X endpoint. In this way, we only check the lines whose range of X values overlaps the outer loop's line.

OK, here's a C# class demonstrating my approach:

class CheckPolygon2
{
    // internal supporting classes
    class endpointEntry
    {
        public double XValue;
        public bool isHi;
        public Line2D line;
        public double hi;
        public double lo;
        public endpointEntry fLink;
        public endpointEntry bLink;
    }
    class endpointSorter : IComparer<endpointEntry>
    {
        public int Compare(endpointEntry c1, endpointEntry c2)
        {
            // sort values on XValue, descending
            if (c1.XValue > c2.XValue) { return -1; }
            else if (c1.XValue < c2.XValue) { return 1; }
            else // must be equal, make sure hi's sort before lo's
                if (c1.isHi && !c2.isHi) { return -1; }
                else if (!c1.isHi && c2.isHi) { return 1; }
                else { return 0; }
        }
    }

    public bool CheckForCrossing(List<Line2D> lines)
    {
        List<endpointEntry> pts = new List<endpointEntry>(2 * lines.Count);

        // Make endpoint objects from the lines so that we can sort all of the
        // lines endpoints.
        foreach (Line2D lin in lines)
        {
            // make the endpoint objects for this line
            endpointEntry hi, lo;
            // hi/lo always hold this line's high and low X values
            // (only the hi entries' lo field is used in the scan below)
            if (lin.P1.X < lin.P2.X)
            {
                hi = new endpointEntry() { XValue = lin.P2.X, isHi = true, line = lin, hi = lin.P2.X, lo = lin.P1.X };
                lo = new endpointEntry() { XValue = lin.P1.X, isHi = false, line = lin, hi = lin.P2.X, lo = lin.P1.X };
            }
            else
            {
                hi = new endpointEntry() { XValue = lin.P1.X, isHi = true, line = lin, hi = lin.P1.X, lo = lin.P2.X };
                lo = new endpointEntry() { XValue = lin.P2.X, isHi = false, line = lin, hi = lin.P1.X, lo = lin.P2.X };
            }
            // add them to the sort-list
            pts.Add(hi);
            pts.Add(lo);
        }

        // sort the list
        pts.Sort(new endpointSorter());

        // set each endpoint's forward and backward links
        endpointEntry prev = null;
        foreach (endpointEntry pt in pts)
        {
            if (prev != null)
            {
                pt.bLink = prev;
                prev.fLink = pt;
            }
            prev = pt;
        }

        // NOW, we are ready to look for intersecting lines
        foreach (endpointEntry pt in pts)
        {
            // for every Hi endpoint ...
            if (pt.isHi)
            {
                // check every other line whose X-range is either wholly 
                // contained within our own, or that overlaps the high 
                // part of ours.  The other two cases of overlap (overlaps 
                // our low end, or wholly contains us) is covered by hi 
                // points above that scan down to check us.

                // scan down over each lo-endpoint below us, checking its
                // line for intersection, until we pass our own lo-X value
                for (endpointEntry pLo = pt.fLink; (pLo != null) && (pLo.XValue >= pt.lo); pLo = pLo.fLink)
                {
                    // is this a lo-endpoint?
                    if (!pLo.isHi)
                    {
                        // check its line for intersection
                        if (pt.line.intersectsLine(pLo.line))
                            return true;
                    }
                }
            }
        }

        return false;
    }
}

I am not certain what the true execution complexity of this algorithm is, but I suspect that for most non-pathological polygons it will be close to O(n*SQRT(n)) which should be fast enough.


Explanation of the Inner Loop logic:

The inner loop simply scans the endpoint list in the same sorted order as the outer loop, but it starts scanning from where the outer loop currently is in the list (which is the hi-X point of some line), and only scans until the endpoint values drop below that same line's corresponding lo-X value.

What's tricky here is that this cannot be done with an enumerator (the foreach(..in pts) of the outer loop), because there's no way to enumerate a sublist of a list, nor to start an enumeration at another enumeration's current position. So instead what I did was use the forward and backward link (fLink and bLink) properties to make a doubly-linked list structure that retains the sorted order of the list but that I can incrementally scan without enumerating the list:

for (endpointEntry pLo = pt.fLink; (pLo != null) && (pLo.XValue >= pt.lo); pLo = pLo.fLink)

Breaking this down, the old-style for loop specifier has three parts: initialization, condition, and increment-decrement. The initialization expression, endpointEntry pLo = pt.fLink; initializes pLo with the forward Link of the current point in the list. That is, the next point in the list, in descending sorted order.

Then the body of the inner loop gets executed. Then the increment-decrement pLo = pLo.fLink gets applied, which simply sets the inner loop's current point (pLo) to the next lower point using its forward link (pLo.fLink), thus advancing the loop.

Finally, the loop condition (pLo != null) && (pLo.XValue >= pt.lo) is tested: the loop continues so long as the new point isn't null (which would mean we had reached the end of the list) and so long as its XValue is still greater than or equal to the low X value of the outer loop's current point. This second condition ensures that the inner loop only looks at lines that overlap the line of the outer loop's endpoint.


What is clearer to me now is that I probably could have gotten around this whole fLink/bLink clumsiness by treating the endPoints List as an array instead:

  1. Fill up the list with endPoints
  2. Sort the List by XValue
  3. Outer Loop through the list in descending order, using an index iterator (i), instead of a foreach enumerator
  4. Inner Loop through the list, using a second iterator (j), starting at i and ending when it passes below pt.lo.

That I think would be much simpler. I can post a simplified version like that, if you want.
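For what it's worth, here is a hedged, self-contained sketch of that array-index variant. To keep it runnable on its own it uses plain (X1, Y1, X2, Y2) tuples instead of the question's `Line2D`, and a stand-in cross-product test (`ProperlyIntersect`) in place of `intersectsLine`; the sweep structure (sort by descending X, hi entries scanning forward until they pass their own low X) mirrors the linked-list version above:

```csharp
using System;
using System.Collections.Generic;

static class Sweep
{
    // Cross product of (b - a) and (c - a).
    static double Cross(double ax, double ay, double bx, double by,
                        double cx, double cy) =>
        (bx - ax) * (cy - ay) - (by - ay) * (cx - ax);

    // Stand-in for intersectsLine: true only for a proper crossing, so
    // neighbouring edges that share a vertex are not reported.
    static bool ProperlyIntersect((double X1, double Y1, double X2, double Y2) a,
                                  (double X1, double Y1, double X2, double Y2) b)
    {
        double d1 = Cross(b.X1, b.Y1, b.X2, b.Y2, a.X1, a.Y1);
        double d2 = Cross(b.X1, b.Y1, b.X2, b.Y2, a.X2, a.Y2);
        double d3 = Cross(a.X1, a.Y1, a.X2, a.Y2, b.X1, b.Y1);
        double d4 = Cross(a.X1, a.Y1, a.X2, a.Y2, b.X2, b.Y2);
        return d1 * d2 < 0 && d3 * d4 < 0;
    }

    public static bool AnyCrossing(List<(double X1, double Y1, double X2, double Y2)> segs)
    {
        // One entry per segment endpoint: its X, whether it is the hi end,
        // the segment it belongs to, and that segment's low X.
        var pts = new List<(double X, bool IsHi, int Seg, double Lo)>(2 * segs.Count);
        for (int s = 0; s < segs.Count; s++)
        {
            double hi = Math.Max(segs[s].X1, segs[s].X2);
            double lo = Math.Min(segs[s].X1, segs[s].X2);
            pts.Add((hi, true, s, lo));
            pts.Add((lo, false, s, lo));
        }
        // Descending X; hi entries sort before lo entries on ties.
        pts.Sort((a, b) => a.X != b.X ? b.X.CompareTo(a.X)
                                      : b.IsHi.CompareTo(a.IsHi));

        for (int i = 0; i < pts.Count; i++)
        {
            if (!pts[i].IsHi) continue;
            // Inner scan: index j runs forward from i until the sorted X
            // values drop below this segment's low X.
            for (int j = i + 1; j < pts.Count && pts[j].X >= pts[i].Lo; j++)
                if (!pts[j].IsHi &&
                    ProperlyIntersect(segs[pts[i].Seg], segs[pts[j].Seg]))
                    return true;
        }
        return false;
    }
}
```

A segment paired with itself yields zero cross products and is rejected automatically, so no explicit self-check is needed.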

RBarryYoung
  • It takes a total of 15 minutes to run all 700,000 polygons! That's extremely good. I just need to check if it doesn't miss out on any. – Evan Parsons Oct 04 '13 at 16:39
  • @EvanParsons How did the testing go? Did this algorithm work out for you? – RBarryYoung Oct 05 '13 at 22:22
  • I'm 99% sure it's doing what I want. The only issue is that I don't understand it completely yet (although I have a fairly good idea what it does), I'm going to dedicate some time tomorrow and make sure I understand it. I made some adjustments to my lineSegment intersection method and a few other areas and got the time down to 8:47. This is basically as fast as my original implementation of the Hoey-Shamos algorithm. Thanks a bunch! – Evan Parsons Oct 05 '13 at 23:56
  • Okay, so I think I generally understand what's going on. But what's going on in the inner loop? I've never seen a for loop like that before. – Evan Parsons Oct 09 '13 at 14:42
  • @EvanParsons I have added an explanation for this to the end of my Answer. – RBarryYoung Oct 09 '13 at 20:59

There are 2 things to check:

  1. A closed polygon consists of a cyclic sequence of points
    • If the same point appears in this sequence more than once, it is a self-intersecting polygon.
    • Beware that the first and last point can be the same (it depends on your polygon representation); in that case the point must appear more than twice.
  2. Check all non-neighbouring lines of your polygon for intersection
    • Non-neighbouring lines do not share any point.
    • If there is an intersection, the polygon is self-intersecting.
Spektre
  • No... I am using C++ and in most cases (of my work) third-party libs are not an option for many reasons. BTW intersection detection in my programs is fast enough. I think your comment is better suited to the question, not the answer. BTW this is a follow-up answer to a duplicate question; for clarity take a look here http://stackoverflow.com/questions/18512815/implementing-hoey-shamos-algorithm-with-c-sharp?lq=1 – Spektre Oct 02 '13 at 12:38
  • Yep, haste. I've moved comment – SalientBrain Oct 02 '13 at 13:47

Simple optimisation that should halve the time for polygons that do not self-intersect:

int count = lines.Count();
for (int l1idx = 0;       l1idx < count-1; l1idx++) 
for (int l2idx = l1idx+1; l2idx < count;   l2idx++)
{
  Line2D g = lines[l1idx];
  Line2D h = lines[l2idx];
  if (g.intersectsLine(h))
  {
    return false;
  }
}
weston
  • Do not use Count() on the List. It is a LINQ extension method that creates an Enumerator and calls MoveNext, incrementing a counter for each item. For 700,000 items it will execute counter++ 700,000 times. List<> has a Count property and everyone should use it. That is a simple value; the Add(T) method uses Count to insert a new item right at the end of the List. It is also better to use an array for value types to avoid boxing/unboxing. – Gleb Sevruk Jan 24 '15 at 04:28
  • 2
    @GlebSevruk You're incorrect about count, the linq extension returns the property value if it is an ICollection. See http://stackoverflow.com/a/7969468. You're also wrong about the generic boxing logic, that would be true for java but not c#. It is rare you would want to use an array http://stackoverflow.com/a/434765 – weston Jan 24 '15 at 08:01
    Thanks for pointing that out. I still use 3.5; I realize that "This was only added in .NET 4". Now I get the point with boxing: it will occur only in the old non-generic .NET 2 List, since ints are stored in an array of objects and must be cast back and forth. The only issue left with List for storing plain values is the dynamic resize when you add a new element. Initially a List is created with a buffer for 8 or 16 (`int[16]`). After it reaches this limit, a new array, doubled in size, is allocated (`int[32]`) and the data is moved around in memory. I see little awareness of `new List(SIZE)` among developers – Gleb Sevruk Jan 24 '15 at 21:49