6

I have a table containing about 500 points and am looking for duplicates within a tolerance. This takes less than a second and gives me 500 rows. Most have a distance of zero because it gives the same point (PointA = PointB)

DECLARE @TOL AS REAL
SET @TOL = 0.05

SELECT 
    PointA.ObjectId as ObjectIDa,
    PointA.Name as PTNameA,
    PointA.[Description] as PTdescA,
    PointB.ObjectId as ObjectIDb,
    PointB.Name as PTNameB,
    PointB.[Description] as PTdescB,
    ROUND(PointA.Geometry.STDistance(PointB.Geometry),3) DIST
FROM CadData.Survey.SurveyPoint PointA
  JOIN [CadData].Survey.SurveyPoint PointB
    ON PointA.Geometry.STDistance(PointB.Geometry) < @TOL
   -- AND
   -- PointA.ObjectId <> PointB.ObjectID
ORDER BY ObjectIDa

If I use the commented out lines near the bottom, I get 14 rows but the execution time goes up to 14 seconds. Not that big a deal until my point table expands to 10's of thousands.

I apologize in advance if the answer is already out there. I did look, but being new I get lost reading posts which are way over my head.

ADDENDUM: ObjectID is a bigint and the PK for the table, so I realized that I could change the statement to

AND PointA.ObjectID > PointB.ObjectID

This now takes half the time and gives me half the results (7 rows in 7 seconds). I now don't get duplicates (as in Point 4 is close to Point 8 followed by Point 8 is close to Point 4). However the performance still concerns me as the table will be very large, so any performance issues will become problems.

ADDENDUM 2: Changing the order of the JOIN and AND (or WHERE as suggested) as below makes no difference either.

DECLARE @TOL AS REAL
SET @TOL = 0.05

SELECT 
    PointA.ObjectId as ObjectIDa,
    PointA.Name as PTNameA,
    PointA.[Description] as PTdescA,
    PointB.ObjectId as ObjectIDb,
    PointB.Name as PTNameB,
    PointB.[Description] as PTdescB,
    ROUND(PointA.Geometry.STDistance(PointB.Geometry),3) DIST
FROM CadData.Survey.SurveyPoint PointA
  JOIN [CadData].Survey.SurveyPoint PointB
    ON PointA.ObjectId < PointB.ObjectID
    WHERE
    PointA.Geometry.STDistance(PointB.Geometry) < @TOL
ORDER BY ObjectIDa

I find it fascinating that I can change the @Tol value to something large that returns over 100 rows with no change in performance even though it requires many computations. But then adding a simple A

Land Surveyor
  • 99
  • 1
  • 7
  • 1
    Do you have indexes on the table `SurveyPoint`? What about on the column `SurveyPoint.ObjectID`? – Mike Dec 30 '13 at 05:39
  • SurveyPoint.ObjectID is the PK and is indexed. I also have a apatial index on the geometry column. – Land Surveyor Dec 30 '13 at 17:41
  • Are you joining to the same table? – BlakeH Dec 30 '13 at 18:43
  • Do you care if PointA and PointB are NOT the same but the distance is 0? What if you checked if PointA.Geometry.STDistance(PointB.Geometry) > 0? – n8wrl Dec 30 '13 at 18:55
  • Yes - same table. The idea is to find points in the same table that are close enough to be considered as a single point. – Land Surveyor Dec 30 '13 at 19:08
  • I see you already solved the problem, not if you've had in mind the parameters of the index, I had a similar problem with a geocoding and significantly increase performance when these parameters change, the thing is Pretty index grows, the other is also that so continued to add information to the table, and that index as granulated affect insertions http://technet.microsoft.com/en-us/library/bb964712(v=sql.105).aspx – phipex Jan 03 '14 at 19:59

4 Answers4

2

The execution plan is probably doing something behind the scenes when you add in the ObjectID comparison. Check the execution plan to see if the two different versions of the query are, for example, using an index seek vs. a table scan. If so, consider experimenting with query hints.

As a workaround, you could always use a subquery:

DECLARE @TOL AS REAL
SET @TOL = 0.05

SELECT 
    ObjectIDa,
    PTNameA,
    PTdescA,
    ObjectIDb,
    PTNameB,
    PTdescB,
    DIST
FROM
(
SELECT 
  PointA.ObjectId as ObjectIDa,
    PointA.Name as PTNameA,
    PointA.[Description] as PTdescA,
    PointB.ObjectId as ObjectIDb,
    PointB.Name as PTNameB,
    PointB.[Description] as PTdescB,
    ROUND(PointA.Geometry.STDistance(PointB.Geometry),3) DIST
FROM CadData.Survey.SurveyPoint PointA
  JOIN [CadData].Survey.SurveyPoint PointB
    ON PointA.Geometry.STDistance(PointB.Geometry) < @TOL
   -- AND
   -- PointA.ObjectId <> PointB.ObjectID
) Subquery
WHERE ObjectIDa <> ObjectIDb
ORDER BY ObjectIDa
Mike
  • 3,641
  • 3
  • 29
  • 39
  • Thanks. Subquery is new to me, but made no difference. I discovered how to view the estimated execution plan, and will spend some time learning about it - particularly the filter which is using 100%. I am still new at this. – Land Surveyor Dec 30 '13 at 19:06
  • 1
    My guess would be that the faster query is utilizing the spatial index, which sounds like it would be good for this query, since the goal is to find points that are close together. Perhaps introducing the ObjectId criteria causes the query optimizer to prefer the primary key's index instead, which would not be very useful considering the "not equals" condition. You may be able to see which index is being seeked / scanned in the execution plan for each query to confirm if that is what's happening. – Dr. Wily's Apprentice Dec 30 '13 at 21:40
2

This is a fun question.

It's not unrealistic that you get a large performance improvement by changing from "<>" to ">".

As others have mentioned, the trick is to get the most out of your indexes. Certainly by using ">", you should easily get the server to limit to that specific range on your PK - avoiding looking "backwards" when you've already checked looking "forwards".

This improvement will scale - will help as you add rows. But you're right to worry it won't help prevent any increase in work. As you're correctly thinking, as long as you have to scan a larger number of rows, it will take longer. And that's the case here because we always want to compare everything.

If the first part is looking good, just the TOL check, have you thought about splitting out the second part entirely?

Change the first part to dump into a temp table as

SELECT 
    PointA.ObjectId as ObjectIDa,
    PointA.Name as PTNameA,
    PointA.[Description] as PTdescA,
    PointB.ObjectId as ObjectIDb,
    PointB.Name as PTNameB,
    PointB.[Description] as PTdescB,
    ROUND(PointA.Geometry.STDistance(PointB.Geometry),3) DIST

into #AllDuplicatesWithRepeats

FROM CadData.Survey.SurveyPoint PointA
  JOIN [CadData].Survey.SurveyPoint PointB
    ON 
    PointA.Geometry.STDistance(PointB.Geometry) < @TOL
ORDER BY ObjectIDa

And they you can write the direct query that skips duplicates, below. It isn't special, but against that small set in the temp table it should be perfectly speedy.

Select
    *
from    
    #AllDuplicatesWithRepeats d1
        left join #AllDuplicatesWithRepeats d2 on (
                        d1.objectIDa = d2.objectIDb
                        and
                        d1.objectIDb = d2.objectIDa
                        )
where
    d2.objectIDb is null
Mike M
  • 1,382
  • 11
  • 23
  • 1
    Yes, this has been fun and your solution works, though I did simplify the second select (no join necessary). Also, I am guessing that you forgot to remove one of the constraints in the first join? This is what I ended up with and it executed in 2 seconds. (Too long - I will paste in original question) – Land Surveyor Dec 31 '13 at 03:14
  • -- I fixed it. Thank you for catching that, @Land Surveyor. – Mike M Dec 31 '13 at 11:24
1

Try using PointA.ObjectId <> PointB.ObjectID with a WHERE clause between the JOIN and the ORDER BY clause.

Like so:

DECLARE @TOL AS REAL
SET @TOL = 0.05

SELECT 
    PointA.ObjectId as ObjectIDa,
    PointA.Name as PTNameA,
    PointA.[Description] as PTdescA,
    PointB.ObjectId as ObjectIDb,
    PointB.Name as PTNameB,
    PointB.[Description] as PTdescB,
    ROUND(PointA.Geometry.STDistance(PointB.Geometry),3) DIST
FROM CadData.Survey.SurveyPoint PointA
  JOIN [CadData].Survey.SurveyPoint PointB
    ON PointA.Geometry.STDistance(PointB.Geometry) < @TOL
WHERE PointA.ObjectId <> PointB.ObjectID
ORDER BY ObjectIDa
EfrainReyes
  • 1,005
  • 2
  • 21
  • 55
1

With kudos to @Mike_M here is the edited Select which runs in 2 seconds.

SELECT 
    PointA.ObjectId as ObjectIDa,
    PointA.Name as PTNameA,
    PointA.[Description] as PTdescA,
    PointB.ObjectId as ObjectIDb,
    PointB.Name as PTNameB,
    PointB.[Description] as PTdescB,
    ROUND(PointA.Geometry.STDistance(PointB.Geometry),3) DIST

into #AllDuplicatesWithRepeats

FROM CadData.Survey.SurveyPoint PointA
  JOIN [CadData].Survey.SurveyPoint PointB
    ON PointA.Geometry.STDistance(PointB.Geometry) < @TOL  
ORDER BY ObjectIDa

Select
    *
from    
    #AllDuplicatesWithRepeats d1
Where
    d1.ObjectIDa < d1.ObjectIDb
Land Surveyor
  • 99
  • 1
  • 7