1

I'm trying to make a spatial query on two tables with around 10m rows each.

What I'm trying to do is to relate table1 to table2 based on the levenshtein distance of t1 and t2 in a radius of 50m around t1. If no record is in the radius, we chose the nearest neighbour.

My approach was to create two views: one for the levenshtein distance, and one for the nearest neighbour.

-- VIEW1
SELECT ID,a,b
FROM TABLE(SDO_JOIN(t1, 'SHAPE', t2, 'SHAPE', 'DISTANCE=50 UNIT=M')) geom_join
LEFT JOIN t1 ON geom_join."ROWID1" = t1."ROWID"
LEFT JOIN t2 ON geom_join."ROWID2" = t2."ROWID"
WHERE SDO_GEOM.VALIDATE_GEOMETRY(t2."SHAPE", 0.005) = 'TRUE'
AND SDO_GEOM.VALIDATE_GEOMETRY(t1."SHAPE", 0.005) = 'TRUE'

and taking the MIN of levenshtein distance.
-- VIEW2
SELECT ID,a,b
FROM t1, t2
WHERE SDO_NN(t2, t1, 'SDO_NUM_RES=1, UNIT=M') = 'TRUE'
AND SDO_GEOM.VALIDATE_GEOMETRY(t1."SHAPE", 0.005) = 'TRUE'
AND SDO_GEOM.VALIDATE_GEOMETRY(t2."SHAPE", 0.005) = 'TRUE'

And then create the table from these views as

SELECT ID, 
(CASE WHEN VIEW1.a IS NULL THEN VIEW2.a ELSE VIEW1.a END) a,
(CASE WHEN VIEW1.b IS NULL THEN VIEW2.b ELSE VIEW1.b END) b
FROM VIEW2
LEFT JOIN VIEW1 USING(ID)

This seems to work for small datasets, however fails for the real tables and raise the following error:

cx_Oracle.DatabaseError: ORA-29902: error in executing ODCIIndexStart() routine
ORA-04036: PGA memory used by the instance exceeds PGA_AGGREGATE_LIMIT
ORA-06512: at "MDSYS.SDO_INDEX_METHOD_10I", line 588

I could potentially raise the limit (currently 4Gb), however I'd like to check first if there was a solution without doing that.

enter image description here

Jon Heller
  • 34,999
  • 6
  • 74
  • 132
ReineLow
  • 21
  • 3
  • Does this problem only happen with the final select query, or dies it happen with one of the views? (When testing, make sure to select all of the data from the views and not just the first N rows.) If bad performance starts with one of the views, then this problem can be simplified and we don't need to tune the whole thing. If the bad performance only happens when you combine the two views, then there are some easy ways to force Oracle to evaluate the views separately, like I demonstrate in [this answer](https://stackoverflow.com/a/39761329/409172). – Jon Heller Feb 27 '22 at 19:16
  • I would start by leaving out the validation tests: they can be quite expensive, if the shapes themselves are expensive. A better approach is to validate all shapes once and for all, and to correct the errors. – Albert Godfrind Feb 28 '22 at 13:36
  • What geometries are those ? Points ? Lines ? Polygons ? What coordinate system ? – Albert Godfrind Feb 28 '22 at 13:37
  • "Levenshtein distance" is not what this computes. Levensthein distance is a form of edit distance between strings. The distance computed here is based on the straight line geometric distance between 2 geometric shapes in a 2D space. – Albert Godfrind Feb 28 '22 at 13:39
  • @AlbertGodfrind Both tables have line geometries and are both in SRID 2154. As for the levenshtein distance of the first view, I didn't add that part but you can consider one of the output attribute being something like the minimum of"UTL_MATCH.EDIT_DISTANCE("...", "...") DIST_LEV", with an aggregation on the ID. – ReineLow Feb 28 '22 at 17:56
  • The query plan appears to only do full table scans and hash joins. It is hard to tell which step is for which since the query plan you shared does not show the names of the tables or where the filters are applied. The estimated costs and cardinalities are just massive. Also I see two RIGHT OUTER JOINS where there should just be one LEFT OUTER JOIN, but that may be the optimizer rewriting the query. – Albert Godfrind Mar 01 '22 at 13:35
  • It's also not clear what `ID`, `a` and `b` are, or rather what table (T1 or T2) they come from. I assume ID is a unique identifier for the shapes in one of the tables ? But which one ? – Albert Godfrind Mar 01 '22 at 13:38
  • Finally you mention taking the minimum of the distance in view1 aggregated by ID. But that does not appear in your code. I think it would be better if you could explain what actual problem you are trying to solve, and also include some sample data for both tables that we can use to better understand. – Albert Godfrind Mar 01 '22 at 13:41

0 Answers0