all IDs such that lon
and lat
are less then a tolerance value away
A simple self-join as demonstrated by forpas can solve the task.
However, the simple query has to compute deltas for every combination of two rows before eliminating pairs outside the given tolerance. Basically a Cartesian Product, which scales terribly with O(N²). If your table isn't trivially small, keep reading.
The target is to get O(N) instead. We're going to use some advanced concepts:
- nearest neighbour search
- GiST expression index
LATERAL
join
- recursive CTE
Nearest neighbour query
If you are unfamiliar, here is an introduction to Nearest-Neighbour Searching with PostGis (which may be in use for your geocodes?).
Either way, we can do the same with the type point
in stock Postgres. We need a GiST index, while sticking to your original original table, based on an expression:
CREATE INDEX tbl_point_gist_idx ON tbl USING GiST (point(lon, lat));
Basic query - fast but imperfect
SELECT t.*, t1.id AS near_id, t1.lon AS near_lon, t1.lat AS near_lat
FROM tbl t
JOIN LATERAL (
SELECT *
FROM tbl t1
WHERE t1.id <> t.id
ORDER BY point(t.lon, t.lat) <-> point(t1.lon, t1.lat)
LIMIT 5 -- arbitrary
) t1 ON abs(t.lon - t1.lon) < 0.0011
AND abs(t.lat - t1.lat) < 0.0011;
id |
lon |
lat |
near_id |
near_lon |
near_lat |
1 |
10.111 |
20.415 |
3 |
10.110 |
20.414 |
3 |
10.110 |
20.414 |
1 |
10.111 |
20.415 |
db<>fiddle here
Note the LATERAL
join: another advanced concept. See:
Using the distance operator <->
to compute geometric distance between points.
A simple correlated subquery wouldn't do. We need to get the nearest neighbours before filtering for distance. This can use the GiST index very efficiently. Else, if a row has no qualifying neighbours, Postgres would have to search the whole table before giving up - which defies the purpose.
Problem: We are forced to use an arbitrary LIMIT
, 5 in the example. Additional neighbours within tolerance will be snubbed. We'd need to use the maximum number of possible neighbours, which is yet unknown. With few big outliers it would also be inefficient to overshoot for the rest.
Also, the basic query returns one row per neighbour. We want to aggregate to a single row with all neighbours.
Fast and almost correct
Marry the above with another advanced concept: a recursive CTE:
WITH RECURSIVE d (tol) AS (SELECT float8 '0.001') -- input tolerance here
, cte AS (
SELECT t.id, t.lon, t.lat, d.tol
, 1 AS step, ARRAY [t1.id] AS neighbors
FROM tbl t
CROSS JOIN d
JOIN LATERAL (
SELECT *
FROM tbl t1
WHERE t1.id <> t.id
ORDER BY point(t.lon, t.lat) <-> point(t1.lon, t1.lat)
LIMIT 1
) t1 ON abs(t.lon - t1.lon) < d.tol
AND abs(t.lat - t1.lat) < d.tol
UNION ALL
SELECT t.id, t.lon, t.lat, t.tol
, t.step + 1, neighbors || t1.id
FROM cte t
JOIN LATERAL (
SELECT *
FROM tbl t1
WHERE t1.id <> t.id
ORDER BY point(t.lon, t.lat) <-> point(t1.lon, t1.lat) -- !
OFFSET t.step -- !
LIMIT 1 -- !
) t1 ON abs(t.lon - t1.lon) < t.tol -- !
AND abs(t.lat - t1.lat) < t.tol -- !
)
SELECT DISTINCT ON (id) id, lon, lat, neighbors
FROM cte
ORDER BY id, step DESC;
id |
lon |
lat |
neighbours |
1 |
10.111 |
20.415 |
{3} |
3 |
10.110 |
20.414 |
{1} |
db<>fiddle here
The first (non-recursive) CTE d
is just for convenience, to provide the tolerance once. If you wrap the query in a function or work with a prepared statement of concatenate it dynamically, you can drop it and pass the value in multiple spots directly.
The next CTE cte
is the rCTE. The non-recursive term only picks the nearest neighbour for each point (LIMIT 1
). If it's within tolerance, keep looking, else we're done - eliminating all rows without neighbours immediately.
In the recursive term, increase the OFFSET
with every step
to inspect the next neighbour, etc.
In the outer SELECT
we only take the last step per id
with DISTINCT ON
. See:
The remaining corner case error is this: you check for abs(lon_i - lon_j) < tol AND abs(lat_i-lat_j) < tol
, and that's different from the geometric distance by up to factor sqrt(2)
. So the query might stop at a neighbour outside tolerance, while the next neighbour further away should still be included. Bounding box vs. bounding circle.
Faster and correct
Check for actual geometric distance instead. Typically makes more sense to begin with:
WITH RECURSIVE d(tol) AS (SELECT float8 '0.001415') -- input tolerance here
, cte AS (
SELECT t.id, t.lon, t.lat, d.tol
, 1 AS step, ARRAY [t1.id] AS neighbors
FROM tbl t
CROSS JOIN d
JOIN LATERAL (
SELECT *, point(t.lon, t.lat) <-> point(t1.lon, t1.lat) AS dist
FROM tbl t1
WHERE t1.id <> t.id
ORDER BY dist
LIMIT 1
) t1 ON dist < d.tol
UNION ALL
SELECT t.id, t.lon, t.lat, t.tol
, t.step + 1, neighbors || t1.id
FROM cte t
JOIN LATERAL (
SELECT *, point(t.lon, t.lat) <-> point(t1.lon, t1.lat) AS dist
FROM tbl t1
WHERE t1.id <> t.id
ORDER BY dist -- !
OFFSET t.step -- !
LIMIT 1 -- !
) t1 ON dist < t.tol -- !
)
SELECT DISTINCT ON (id) id, lon, lat, neighbors
FROM cte
ORDER BY id, step DESC;
Same result.
db<>fiddle here
Better yet: work with point
to begin with
Improved table defintion:
CREATE TABLE tbl (id int PRIMARY KEY, lonlat point); -- !!
INSERT INTO tbl VALUES
(1, '10.111, 20.415')
, (2, '10.099, 30.132')
, (3, '10.110, 20.414')
;
A point
consists of two float8 quantities and occupies 16 bytes on disk.
Now index the column directly (cheaper):
CREATE INDEX tbl_lonlat_gist_idx ON tbl USING GiST (lonlat);
The query gets simpler and a bit faster, yet:
WITH RECURSIVE d(tol) AS (SELECT float8 '0.001415') -- input tolerance here
, cte AS (
SELECT t.id, t.lonlat, d.tol, 1 AS step, ARRAY[t1.id] AS neighbours
FROM tbl t
CROSS JOIN d
JOIN LATERAL (
SELECT *, t.lonlat <-> t1.lonlat AS dist
FROM tbl t1
WHERE t1.id <> t.id
ORDER BY dist
LIMIT 1
) t1 ON dist < d.tol
UNION ALL
SELECT t.id, t.lonlat, t.tol, t.step + 1, neighbours || t1.id
FROM cte t
JOIN LATERAL (
SELECT *, t.lonlat <-> t1.lonlat AS dist
FROM tbl t1
WHERE t1.id <> t.id
ORDER BY dist -- !
OFFSET t.step -- !
LIMIT 1 -- !
) t1 ON dist < t.tol -- !
)
SELECT DISTINCT ON (id) id, lonlat, neighbours
FROM cte
ORDER BY id, step DESC;
id |
lonlat |
neighbours |
1 |
(10.111,20.415) |
{3} |
3 |
(10.11,20.414) |
{1} |
db<>fiddle here
Related: