Find rows with a column value close to the one of the current row

Question

I have a table in the following format:

id	lon	lat
1	10.111	20.415
2	10.099	30.132
3	10.110	20.414

I would like to create a new column that returns all IDs such that lon and lat are less then a tolerance value away, i.e. abs(lon_i - lon_j) < tol AND abs(lat_i-lat_j) < tol for all i != j.

My initial thought going about it is to create a temporary duplicate of the table and then join the table giving the names of the columns lon_2, lat_2, and then:

SELECT id 
FROM TABLE_1
WHERE ABS(TABLE_1.lon - TABLE_1.lon_2) < tol 
  AND ABS(TABLE_1.lat - TABLE_1.lat_2) < tol

But there must be a better and more efficient way to approach it.

You're doing it wrong. Longitude values for example ranges from -180 to + 180. So if you have the longitudes -179 and +179, subtracting them will tell you that they are 358° apart where in fact they are 2° apart (and that could be less than your tolerance value). — Salman A, Nov 05 '21 at 13:15
@SalmanA: It's typically an exotic corner case. The the 180° line of longitude passes through the middle of the Pacific. Unless we are dealing with flight or shipping data or some other exotic application, only the Russian provinz Chukotka and one of the Fidji islands might be affected. Even then it would be extremely unlikely to cross the date line with a bit of tolerance. Doesn't change the fact that you are right, of course. Operating with `geography` in PostGis would be the clean way. — Erwin Brandstetter, Nov 05 '21 at 19:19

score 2 · Accepted Answer · answered Oct 31 '21 at 14:36

There is no need for a temporary table.

You can do a self join:

SELECT t1.id, t2.id
FROM tablename t1 INNER JOIN tablename t2
ON t2.id <> t1.id AND ABS(t1.lon - t2.lon) < :tol AND ABS(t1.lat - t2.lat) < :tol;

This will return 1 row for each pair of ids that satisfies the conditions.

If you want for each id a comma separated list of all the ids that satisfy the conditions then you can aggregate and use STRING_AGG():

SELECT t1.id, STRING_AGG(t2.id, ';' ORDER BY t2.id)
FROM tablename t1 INNER JOIN tablename t2
ON t2.id <> t1.id AND ABS(t1.lon - t2.lon) < :tol AND ABS(t1.lat - t2.lat) < :tol
GROUP BY t1.id;

Erwin Brandstetter · Answer 2 · 2021-11-03T01:32:41.980

all IDs such that lon and lat are less then a tolerance value away

A simple self-join as demonstrated by forpas can solve the task.

However, the simple query has to compute deltas for every combination of two rows before eliminating pairs outside the given tolerance. Basically a Cartesian Product, which scales terribly with O(N²). If your table isn't trivially small, keep reading.

The target is to get O(N) instead. We're going to use some advanced concepts:

nearest neighbour search
GiST expression index
LATERAL join
recursive CTE

Nearest neighbour query

If you are unfamiliar, here is an introduction to Nearest-Neighbour Searching with PostGis (which may be in use for your geocodes?).

Either way, we can do the same with the type point in stock Postgres. We need a GiST index, while sticking to your original original table, based on an expression:

CREATE INDEX tbl_point_gist_idx ON tbl USING GiST (point(lon, lat));

Basic query - fast but imperfect

SELECT t.*, t1.id AS near_id, t1.lon AS near_lon, t1.lat AS near_lat
FROM   tbl t
JOIN   LATERAL (
   SELECT *
   FROM   tbl t1
   WHERE  t1.id <> t.id
   ORDER  BY point(t.lon, t.lat) <-> point(t1.lon, t1.lat)
   LIMIT  5  -- arbitrary
   ) t1 ON abs(t.lon - t1.lon) < 0.0011
       AND abs(t.lat - t1.lat) < 0.0011;

id	lon	lat	near_id	near_lon	near_lat
1	10.111	20.415	3	10.110	20.414
3	10.110	20.414	1	10.111	20.415

db<>fiddle here

Note the LATERAL join: another advanced concept. See:

What is the difference between LATERAL JOIN and a subquery in PostgreSQL?

Using the distance operator <-> to compute geometric distance between points.

A simple correlated subquery wouldn't do. We need to get the nearest neighbours before filtering for distance. This can use the GiST index very efficiently. Else, if a row has no qualifying neighbours, Postgres would have to search the whole table before giving up - which defies the purpose.

Problem: We are forced to use an arbitrary LIMIT, 5 in the example. Additional neighbours within tolerance will be snubbed. We'd need to use the maximum number of possible neighbours, which is yet unknown. With few big outliers it would also be inefficient to overshoot for the rest.

Also, the basic query returns one row per neighbour. We want to aggregate to a single row with all neighbours.

Fast and almost correct

Marry the above with another advanced concept: a recursive CTE:

WITH RECURSIVE d (tol) AS (SELECT float8 '0.001')  -- input tolerance here
, cte AS (
   SELECT t.id, t.lon, t.lat, d.tol
        , 1 AS step, ARRAY [t1.id] AS neighbors
   FROM   tbl t
   CROSS  JOIN d
   JOIN   LATERAL (
      SELECT *
      FROM   tbl t1
      WHERE  t1.id <> t.id
      ORDER  BY point(t.lon, t.lat) <-> point(t1.lon, t1.lat)
      LIMIT  1
      ) t1 ON abs(t.lon - t1.lon) < d.tol
          AND abs(t.lat - t1.lat) < d.tol

   UNION ALL
   SELECT t.id, t.lon, t.lat, t.tol
        , t.step + 1, neighbors || t1.id
   FROM   cte t
   JOIN   LATERAL (
      SELECT *
      FROM   tbl t1
      WHERE  t1.id <> t.id
      ORDER  BY point(t.lon, t.lat) <-> point(t1.lon, t1.lat)  -- !
      OFFSET t.step                                            -- !
      LIMIT  1                                                 -- !
      ) t1 ON abs(t.lon - t1.lon) < t.tol                      -- !
          AND abs(t.lat - t1.lat) < t.tol                      -- !
   )
SELECT DISTINCT ON (id) id, lon, lat, neighbors
FROM   cte
ORDER  BY id, step DESC;

id	lon	lat	neighbours
1	10.111	20.415	{3}
3	10.110	20.414	{1}

db<>fiddle here

The first (non-recursive) CTE d is just for convenience, to provide the tolerance once. If you wrap the query in a function or work with a prepared statement of concatenate it dynamically, you can drop it and pass the value in multiple spots directly.

The next CTE cte is the rCTE. The non-recursive term only picks the nearest neighbour for each point (LIMIT 1). If it's within tolerance, keep looking, else we're done - eliminating all rows without neighbours immediately.
In the recursive term, increase the OFFSET with every step to inspect the next neighbour, etc.

In the outer SELECT we only take the last step per id with DISTINCT ON. See:

Select first row in each GROUP BY group?

The remaining corner case error is this: you check for abs(lon_i - lon_j) < tol AND abs(lat_i-lat_j) < tol, and that's different from the geometric distance by up to factor sqrt(2). So the query might stop at a neighbour outside tolerance, while the next neighbour further away should still be included. Bounding box vs. bounding circle.

Faster and correct

Check for actual geometric distance instead. Typically makes more sense to begin with:

WITH RECURSIVE d(tol) AS (SELECT float8 '0.001415')  --  input tolerance here
, cte AS (
   SELECT t.id, t.lon, t.lat, d.tol
        , 1 AS step, ARRAY [t1.id] AS neighbors
   FROM   tbl t
   CROSS  JOIN d
   JOIN   LATERAL (
      SELECT *, point(t.lon, t.lat) <-> point(t1.lon, t1.lat) AS dist
      FROM   tbl t1
      WHERE  t1.id <> t.id
      ORDER  BY dist
      LIMIT  1
      ) t1 ON dist < d.tol

   UNION ALL
   SELECT t.id, t.lon, t.lat, t.tol
        , t.step + 1, neighbors || t1.id
   FROM   cte t
   JOIN   LATERAL (
      SELECT *, point(t.lon, t.lat) <-> point(t1.lon, t1.lat) AS dist
      FROM   tbl t1
      WHERE  t1.id <> t.id
      ORDER  BY dist         -- !
      OFFSET t.step          -- !
      LIMIT  1               -- !
      ) t1 ON dist < t.tol   -- !
   )
SELECT DISTINCT ON (id) id, lon, lat, neighbors
FROM   cte
ORDER  BY id, step DESC;

Same result.

db<>fiddle here

Better yet: work with `point` to begin with

Improved table defintion:

CREATE TABLE tbl (id int PRIMARY KEY, lonlat point);  -- !!
INSERT INTO tbl VALUES
  (1,  '10.111, 20.415')
, (2,  '10.099, 30.132')
, (3,  '10.110, 20.414')
;

A point consists of two float8 quantities and occupies 16 bytes on disk.

Now index the column directly (cheaper):

CREATE INDEX tbl_lonlat_gist_idx ON tbl USING GiST (lonlat);

The query gets simpler and a bit faster, yet:

WITH RECURSIVE d(tol) AS (SELECT float8 '0.001415')  --  input tolerance here
, cte AS (
   SELECT t.id, t.lonlat, d.tol, 1 AS step, ARRAY[t1.id] AS neighbours
   FROM   tbl t
   CROSS  JOIN d
   JOIN   LATERAL (
      SELECT *, t.lonlat <-> t1.lonlat AS dist
      FROM   tbl t1
      WHERE  t1.id <> t.id
      ORDER  BY dist
      LIMIT  1
      ) t1 ON dist < d.tol

   UNION ALL
   SELECT t.id, t.lonlat, t.tol, t.step + 1, neighbours || t1.id
   FROM   cte t
   JOIN   LATERAL (
      SELECT *, t.lonlat <-> t1.lonlat AS dist
      FROM   tbl t1
      WHERE  t1.id <> t.id
      ORDER  BY dist         -- !
      OFFSET t.step          -- !
      LIMIT  1               -- !
      ) t1 ON dist < t.tol   -- !
   )
SELECT DISTINCT ON (id) id, lonlat, neighbours
FROM   cte
ORDER  BY id, step DESC;

id	lonlat	neighbours
1	(10.111,20.415)	{3}
3	(10.11,20.414)	{1}

db<>fiddle here

Superb answer and explanation as usual. Want to try to go for **perfect** by considering that lat-lon pairs are spherical coordinates and not on a Cartesian plane as assumed by `<->`? See my answer [here](https://stackoverflow.com/questions/22602722/which-geo-implementation-to-use-for-millions-of-points/22603215#22603215). — Patrick, Nov 03 '21 at 13:06
@Patrick: The curvature of the earth plays a bigger role in other cases. Since this is all about finding points right next to each other, it's less important here. Else, we might work with proper `geography` types and PostGis functions. Great answer over there, though! — Erwin Brandstetter, Nov 03 '21 at 13:29
@gbox: I would be interested in the performance of my queries compared to the simple query in your actual table, along with the row count an your version of Postgres. — Erwin Brandstetter, Nov 04 '21 at 02:03

Find rows with a column value close to the one of the current row

2 Answers2

Nearest neighbour query

Basic query - fast but imperfect

Fast and almost correct

Faster and correct

Better yet: work with point to begin with

Better yet: work with `point` to begin with