How to find intersecting geographies between two tables recursively

Question

I'm running Postgres 9.6.1 and PostGIS 2.3.0 r15146 and have two tables.
geographies may have 150,000,000 rows, paths may have 10,000,000 rows:

CREATE TABLE paths (id uuid NOT NULL, path path NOT NULL, PRIMARY KEY (id))
CREATE TABLE geographies (id uuid NOT NULL, geography geography NOT NULL, PRIMARY KEY (id))

Given an array/set of ids for table geographies, what is the "best" way of finding all intersecting paths and geometries?

In other words, if an initial geography has a corresponding intersecting path we need to also find all other geographies that this path intersects. From there, we need to find all other paths that these newly found geographies intersect, and so on until we've found all possible intersections.

The initial geography ids (our input) may be anywhere from 0 to 700. With an average around 40.
Minimum intersections will be 0, max will be about 1000. Average likely around 20, typically less than 100 connected.

I've created a function that does this, but I'm new to GIS in PostGIS, and Postgres in general. I've posted my solution as an answer to this question.

I feel like there should be a more eloquent and faster way of doing this than what I've come up with.

Would you consider storing the `pathIds` and `geographyIds` in tables instead of arrays? — Steve Chambers, Feb 01 '17 at 15:17
Absolutely. I have something working now that returns a single `table(id uuid, type character varying)` where `type` is either `path` or `geography`. — David Murdoch, Feb 01 '17 at 16:07
Consider asking over at [gis.se] where there are knowledgable PostGIS people. — Toby Speight, Feb 03 '17 at 15:03
Thanks @TobySpeight. I've asked a moderator to just move the question. — David Murdoch, Feb 03 '17 at 20:29
Please provide (in the question) your Postgres and PostGIS version numbers: `SELECT extversion FROM pg_extension WHERE extname = 'postgis';` and `SELECT version();` How many IDs are passed to the function initially (min / max / avg)? Also, the column `fk_id` used in your answer is missing in the table definition in the question. Please clarify. — Erwin Brandstetter, Feb 04 '17 at 19:50
I added the version info to the question and some estimates, as requested. `fk_id` is used in my actual database and is what I use to get the initial geographies. It is a bigint field that represents a geographic area (it's actually a precomputed s2cell id at level 15). There are always exactly 13 `fk_id`s passed in. Sorry I didn't include this in the initial question (I didn't want to get the question closed as "too localized"). — David Murdoch, Feb 04 '17 at 20:19
Adding a precise use case *never* makes it a candidate for "too localized". Your update only made the question more useful. — Erwin Brandstetter, Feb 05 '17 at 03:48
No problem. The days here have been very busy. I will still post after that. — Clodoaldo Neto, Feb 07 '17 at 17:23

Erwin Brandstetter · Accepted Answer · 2021-04-18T02:27:33.887

Your function can be radically simplified.

Setup

I suggest you convert the column paths.path to data type geography (or at least geometry). path is a native Postgres type and does not play well with PostGIS functions and spatial indexes. You would have to cast path::geometry or path::geometry::geography (resulting in a LINESTRING internally) to make it work with PostGIS functions like ST_Intersects().

My answer is based on these adapted tables:

CREATE TABLE paths (
   id uuid PRIMARY KEY
 , path geography NOT NULL
);

CREATE TABLE geographies (
   id uuid PRIMARY KEY
 , geography geography NOT NULL
 , fk_id text NOT NULL
);

Everything works with data type geometry for both columns just as well. geography is generally more exact but also more expensive. Which to use? Read the PostGIS FAQ here.

Solution 1: Your function optimized

CREATE OR REPLACE FUNCTION public.function_name(_fk_ids text[])
  RETURNS TABLE(id uuid, type text)
  LANGUAGE plpgsql AS
$func$
DECLARE
   _row_ct int;
   _loop_ct int := 0;

BEGIN
   CREATE TEMP TABLE _geo ON COMMIT DROP AS  -- dropped at end of transaction
   SELECT DISTINCT ON (g.id) g.id, g.geography, _loop_ct AS loop_ct -- dupes possible?
   FROM   geographies g
   WHERE  g.fk_id = ANY(_fk_ids);

   GET DIAGNOSTICS _row_ct = ROW_COUNT;

   IF _row_ct = 0 THEN  -- no rows found, return empty result immediately
      RETURN;           -- exit function
   END IF;

   CREATE TEMP TABLE _path ON COMMIT DROP AS
   SELECT DISTINCT ON (p.id) p.id, p.path, _loop_ct AS loop_ct
   FROM   _geo  g
   JOIN   paths p ON ST_Intersects(g.geography, p.path);  -- no dupes yet

   GET DIAGNOSTICS _row_ct = ROW_COUNT;

   IF _row_ct = 0 THEN  -- no rows found, return _geo immediately
      RETURN QUERY SELECT g.id, text 'geo' FROM _geo g;
      RETURN;   
   END IF;

   ALTER TABLE _geo  ADD CONSTRAINT g_uni UNIQUE (id);  -- required for UPSERT
   ALTER TABLE _path ADD CONSTRAINT p_uni UNIQUE (id);

   LOOP
      _loop_ct := _loop_ct + 1;

      INSERT INTO _geo(id, geography, loop_ct)
      SELECT DISTINCT ON (g.id) g.id, g.geography, _loop_ct
      FROM   _paths      p
      JOIN   geographies g ON ST_Intersects(g.geography, p.path)
      WHERE  p.loop_ct = _loop_ct - 1   -- only use last round!
      ON     CONFLICT ON CONSTRAINT g_uni DO NOTHING;  -- eliminate new dupes

      EXIT WHEN NOT FOUND;

      INSERT INTO _path(id, path, loop_ct)
      SELECT DISTINCT ON (p.id) p.id, p.path, _loop_ct
      FROM   _geo  g
      JOIN   paths p ON ST_Intersects(g.geography, p.path)
      WHERE  g.loop_ct = _loop_ct - 1
      ON     CONFLICT ON CONSTRAINT p_uni DO NOTHING;

      EXIT WHEN NOT FOUND;
   END LOOP;

   RETURN QUERY
   SELECT g.id, text 'geo'  FROM _geo g
   UNION ALL
   SELECT p.id, text 'path' FROM _path p;
END
$func$;

Call:

SELECT * FROM public.function_name('{foo,bar}');

Much faster than what you have.

Major points

You based queries on the whole set, instead of the latest additions to the set only. This gets increasingly slower with every loop without need. I added a loop counter (loop_ct) to avoid redundant work.

Be sure to have spatial GiST indexes on geographies.geography and paths.path:

  CREATE INDEX geo_geo_gix ON geographies USING GIST (geography);
  CREATE INDEX paths_path_gix ON paths USING GIST (path);

Since Postgres 9.5 index-only scans would be an option for GiST indexes. You might add id as second index column. The benefit depends on many factors, you'd have to test. However, there is no fitting operator GiST class for the uuid type. It would work with bigint after installing the extension btree_gist:

Postgres multi-column index (integer, boolean, and array)
Multicolumn index on 3 fields with heterogenous data types
Have a fitting index on g.fk_id, too. Again, a multicolumn index on (fk_id, id, geography) might pay if you can get index-only scans out of it. Default btree index, fk_id must be first index column. Especially if you run the query often and rarely update the table and table rows are much wider than the index.
You can initialize variables at declaration time. Only needed once after the rewrite.
ON COMMIT DROP drops the temp tables at the end of the transaction automatically. So I removed dropping tables explicitly. But you get an exception if you call the function in the same transaction twice. In the function I would check for existence of the temp table and use TRUNCATE in this case. Related:
How to check if a table exists in a given schema
Use GET DIAGNOSTICS to get the row count instead of running another query for the count.
Count rows affected by DELETE

You need GET DIAGNOSTICS. CREATE TABLE does not set FOUND (as is mentioned in the manual).

It's faster to add an index or PK / UNIQUE constraint after filling the table. And not before we actually need it.
ON CONFLICT ... DO ... is the simpler and cheaper way for UPSERT since Postgres 9.5.
How to UPSERT (MERGE, INSERT ... ON DUPLICATE UPDATE) in PostgreSQL?

For the simple form of the command you just list index columns or expressions (like ON CONFLICT (id) DO ...) and let Postgres perform unique index inference to determine an arbiter constraint or index. I later optimized by providing the constraint directly. But for this we need an actual constraint - a unique index is not enough. Fixed accordingly. Details in the manual here.

It may help to ANALYZE temporary tables manually to help Postgres find the best query plan. (But I don't think you need it in your case.)
Are regular VACUUM ANALYZE still recommended under 9.1?
_geo_ct - _geographyLength > 0 is an awkward and more expensive way of saying _geo_ct > _geographyLength. But that's gone completely now.
Don't quote the language name. Just LANGUAGE plpgsql.
Your function parameter is varchar[] for an array of fk_id, but you later commented:

It is a bigint field that represents a geographic area (it's actually a precomputed s2cell id at level 15).

I don't know s2cell id at level 15, but ideally you pass an array of matching data type, or if that's not an option default to text[].

Also since you commented:

There are always exactly 13 fk_ids passed in.

This seems like a perfect use case for a VARIADIC function parameter. So your function definition would be:

CREATE OR REPLACE FUNCTION public.function_name(_fk_ids VARIADIC text[]) ...

Details:

Pass multiple values in single parameter

Solution 2: Plain SQL with recursive CTE

It's hard to wrap an rCTE around two alternating loops, but possible with some SQL finesse:

WITH RECURSIVE cte AS (
   SELECT g.id, g.geography::text, NULL::text AS path, text 'geo' AS type
   FROM   geographies g
   WHERE  g.fk_id = ANY($kf_ids)  -- your input array here

   UNION
   SELECT p.id, g.geography::text, p.path::text
        , CASE WHEN p.path IS NULL THEN 'geo' ELSE 'path' END AS type
   FROM   cte              c
   LEFT   JOIN paths       p ON c.type = 'geo'
                            AND ST_Intersects(c.geography::geography, p.path)
   LEFT   JOIN geographies g ON c.type = 'path'
                            AND ST_Intersects(g.geography, c.path::geography)
   WHERE (p.path IS NOT NULL OR g.geography IS NOT NULL)
   )
SELECT id, type FROM cte;

That's all.
You need the same indexes as above. You might wrap it into an SQL function for repeated use.

Major additional points

The cast to text is necessary because the geography type is not "hashable" (same for geometry). (See this open PostGIS issue for details.) Work around it by casting to text. Rows are unique by virtue of (id, type) alone, we can ignore the geography columns for this. Cast back to geography for the join. Shouldn't cost too much extra.
We need two LEFT JOIN so not to exclude rows, because at each iteration only one of the two tables may contribute more rows.
The final condition makes sure we are not done, yet:
```
  WHERE (p.path IS NOT NULL OR g.geography IS NOT NULL)
```

This works because duplicate findings are excluded from the temporary intermediate table. The manual:

For UNION (but not UNION ALL), discard duplicate rows and rows that duplicate any previous result row. Include all remaining rows in the result of the recursive query, and also place them in a temporary intermediate table.

So which is faster?

The rCTE is probably faster than the function for small result sets. The temp tables and indexes in the function mean considerably more overhead. For large result sets the function may be faster, though. Only testing with your actual setup can give you a definitive answer.*

See the OP's feedback in the comment.

@DavidMurdoch: I would be interested: Did you happen to test both solutions and compare performance for small / big (with many iterations) result sets? (Thanks for the generous bounties, btw!) — Erwin Brandstetter, Feb 14 '17 at 15:38
I did. I'm going with the rCTE as it is faster and easier to manage. I ended up returning many more columns than just the ids and geometries (`geometry` looks to be significantly faster than `geography`). Curiously, I found that for certain columns selected from within a `CASE` it would increase the execution time by a factor of 10! I can't remember the specifics now, but I found it quite odd. — David Murdoch, Feb 15 '17 at 14:10
Performance testing can be tricky, many possible contributing factors. Probably the `CASE` made Postgres switch to a different query plan. Maybe issues with the server configuration ([cost settings, table statistics?](http://stackoverflow.com/a/8229000/939860)) Factor 10 sounds odd, though. The query plan often helps to understand. Might be material for another question if you need answers. — Erwin Brandstetter, Feb 16 '17 at 10:31
I tried reproducing again and couldn't figure it out. I showed some other devs in the office and they were baffled as well. If I ever figure it out I'll try to make a reduced test case and post a question about it. — David Murdoch, Feb 16 '17 at 12:39

David Murdoch · Answer 2 · 2017-02-03T21:12:05.107

I figured it'd be good to post my own solution here even if it isn't optimal.

Here is what I came up with (using Steve Chambers' advice):

CREATE OR REPLACE FUNCTION public.function_name(
    _fk_ids character varying[])
    RETURNS TABLE(id uuid, type character varying)
    LANGUAGE 'plpgsql'
    COST 100.0
    VOLATILE
    ROWS 1000.0
AS $function$

    DECLARE
        _pathLength bigint;
        _geographyLength bigint;

        _currentPathLength bigint;
        _currentGeographyLength bigint;
    BEGIN
        DROP TABLE IF EXISTS _pathIds;
        DROP TABLE IF EXISTS _geographyIds;
        CREATE TEMPORARY TABLE _pathIds (id UUID PRIMARY KEY);
        CREATE TEMPORARY TABLE _geographyIds (id UUID PRIMARY KEY);

        -- get all geographies in the specified _fk_ids
        INSERT INTO _geographyIds
            SELECT g.id
            FROM geographies g
            WHERE g.fk_id= ANY(_fk_ids);

        _pathLength := 0;
        _geographyLength := 0;
        _currentPathLength := 0;
        _currentGeographyLength := (SELECT COUNT(_geographyIds.id) FROM _geographyIds);
        -- _pathIds := ARRAY[]::uuid[];

        WHILE (_currentPathLength - _pathLength > 0) OR (_currentGeographyLength - _geographyLength > 0) LOOP
            _pathLength := (SELECT COUNT(_pathIds.id) FROM _pathIds);
            _geographyLength := (SELECT COUNT(_geographyIds.id) FROM _geographyIds);

            -- gets all paths that have paths that intersect the geographies that aren't in the current list of path ids

            INSERT INTO _pathIds 
                SELECT DISTINCT p.id
                    FROM paths p
                    JOIN geographies g ON ST_Intersects(g.geography, p.path)
                    WHERE
                        g.id IN (SELECT _geographyIds.id FROM _geographyIds) AND
                        p.id NOT IN (SELECT _pathIds.id from _pathIds);

            -- gets all geographies that have paths that intersect the paths that aren't in the current list of geography ids
            INSERT INTO _geographyIds 
                SELECT DISTINCT g.id
                    FROM geographies g
                    JOIN paths p ON ST_Intersects(g.geography, p.path)
                    WHERE
                        p.id IN (SELECT _pathIds.id FROM _pathIds) AND
                        g.id NOT IN (SELECT _geographyIds.id FROM _geographyIds);

            _currentPathLength := (SELECT COUNT(_pathIds.id) FROM _pathIds);
            _currentGeographyLength := (SELECT COUNT(_geographyIds.id) FROM _geographyIds);
        END LOOP;

        RETURN QUERY
            SELECT _geographyIds.id, 'geography' AS type FROM _geographyIds
            UNION ALL
            SELECT _pathIds.id, 'path' AS type FROM _pathIds;
    END;

$function$;

I am confused about `ST_Intersects(g.geography, p.path)` in your answer. There is no variant of `ST_Intersects()` in PostGIS 2.3.1 that would accept the data type `path`, and the only registered cast targets for `path` are `point`, `polygon` and `geometry`. Is `paths.path` really data type `path`? — Erwin Brandstetter, Feb 04 '17 at 23:24
You are totally right. `p.path` is actually a `geometry` type in my database! Sorry about that! I don't have a problem changing the type to `geography`, so your answer looks like it'll be the winner. For this function, is it safe to assume using the `geography` type would be more performant than a `geometry` since I would be avoiding a cast? — David Murdoch, Feb 05 '17 at 13:01
The cast is very cheap. Operating on the `geography` type is generally more exact but also more expensive. I suggest to do all tests on either two `geometry` or two `geography` values. Ideally, you have matching types. If the underlying tables have different types, explicitly cast to one type in the the query (`geography::geometry`) and have a matching expression index for the respective column, like `CREATE INDEX ON geographies USING GIST ((geography::geometry));` Note the extra parentheses. Which to chose? Read: http://postgis.net/docs/PostGIS_FAQ.html#idm1018 — Erwin Brandstetter, Feb 05 '17 at 16:54

Clodoaldo Neto · Answer 3 · 2017-02-09T17:51:38.310

Sample plot and data from this script

It can be pure relational with an aggregate function. This implementation uses one path table and one point table. Both are geometries. The point is easier to create test data with and to test than a generic geography but it should be simple to adapt.

create table path (
    path_text text primary key,
    path geometry(linestring) not null
);
create table point (
   point_text text primary key,
   point geometry(point) not null
);

A type to keep the aggregate function's state:

create type mpath_mpoint as (
    mpath geometry(multilinestring),
    mpoint geometry(multipoint)
);

The state building function:

create or replace function path_point_intersect (
    _i mpath_mpoint[], _e mpath_mpoint
) returns mpath_mpoint[] as $$

    with e as (select (e).mpath, (e).mpoint from (values (_e)) e (e)),
    i as (select mpath, mpoint from unnest(_i) i (mpath, mpoint))
    select array_agg((mpath, mpoint)::mpath_mpoint)
    from (
        select
            st_multi(st_union(i.mpoint, e.mpoint)) as mpoint,
            (
                select st_collect(gd)
                from (
                    select gd from st_dump(i.mpath) a (a, gd)
                    union all
                    select gd from st_dump(e.mpath) b (a, gd)
                ) s
            ) as mpath
        from i inner join e on st_intersects(i.mpoint, e.mpoint)

        union all
        select i.mpoint, i.mpath
        from i inner join e on not st_intersects(i.mpoint, e.mpoint)

        union all
        select e.mpoint, e.mpath
        from e
        where not exists (
            select 1 from i
            where st_intersects(i.mpoint, e.mpoint)
        )
    ) s;
$$ language sql;

The aggregate:

create aggregate path_point_agg (mpath_mpoint) (
    sfunc = path_point_intersect,
    stype = mpath_mpoint[]
);

This query will return a set of multilinestring, multipoint strings containing the matched paths/points:

select st_astext(mpath), st_astext(mpoint)
from unnest((
    select path_point_agg((st_multi(path), st_multi(mpoint))::mpath_mpoint)
    from (
        select path, st_union(point) as mpoint
        from
            path 
            inner join
            point on st_intersects(path, point)
        group by path
    ) s
)) m (mpath, mpoint)
;
                         st_astext                         |          st_astext          
-----------------------------------------------------------+-----------------------------
 MULTILINESTRING((-10 0,10 0,8 3),(0 -10,0 10),(2 1,4 -1)) | MULTIPOINT(0 0,0 5,3 0,5 0)
 MULTILINESTRING((-9 -8,4 -8),(-8 -9,-8 6))                | MULTIPOINT(-8 -8,2 -8)
 MULTILINESTRING((-7 -4,-3 4,-5 6))                        | MULTIPOINT(-6 -2)