Sorting a large spatial selection is not using GiST index (Postgres 11.5)

Question

I'm having a table (demo) with a sequence as its primary key (seqno) and a geometry property contained within a JSONB column (doc). I have configured a primary key constraint for the sequence column and a GiST index for the geometry. I have already gathered statistics by running VACUUM ANALYZE. It's a fairly large table (42M rows).

CREATE TABLE demo
(
    seqno bigint NOT NULL DEFAULT nextval('seqno'::regclass),
    doc jsonb NOT NULL DEFAULT '{}'::jsonb,
    CONSTRAINT demo_pkey PRIMARY KEY (seqno)
)

CREATE INDEX demo_doc_geometry_gist
ON demo USING gist (st_geometryfromtext(doc ->> 'geometry'::text))

I want to perform a spatial filter on a rather large area and return the first 10 rows, sorted by its primary key. Therefore, I have tried the following query:

SELECT seqno, doc
FROM demo
WHERE ST_Within(ST_GeometryFromText((doc->>'geometry')), ST_GeometryFromText('POLYGON((4.478054829251019 52.61266886732067,5.247097798001019 52.61266886732067,5.247097798001019 52.156694555984416,4.478054829251019 52.156694555984416,4.478054829251019 52.61266886732067))'))
ORDER BY seqno
LIMIT 10

This results in the following query plan:

Limit  (cost=1000.59..15169.06 rows=10 width=633) (actual time=2479.372..2496.737 rows=10 loops=1)
  ->  Gather Merge  (cost=1000.59..19780184.81 rows=13960 width=633) (actual time=2479.370..2496.732 rows=10 loops=1)
        Workers Planned: 2
        Workers Launched: 2
        ->  Parallel Index Scan using demo_pkey on demo  (cost=0.56..19777573.45 rows=5817 width=633) (actual time=2440.310..2450.101 rows=5 loops=3)
              Filter: (('0103000020407100000100000005000000CFCA3EB32997F4402D3225A6F0D02041DDFD612B4A5F0141D66C69E40CCD20415E0E6F193D580141AE7BECF122511C412C99A20E8F48F440E6B3764403591C41CFCA3EB32997F4402D3225A6F0D02041'::geometry ~ st_geometryfromtext((doc ->> 'geometry'::text))) AND _st_contains('0103000020407100000100000005000000CFCA3EB32997F4402D3225A6F0D02041DDFD612B4A5F0141D66C69E40CCD20415E0E6F193D580141AE7BECF122511C412C99A20E8F48F440E6B3764403591C41CFCA3EB32997F4402D3225A6F0D02041'::geometry, st_geometryfromtext((doc ->> 'geometry'::text))))
              Rows Removed by Filter: 221313
Planning Time: 0.375 ms
Execution Time: 2496.786 ms

This shows that the primary key constraint index is used to scan all rows and perform the spatial filter on each row, which is obviously very inefficient. There are more than 5M matches for the given spatial predicate. The GiST index is not used at all.

However, when leaving out the ORDER BY clause, the GiST index for the geometry property is properly used, which is far more efficient.

Limit  (cost=0.42..128.90 rows=10 width=633) (actual time=0.381..0.745 rows=10 loops=1)
  ->  Index Scan using demo_doc_geometry_gist on demo  (cost=0.42..179352.99 rows=13960 width=633) (actual time=0.380..0.742 rows=10 loops=1)
        Index Cond: ('0103000020407100000100000005000000CFCA3EB32997F4402D3225A6F0D02041DDFD612B4A5F0141D66C69E40CCD20415E0E6F193D580141AE7BECF122511C412C99A20E8F48F440E6B3764403591C41CFCA3EB32997F4402D3225A6F0D02041'::geometry ~ st_geometryfromtext((doc ->> 'geometry'::text)))
        Filter: _st_contains('0103000020407100000100000005000000CFCA3EB32997F4402D3225A6F0D02041DDFD612B4A5F0141D66C69E40CCD20415E0E6F193D580141AE7BECF122511C412C99A20E8F48F440E6B3764403591C41CFCA3EB32997F4402D3225A6F0D02041'::geometry, st_geometryfromtext((doc ->> 'geometry'::text)))
Planning Time: 0.245 ms
Execution Time: 0.780 ms

Is there a way to make this query fast? Can we let the query planner combine the GiST index with the PK index to get a sorted result? Any other suggestions?

score 3 · Accepted Answer · answered Jul 28 '20 at 16:01

This shows that the primary key constraint index is used to scan all rows

It doesn't scan all rows, it stops after finding 10 of them which match. This would appear to be about 221313 * 3 + 10 rows, or about 1.6% of the total rows. It is not obvious that this is the wrong thing to do. You can suppress the usage of the primary key index by changing to ORDER BY seqno+0. This should use the GiST index, but I would not count on this being faster.

However, when leaving out the ORDER BY clause, the GiST index for the geometry property is properly used, which is far more efficient.

But it answers a far simpler question. Consider the difference between "find me 5 random people from Chicago" and "find me the 5 tallest people in Chicago".

As for making the query faster, I would try the ORDER BY seqno+0 trick. I don't think it will be faster, but I could be wrong.

I would also try a btree index on (seqno, doc) so you can get an index-only-scan, although this would be much better if your geometry was in its own column, not embedded in JSONB, so you could index just the seqno and the geometry rather than the whole JSONB. In theory PostgreSQL could give you an index only scan for an index on (seqno, ST_GeometryFromText(doc->>'geometry')), but it just isn't clever enough to realize this.

You could also try a multi-column GiST index on (seqno, ST_GeometryFromText(doc->>'geometry')) using the btree_gist extension to enable the inclusion of seqno.

Finally, you could try range partitioning your table on seqno. This would require a reorganization of your dataset, so isn't as simple as just building an index.

Thanks for your suggestion!.I've ended up introducing an extra column "bbox", containing a redundant copy of the geometry, but only the bounding box (the original geometry would not fit into a btree index). Then I've created a btree index for `(seqno, doc)`, enabling index-only scans as you suggested. This resulted in a major speed improvement (<1sec). The drawback of this approach is that the filtering is less accurate (it compares against the bounding box), but in my case this is sufficient. — JFX, Jul 29 '20 at 13:27
Sorry, I meant a btree index for `(seqno, bbox)`. The bbox column has type "geometry". — JFX, Jul 29 '20 at 14:23

score 0 · Answer 2 · answered Jul 28 '20 at 11:57

You can try to include the bounding box overlap operator ~ in the query, as the doc says

This operand will make use of any indexes that may be available on the geometries.

SELECT seqno, doc
FROM demo
WHERE ST_GeometryFromText((doc->>'geometry')) ~ ST_GeometryFromText('POLYGON((4.478054829251019 52.61266886732067,5.247097798001019 52.61266886732067,5.247097798001019 52.156694555984416,4.478054829251019 52.156694555984416,4.478054829251019 52.61266886732067))')
  AND ST_Within(ST_GeometryFromText((doc->>'geometry')), ST_GeometryFromText('POLYGON((4.478054829251019 52.61266886732067,5.247097798001019 52.61266886732067,5.247097798001019 52.156694555984416,4.478054829251019 52.156694555984416,4.478054829251019 52.61266886732067))'))
ORDER BY seqno
LIMIT 10

Otherwise, you can run the query without the limit clause and with an offset of 0 to prevent inlining the subquery, then apply the limit.

SELECT * FROM (
  SELECT seqno, doc
  FROM demo
  WHERE ST_Within(ST_GeometryFromText((doc->>'geometry')), 
                  ST_GeometryFromText('POLYGON((4.478054829251019 52.61266886732067,5.247097798001019 52.61266886732067,5.247097798001019 52.156694555984416,4.478054829251019 52.156694555984416,4.478054829251019 52.61266886732067))')
  OFFSET 0
) sub
ORDER BY seqno
LIMIT 10

I've tried both suggestions, but unfortunately they did not result in any speed improvements. Thanks anyway! — JFX, Jul 29 '20 at 13:28

Sorting a large spatial selection is not using GiST index (Postgres 11.5)

2 Answers2