Spatial query on large table with multiple self joins performing slow

Question

I am working on queries on a large table in Postgres 9.3.9. It is a spatial dataset and it is spatially indexed. Say, I have need to find 3 types of objects: A, B and C. The criteria is that B and C are both within certain distance of A, say 500 meters.

My query is like this:

select 
  school.osm_id as school_osm_id, 
  school.name as school_name, 
  school.way as school_way, 
  restaurant.osm_id as restaurant_osm_id, 
  restaurant.name as restaurant_name, 
  restaurant.way as restaurant_way, 
  bar.osm_id as bar_osm_id, 
  bar.name as bar_name, 
  bar.way as bar_way 
from (
    select osm_id, name, amenity, way, way_geo 
    from planet_osm_point 
    where amenity = 'school') as school, 
   (select osm_id, name, amenity, way, way_geo 
    from planet_osm_point 
    where amenity = 'restaurant') as restaurant, 
   (select osm_id, name, amenity, way, way_geo 
    from planet_osm_point 
    where amenity = 'bar') as bar 
where ST_DWithin(school.way_geo, restaurant.way_geo, 500, false) 
  and ST_DWithin(school.way_geo, bar.way_geo, 500, false);

This query gives me what I want, but it takes really long time, like 13 seconds to execute. I'm wondering if there is another way to write the query and make it more efficient.

Query plan:

Nested Loop  (cost=74.43..28618.65 rows=1 width=177) (actual time=33.513..11235.212 rows=10591 loops=1)
   Buffers: shared hit=530967 read=8733
   ->  Nested Loop  (cost=46.52..28586.46 rows=1 width=174) (actual time=31.998..9595.212 rows=4235 loops=1)
         Buffers: shared hit=389863 read=8707
         ->  Bitmap Heap Scan on planet_osm_point  (cost=18.61..2897.83 rows=798 width=115) (actual time=7.862..150.607 rows=8811 loops=1)
               Recheck Cond: (amenity = 'school'::text)
               Buffers: shared hit=859 read=5204
               ->  Bitmap Index Scan on idx_planet_osm_point_amenity  (cost=0.00..18.41 rows=798 width=0) (actual time=5.416..5.416 rows=8811 loops=1)
                     Index Cond: (amenity = 'school'::text)
                     Buffers: shared hit=3 read=24
         ->  Bitmap Heap Scan on planet_osm_point planet_osm_point_1  (cost=27.91..32.18 rows=1 width=115) (actual time=1.064..1.069 rows=0 loops=8811)
               Recheck Cond: ((way_geo && _st_expand(planet_osm_point.way_geo, 500::double precision)) AND (amenity = 'restaurant'::text))
               Filter: ((planet_osm_point.way_geo && _st_expand(way_geo, 500::double precision)) AND _st_dwithin(planet_osm_point.way_geo, way_geo, 500::double precision, false))
               Rows Removed by Filter: 0
               Buffers: shared hit=389004 read=3503
               ->  BitmapAnd  (cost=27.91..27.91 rows=1 width=0) (actual time=1.058..1.058 rows=0 loops=8811)
                     Buffers: shared hit=384528 read=2841
                     ->  Bitmap Index Scan on idx_planet_osm_point_waygeo  (cost=0.00..9.05 rows=137 width=0) (actual time=0.193..0.193 rows=64 loops=8811)
                           Index Cond: (way_geo && _st_expand(planet_osm_point.way_geo, 500::double precision))
                           Buffers: shared hit=146631 read=2841
                     ->  Bitmap Index Scan on idx_planet_osm_point_amenity  (cost=0.00..18.41 rows=798 width=0) (actual time=0.843..0.843 rows=6291 loops=8811)
                           Index Cond: (amenity = 'restaurant'::text)
                           Buffers: shared hit=237897
   ->  Bitmap Heap Scan on planet_osm_point planet_osm_point_2  (cost=27.91..32.18 rows=1 width=115) (actual time=0.375..0.383 rows=3 loops=4235)
         Recheck Cond: ((way_geo && _st_expand(planet_osm_point.way_geo, 500::double precision)) AND (amenity = 'bar'::text))
         Filter: ((planet_osm_point.way_geo && _st_expand(way_geo, 500::double precision)) AND _st_dwithin(planet_osm_point.way_geo, way_geo, 500::double precision, false))
         Rows Removed by Filter: 1
         Buffers: shared hit=141104 read=26
         ->  BitmapAnd  (cost=27.91..27.91 rows=1 width=0) (actual time=0.368..0.368 rows=0 loops=4235)
               Buffers: shared hit=127019
               ->  Bitmap Index Scan on idx_planet_osm_point_waygeo  (cost=0.00..9.05 rows=137 width=0) (actual time=0.252..0.252 rows=363 loops=4235)
                     Index Cond: (way_geo && _st_expand(planet_osm_point.way_geo, 500::double precision))
                     Buffers: shared hit=101609
               ->  Bitmap Index Scan on idx_planet_osm_point_amenity  (cost=0.00..18.41 rows=798 width=0) (actual time=0.104..0.104 rows=779 loops=4235)
                     Index Cond: (amenity = 'bar'::text)
                     Buffers: shared hit=25410
 Total runtime: 11238.605 ms

I'm only using one table at the moment with 1,372,711 rows. It has 73 columns:

       Column       |         Type         |       Modifiers
--------------------+----------------------+---------------------------
 osm_id             | bigint               | 
 access             | text                 | 
 addr:housename     | text                 | 
 addr:housenumber   | text                 | 
 addr:interpolation | text                 | 
 admin_level        | text                 | 
 aerialway          | text                 | 
 aeroway            | text                 | 
 amenity            | text                 | 
 area               | text                 | 
 barrier            | text                 | 
 bicycle            | text                 | 
 brand              | text                 | 
 bridge             | text                 | 
 boundary           | text                 | 
 building           | text                 | 
 capital            | text                 | 
 construction       | text                 | 
 covered            | text                 | 
 culvert            | text                 | 
 cutting            | text                 | 
 denomination       | text                 | 
 disused            | text                 | 
 ele                | text                 | 
 embankment         | text                 | 
 foot               | text                 | 
 generator:source   | text                 | 
 harbour            | text                 | 
 highway            | text                 | 
 historic           | text                 | 
 horse              | text                 | 
 intermittent       | text                 | 
 junction           | text                 | 
 landuse            | text                 | 
 layer              | text                 | 
 leisure            | text                 | 
 lock               | text                 | 
 man_made           | text                 | 
 military           | text                 | 
 motorcar           | text                 | 
 name               | text                 | 
 natural            | text                 | 
 office             | text                 | 
 oneway             | text                 | 
 operator           | text                 | 
 place              | text                 | 
 poi                | text                 | 
 population         | text                 | 
 power              | text                 | 
 power_source       | text                 | 
 public_transport   | text                 | 
 railway            | text                 | 
 ref                | text                 | 
 religion           | text                 | 
 route              | text                 | 
 service            | text                 | 
 shop               | text                 | 
 sport              | text                 | 
 surface            | text                 | 
 toll               | text                 | 
 tourism            | text                 | 
 tower:type         | text                 | 
 tunnel             | text                 | 
 water              | text                 | 
 waterway           | text                 | 
 wetland            | text                 | 
 width              | text                 | 
 wood               | text                 | 
 z_order            | integer              | 
 tags               | hstore               | 
 way                | geometry(Point,4326) | 
 way_geo            | geography            | 
 gid                | integer              | not null default nextval('...
Indexes:
    "planet_osm_point_pkey1" PRIMARY KEY, btree (gid)
    "idx_planet_osm_point_amenity" btree (amenity)
    "idx_planet_osm_point_waygeo" gist (way_geo)
    "planet_osm_point_index" gist (way)
    "planet_osm_point_pkey" btree (osm_id)

There are 8811, 6291, 779 rows in amenity school, restaurant and bar respectively.

Since you are cross joining the same table 3 times, you are probably getting 3 or more results of the same records. How many records are in table1 and how many records do you expect to get in return with your matches? That can tell us if 13 seconds is too long. Some how without a complete join syntax, it is hard to believe that "this query gives me what I want". — Emacs User, Jul 17 '15 at 01:03
Your edit improves things, but you did not go all the way. Essential details are still missing. Consider **[instructions here](http://stackoverflow.com/tags/postgresql-performance/info)**. Click the link and read it. Plus, in particular: how many rows (roughly) with type `A`, `B`, `C`? Provide the output of `\d table1` in psql (without irrelevant columns), output of `EXPLAIN (ANALYZE, BUFFERS)` and all relevant indexes. — Erwin Brandstetter, Jul 17 '15 at 22:26
@ErwinBrandstetter I changed my query. The table is sparse. If I create sub tables for example, I create a table that only has record with none null value for amenity, etc., it would speed up the process. but the problem is I have 73 columns and I probably need to create around 70 new tables... — al3xtouch, Jul 20 '15 at 02:35
@ErwinBrandstetter The original table only has way, which is a geometry, but I need to use meters in the buffer so I created a geography column way_geo, but eventually I need to latitude and longitude, so I guess display either geometry or geography is fine. the query could be any thing, or even more than three things (two things minimum), and each one could be any of the columns, except for the id, primary key, geography and geometry. Yes, the table is very sparse, many of them are null, but I think there is at least one entry for each column — al3xtouch, Jul 20 '15 at 04:27
"any of the columns" ... If it was just the column `amenity` I would have some ideas to make this fast, but any of about 70 columns is very hard to index. I think you have a DB design problem, like you already hinted yourself. Another issue: what are you looking for in the result *exactly*? The current query returns a *cross join* of all restaurants and bars close enough to the same school. For a school with 10 rest. & 10 bars nearby, you get 100 rows. If you combine 2 more items with 10 hits each, that's already 10.000 rows. Is that really what you want? — Erwin Brandstetter, Jul 20 '15 at 04:38
I think you really want a list of schools that have at least one bar and one restaurant nearby, possibly followed by a list of qualifying restaurants and bars (not a Cartesian product of all finds) for each. — Erwin Brandstetter, Jul 20 '15 at 04:45
@ErwinBrandstetter yes, for this example, I want a list of schools that have restaurants and bars within 500 meters and I need the coordinates of each school and its corresponding restaurants and bars. But the query is not confined to amenity, for instance I could also use highway, so it could be amenity='school', highway='bus_stop', so I'm finding all the schools that has bus_stops within 500 meters. How could I rewrite the query? or do you have any suggestions for the database design? — al3xtouch, Jul 20 '15 at 04:59
Is `osm_id` *unique*? Or do we need `gid` to uniquely identify a row? — Erwin Brandstetter, Jul 20 '15 at 05:29

score 4 · Accepted Answer · edited May 23 '17 at 12:17

This query should go a long way (be much faster):

WITH school AS (
   SELECT s.osm_id AS school_id, text 'school' AS type, s.osm_id, s.name, s.way_geo
   FROM   planet_osm_point s
        , LATERAL (
      SELECT  1 FROM planet_osm_point
      WHERE   ST_DWithin(way_geo, s.way_geo, 500, false)
      AND     amenity = 'bar'
      LIMIT   1  -- bar exists -- most selective first if possible
      ) b
        , LATERAL (
      SELECT  1 FROM planet_osm_point
      WHERE   ST_DWithin(way_geo, s.way_geo, 500, false)
      AND     amenity = 'restaurant'
      LIMIT   1  -- restaurant exists
      ) r
   WHERE  s.amenity = 'school'
   )
SELECT * FROM (
   TABLE school  -- schools

   UNION ALL  -- bars
   SELECT s.school_id, 'bar', x.*
   FROM   school s
        , LATERAL (
      SELECT  osm_id, name, way_geo
      FROM    planet_osm_point
      WHERE   ST_DWithin(way_geo, s.way_geo, 500, false)
      AND     amenity = 'bar'
      ) x

   UNION ALL  -- restaurants
   SELECT s.school_id, 'rest.', x.*
   FROM   school s
        , LATERAL (
      SELECT  osm_id, name, way_geo
      FROM    planet_osm_point
      WHERE   ST_DWithin(way_geo, s.way_geo, 500, false)
      AND     amenity = 'restaurant'
      ) x
   ) sub
ORDER BY school_id, (type <> 'school'), type, osm_id;

This is not the same as your original query, but rather what you actually want, as per discussion in comments:

I want a list of schools that have restaurants and bars within 500 meters and I need the coordinates of each school and its corresponding restaurants and bars.

So this query returns a list of those schools, followed by bars and restaurants nearby. Each set of rows is held together by the osm_id of the school in the column school_id.

Now using LATERAL joins, to make use of the spatial GiST index.

TABLE school is just shorthand for SELECT * FROM school:

Is there a shortcut for SELECT * FROM in psql?

The expression (type <> 'school') orders the school in each set first, because:

SQL select query order by day and month

The subquery sub in the final SELECT is only needed to order by this expression. A UNION query limits an attached ORDER BY list to only columns, no expressions.

I focus on the query you presented for the purpose of this answer - ignoring the extended requirement to filter on any of the other 70 text columns. That's really a design flaw. The search criteria should be concentrated in few columns. Or you'll have to index all 70 columns, and multicolumn indexes like I am going to propose are hardly an option. Still possible though ...

Index

In addition to the existing:

"idx_planet_osm_point_waygeo" gist (way_geo)

If always filtering on the same column, you could create a multicolumn index covering the few columns you are interested in, so index-only scans become possible:

CREATE INDEX planet_osm_point_bar_idx ON planet_osm_point (amenity, name, osm_id)

Postgres 9.5

The upcoming Postgres 9.5 introduces major improvements that happen to address your case exactly:

Allow queries to perform accurate distance filtering of bounding-box-indexed objects (polygons, circles) using GiST indexes (Alexander Korotkov, Heikki Linnakangas)

Previously, a common table expression was required to return a large number of rows ordered by bounding-box distance, and then filtered further with a more accurate non-bounding-box distance calculation.

Allow GiST indexes to perform index-only scans (Anastasia Lubennikova, Heikki Linnakangas, Andreas Karlsson)

That's of particular interest for you. Now you can have a single multicolumn (covering) GiST index:

CREATE INDEX reservations_range_idx ON reservations
USING gist(amenity, way_geo, name, osm_id)

And:

Improve bitmap index scan performance (Teodor Sigaev, Tom Lane)

And:

Add GROUP BY analysis functions GROUPING SETS, CUBE and ROLLUP (Andrew Gierth, Atri Sharma)

Why? Because ROLLUP would simplify the query I suggested. Related answer:

Grouping() equivalent in PostgreSQL?

The first alpha version has been released on July 2, 2015. The expected timeline for the release:

This is the alpha release of version 9.5, indicating that some changes to features are still possible before release. The PostgreSQL Project will release 9.5 beta 1 in August, and then periodically release additional betas as required for testing until the final release in late 2015.

Basics

Of course, be sure not to overlook the basics:

Slow Query Questions page on the PostgreSQL Wiki

I changed b.amenity to amenity. I got this error: ERROR: ERROR: invalid UNION/INTERSECT/EXCEPT ORDER BY clause LINE 32: ORDER BY school_id, (type <>'school'), type, osm_id; ^ DETAIL: Only result column names can be used, not expressions or functions. HINT: Add the expression/function to every SELECT, or move the UNION into a FROM clause. — al3xtouch, Jul 20 '15 at 05:48
@al3xtouch: Fixed that bug. It was the same as here: http://stackoverflow.com/a/31211776/939860. (I wrote this query from the top of my head without testing) — Erwin Brandstetter, Jul 20 '15 at 05:59
Under the second UNION ALL, x should be b right? I still got the error that points to the last line, order by, it points to the word 'type', and it says DETAIL: Only result column names can be used, not expressions or functions. HINT: Add the expression/function to every SELECT, or move the UNION into a FROM clause. — al3xtouch, Jul 20 '15 at 06:06
it gives me the same error, pointing to the 'type' in (type <> 'school'), with the same detail and hint... — al3xtouch, Jul 20 '15 at 06:21
Thanks for the answer, I got messages while running the query, "NOTICE: gserialized_gist_joinsel: jointype 4 not supported" for times, do you know what does it mean? — al3xtouch, Jul 20 '15 at 20:57
@al3xtouch: Obviously means we have to rewrite one of the joins to make it use the GiST index: http://trac.osgeo.org/postgis/ticket/2342 Anyway: Is it any faster so far? (obviously it can be faster, still) — Erwin Brandstetter, Jul 20 '15 at 21:04
yes, it's now 9 seconds. Patrick's answer is slightly faster, but also around 9 seconds. Thank you very much for the answer and explanation and suggestions. — al3xtouch, Jul 20 '15 at 21:19
@al3xtouch: Do you get any notices when only running the query inside the CTE school (without all the rest)? — Erwin Brandstetter, Jul 20 '15 at 21:25
Did I get that right that you only want schools where *at least* one bar and *at least* one rest. is nearby, not just one of them? Also, is the updated query with `LATERAL` faster (fewer notices?) — Erwin Brandstetter, Jul 20 '15 at 21:38
yes, this one is faster, 8 seconds, but I still got the same messages, four notices. yes I want schools where at least one bar and at least one rest. is nearby, so Patrick's answer may give me results I do not want. — al3xtouch, Jul 20 '15 at 21:46
@al3xtouch: OK, I am pretty sure spatial indexes do not support the `EXISTS` semi-join I had in the CTE (while those are typically dynamite in other contexts). inner joins (including the lateral) are supported. Do you still see notices? You could combine the new CTE with the original rest, but I guess all lateral is fastest ... — Erwin Brandstetter, Jul 20 '15 at 22:25

Patrick · Answer 2 · 2015-07-20T05:27:16.007

1

The 3 sub-selects that you use are very inefficient. Write them as LEFT JOIN clauses and the query should be much more efficient:

SELECT
  school.osm_id AS school_osm_id, 
  school.name AS school_name, 
  school.way AS school_way, 
  restaurant.osm_id AS restaurant_osm_id, 
  restaurant.name AS restaurant_name, 
  restaurant.way AS restaurant_way, 
  bar.osm_id AS bar_osm_id, 
  bar.name AS bar_name, 
  bar.way AS bar_way 
FROM planet_osm_point school
LEFT JOIN planet_osm_point restaurant ON restaurant.amenity = 'restaurant' AND
                               ST_DWithin(school.way_geo, restaurant.way_geo, 500, false) 
LEFT JOIN planet_osm_point bar ON bar.amenity = 'bar' AND
                               ST_DWithin(school.way_geo, bar.way_geo, 500, false)
WHERE school.amenity = 'school'
  AND (restaurant.osm_id IS NOT NULL OR bar.osm_id IS NOT NULL);

But this will give too many results if you have multiple restaurants and bars per school. You can simplify the query like this:

SELECT
  school.osm_id AS school_osm_id, 
  school.name AS school_name, 
  school.way AS school_way, 
  a.osm_id AS amenity_osm_id, 
  a.amenity AS amenity_type,
  a.name AS amenity_name, 
  a.way AS amenity_way, 
FROM planet_osm_point school
JOIN planet_osm_point a ON ST_DWithin(school.way_geo, a.way_geo, 500, false) 
WHERE school.amenity = 'school'
  AND a.amenity IN ('bar', 'restaurant');

This will give every bar and restaurant for each school. Schools without either restaurant or bar within 500m are not listed.

edited Jul 20 '15 at 05:27

answered Jul 17 '15 at 01:48

Patrick

29,357
6
62
90

We *can* put *any* valid expression into the join clause. This burns down to the same query plan as the other two answers. – Erwin Brandstetter Jul 17 '15 at 01:51
I know that we *can*, but that doesn't mean that we *should*. `JOIN ... ON b.type = 'B'` does not join anything to anything at all. – Patrick Jul 17 '15 at 01:54
Not saying we *should*. Queries should make sense to the human eye as well. Still doesn't make a difference for the query plan. Here's why: http://stackoverflow.com/a/27420244/939860 - with authoritative quotes form and links to the manual, so you don't have to take my word for it. – Erwin Brandstetter Jul 17 '15 at 02:04
Thanks for the answer, I tried your query, it does not really make any difference – al3xtouch Jul 20 '15 at 02:13
@Patrick I tried this query, but it looks like this one gives more results and some of them are not correct. because for some results, it only has school id, name and way, the restaurant and bar are null, could you check the query again? – al3xtouch Jul 20 '15 at 04:33
Added another `WHERE` condition that requires that at least restaurant or bar data is available; see last line of query. – Patrick Jul 20 '15 at 05:16
And a new edit to eliminate doubles and make the query more flexible – Patrick Jul 20 '15 at 05:28

score 0 · Answer 3 · answered Jul 17 '15 at 01:15

0

Does it make any difference if you use explicit joins?

SELECT a.id as a_id, a.name as a_name, a.geog as a_geog,
       b.id as b_id, b.name as b_name, b.geog as b_geog,
       c.id as c_id, c.name as c_name, c.geog as c_geog
FROM table1 a
JOIN table1 b ON b.type = 'B' AND ST_DWithin(a.geog, b.geog, 100)
JOIN table1 c ON c.type = 'C' AND ST_DWithin(a.geog, c.geog, 100)
WHERE a.type = 'A';

answered Jul 17 '15 at 01:15

Mike T

41,085
18
152
203

It shouldn't make a difference to Postgres. Conditions for plain inner joins and WHERE conditions are all treated the same. – Erwin Brandstetter Jul 17 '15 at 01:18
it makes no difference – al3xtouch Jul 20 '15 at 02:17

Emacs User · Answer 4 · 2015-07-17T01:32:42.343

0

Try this with inner join syntax and compare the results, there should be no duplicates. My guess is it should take 1/3rd the time or better than the original query :

select a.id as a_id, a.name as a_name, a.geog as a_geo,
       b.id as b_id, b.name as b_name, b.geog as b_geo,
       c.id as c_id, c.name as c_name, c.geog as c_geo
from table1 as a
INNER JOIN table1 as b on b.type='B'
INNER JOIN table1 as c on c.type='C'
WHERE a.type='A' and
     (ST_DWithin(a.geo, b.geo, 100) and ST_DWithin(a.geo, c.geo, 100))

edited Jul 17 '15 at 01:32

answered Jul 17 '15 at 01:19

Emacs User

1,457
1
12
19

Now it works, but it burns down to the same query plan as the original query or what Mike already posted. The OP really needs to provide more input. – Erwin Brandstetter Jul 17 '15 at 01:47
@ErwinBrandstetter, I agree. – Emacs User Jul 17 '15 at 04:14
it does not make any difference – al3xtouch Jul 20 '15 at 02:16

Spatial query on large table with multiple self joins performing slow

4 Answers4

Index

Postgres 9.5

Basics

Linked