2

I'm running a query to count the number of points falling in each polygon of a table. The point layer (twitter) has ~500 million rows and the polygon layer (us_urbanareas) has ~3,000 rows. The point layer's geometry column is indexed.

The query looks like:

EXPLAIN SELECT us_urbanareas.name, us_urbanareas.gid, count(twitter.coords) AS total
FROM us_urbanareas LEFT JOIN twitter
ON st_contains(ST_Transform(us_urbanareas.geom,4326),twitter.coords)
GROUP BY us_urbanareas.gid LIMIT 500

and the EXPLAIN output looks like:

"Limit  (cost=1265.59..47875481.71 rows=500 width=45)"
"  ->  GroupAggregate  (cost=1265.59..342780653.01 rows=3580 width=45)"
"        ->  Nested Loop Left Join  (cost=1265.59..340247956.29 rows=506532183 width=45)"
"              ->  Index Scan using us_urbanareas_pkey on us_urbanareas  (cost=0.28..1000.18 rows=3580 width=4028)"
"              ->  Bitmap Heap Scan on twitter  (cost=1265.31..94899.56 rows=14149 width=32)"
"                    Recheck Cond: (st_transform(us_urbanareas.geom, 4326) && coords)"
"                    Filter: _st_contains(st_transform(us_urbanareas.geom, 4326), coords)"
"                    ->  Bitmap Index Scan on coord_gist  (cost=0.00..1261.77 rows=42447 width=0)"
"                          Index Cond: (st_transform(us_urbanareas.geom, 4326) && coords)"

The query works, but it is fairly slow. The notice I receive is:

NOTICE:  gserialized_gist_joinsel: jointype 1 not supported

The db server that I'm working on has loads of memory (96 GB), and we've just been through a process of optimizing its configuration to make the most of it. I'm running Postgres 9.3.2.

Any thoughts on how to optimize the query and get around the "jointype not supported" notice?

Update:

1. Updated the SRID of the us_urbanareas table to remove the transform.
2. Removed the left join and rewrote the query as per the suggestion below.
3. Tried setting enable_bitmapscan = off.

Turning off the bitmap scan drastically reduces the cost in the EXPLAIN output, but doubles the query time (tested using a few random polygons).

"Limit  (cost=0.83..50601686.53 rows=500 width=45)"
"  ->  GroupAggregate  (cost=0.83..362308070.41 rows=3580 width=45)"
"        ->  Nested Loop  (cost=0.83..359737718.66 rows=514063189 width=45)"
"              ->  Index Scan using us_urbanareas_pkey on us_urbanareas us  (cost=0.28..1002.76 rows=3580 width=45)"
"              ->  Index Scan using coord_gist on twitter tw  (cost=0.55..100341.53 rows=14359 width=32)"
"                    Index Cond: (us.geom && coords)"
"                    Filter: _st_contains(us.geom, coords)"

Update 4: executing

select * from pg_stat_activity;

returns:

"INSERT INTO twitter(tweet,name,handle,location,coords,time) VALUES ($$T :($$,$$anissa aguilar$$,$$babyniss_$$,$$Abq/Rio, NM $$,ST_GeomFromText('POINT(-106.659914 35.23192)', (...)"
"SELECT version();"
"SELECT t.oid, t.xmin, t.*, relname, CASE WHEN relkind = 'r' THEN TRUE ELSE FALSE END AS parentistable,   nspname, des.description, l.lanname, p.prosrc, 
  substring(pg_get_triggerdef(t.oid), 'WHEN (.*) EXECUTE PROCEDURE') AS whenclause
  FROM pg_trigger t
 (...)"
"select * from pg_stat_activity;"
" SELECT us.name, us.gid, count(tw.coords) AS total
   FROM us_urbanareas us,  twitter tw
WHERE st_contains(us.geom, tw.coords)
   GROUP BY us.gid;"
mikedotonline
  • Can you tell what indexes you have on those tables? And explain analyze might be better. – Jakub Kania Aug 13 '14 at 13:57
  • @JakubKania. He definitely has a spatial index, you can see that in the final line. And there is definitely an issue with using left join in this kind of query, as linked below: http://trac.osgeo.org/postgis/ticket/2342 – John Powell Aug 13 '14 at 14:48
  • @JohnBarça I was thinking maybe index on ST_Transform would help. – Jakub Kania Aug 13 '14 at 19:25
  • @JakubKania. I was suggesting removing the ST_Transform altogether. I didn't realize you could index ST_Transform, but I agree, maybe worth indexing both geometries, though in general, as us_urbanareas has so few rows, a full scan will be as fast anyway, I would think. – John Powell Aug 13 '14 at 19:36
  • very cool technique! however, as per the suggestion below, I switched the SRID of us_urbanareas to the same SRID as twitter. Unfortunately not much improvement – mikedotonline Aug 13 '14 at 21:00

2 Answers

5

Try using the Cartesian product syntax (sometimes loosely called a full join), which in Postgres can be written as just a comma in the FROM list. I know this sounds counterintuitive, but the spatial index on the larger table will take care of which polygons are actually included via ST_Contains. I personally find the notion of a LEFT JOIN unnatural when you are dealing with spatial intersections, and this becomes more obvious when you have multiple spatial joins: there is no natural correlate of left or right joins when you are dealing with intersection/containment in two-dimensional space.

EXPLAIN SELECT us.name, us.gid, count(tw.coords) AS total
   FROM us_urbanareas us,  twitter tw
WHERE st_contains(ST_Transform(us.geom,4326),tw.coords)
   GROUP BY us.gid LIMIT 500;

Another thing you can do that might speed things up a little is to convert us_urbanareas to SRID 4326 before running the query. It might not make a difference on such a small table, but in general, avoiding transforms on fields that are part of the join helps the optimizer and is one less operation to perform -- it can make a very big difference on joins involving tables with many rows on both sides. You can do this with the UpdateGeometrySRID function, which will also update the metadata view, geometry_columns.

For example, in your case:

Select UpdateGeometrySrid('us_urbanareas', 'geom', 4326);
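You can confirm the change took effect by querying the metadata (table and column names as in the question; this is just a quick sanity check):

```sql
-- Check the SRID now recorded for us_urbanareas.geom
SELECT Find_SRID('public', 'us_urbanareas', 'geom');

-- Or inspect the metadata view directly
SELECT f_table_name, f_geometry_column, srid
FROM geometry_columns
WHERE f_table_name = 'us_urbanareas';
```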

After which, your query will simplify to:

SELECT us.name, us.gid, count(tw.coords) AS total
   FROM us_urbanareas us,  twitter tw
WHERE st_contains(us.geom, tw.coords)
   GROUP BY us.gid LIMIT 500;
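As a side note, since you are on 9.3, which introduced LATERAL, another shape worth trying for this count-per-polygon pattern is a correlated subquery, which encourages the planner to run one index-driven count per polygon rather than one enormous join followed by a GroupAggregate. A sketch, untested against your data:

```sql
-- One indexed point-count per polygon, driven from the small table
SELECT us.name, us.gid, tw.total
FROM us_urbanareas us
CROSS JOIN LATERAL (
    SELECT count(*) AS total
    FROM twitter t
    WHERE ST_Contains(us.geom, t.coords)
) tw
LIMIT 500;
```

Whether this beats the comma-join plan will depend on your data distribution, so compare both with EXPLAIN ANALYZE on a handful of polygons first.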

I assume you have seen this, but there is an exchange between some of the PostGIS devs about this exact message and how the left join syntax can "confuse" the Postgres query optimizer, so that even though the query runs, it might have a suboptimal plan, exactly as you have stated in your question. See http://trac.osgeo.org/postgis/ticket/2342

EDIT: While I suspect that removing the ST_Transform and rewriting the join condition will both get rid of the notice and speed up the query, there are other things you can do to force the optimizer to choose certain plans. It is a bit of a black art, and generally not recommended, but spatial queries are a bit different in that the complexity of the geometrical operations often greatly outweighs the cost of fetching from disk -- which is, in essence, what the cost units in EXPLAIN estimate. You can try enabling or disabling various plan node types, such as sequential scans (enable_seqscan) or hash joins (enable_hashjoin); see the query planner configuration variables, eg,

SET enable_seqscan=off;

But, I repeat, this is something of a black art and generally the optimizer "knows" best; that said, I have cut query times by orders of magnitude by disabling particular plan types.
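These settings are session-local, so a safe way to experiment (using the rewritten query from above) is to toggle one setting, compare the plan, and restore it:

```sql
-- Disable sequential scans for this session only, compare the plan, then restore
SET enable_seqscan = off;

EXPLAIN ANALYZE
SELECT us.name, us.gid, count(tw.coords) AS total
  FROM us_urbanareas us, twitter tw
 WHERE ST_Contains(us.geom, tw.coords)
 GROUP BY us.gid LIMIT 500;

RESET enable_seqscan;
```

EXPLAIN ANALYZE actually executes the query, so on a 500M-row table you may want to test against single polygons first.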

EDIT 2: Having looked again at your query, I am puzzled by the need for the bitmap heap scan in there. I just ran a similar query, which finds the area of intersection of a grid of tiles, containing density counts, and some polygons, and groups by the polygon ID, so very analogous. My query looks like:

SELECT sum(ceil(st_area(st_intersection(ww.geom, dens.geom))/1000000 * osmm_add)), ww.gid 
FROM density_2010 dens, Wales_west_al2 ww 
WHERE st_intersects(dens.geom, ww.geom) GROUP BY ww.gid;

and my explain,

GroupAggregate  (cost=0.00..122642.68 rows=52429 width=552)
  ->  Nested Loop  (cost=0.00..108093.63 rows=52429 width=552)
       Join Filter: _st_intersects(dens.geom, ww.geom)
         ->  Index Scan using wales_west_al2_pkey on wales_west_al2 ww  (cost=0.00..6541.84 rows=124173 width=36)
         ->  Index Scan using ix_spatial_os_density on density_2010 dens  (cost=0.00..0.56 rows=1 width=516)
               Index Cond: (geom && ww.geom)

This is similar to yours, except for the absence of bitmap heap scans. Could you try setting:

set enable_bitmapscan=off;

and see if that makes any difference.

John Powell
  • I've taken a few steps to speed things up, as per your suggestions. 1. updated the SRID of us_urbanareas to 4326 2. added a spatial index to us_urbanareas 3. changed the query to remove the left join. The explain, however, actually shows a higher cost: 1275. (next comment for explain) – mikedotonline Aug 13 '14 at 19:05
  • Any noticeable difference so far? – John Powell Aug 13 '14 at 19:07
  • "Limit (cost=1275.34..48515419.07 rows=500 width=45)" " -> GroupAggregate (cost=1275.34..347362544.46 rows=3580 width=45)" " -> Nested Loop (cost=1275.34..344807953.63 rows=510911006 width=45)" " -> Index Scan using us_urbanareas_pkey on us_urbanareas us (cost=0.28..1001.52 rows=3580 width=45)" – mikedotonline Aug 13 '14 at 19:09
  • " -> Bitmap Heap Scan on twitter tw (cost=1275.06..96172.08 rows=14271 width=32)" " Recheck Cond: (us.geom && coords)" " Filter: _st_contains(us.geom, coords)" " -> Bitmap Index Scan on coord_gist (cost=0.00..1271.49 rows=42814 width=0)" " Index Cond: (us.geom && coords)" – mikedotonline Aug 13 '14 at 19:10
  • That is a bit hard to read :D. The error message has disappeared, presumably? Explain can be a bit misleading with spatial queries. I have run queries recently where the explains were identical, but one ran 10 times quicker than the other, just by using ST_Dump to convert a MultiPolygon into its constituent polygons. The proof will be in the run time. – John Powell Aug 13 '14 at 19:14
  • Thanks, sorry about the paste job - hard in the comments! Running a query now to test. fingers crossed. After this i'll try some of the more black arts. – mikedotonline Aug 13 '14 at 19:24
  • How long did your original query take, as a matter of interest? – John Powell Aug 13 '14 at 19:36
  • left the limit 500 going 36 hours+, no result. Single polys appear to take 280,000 ms (4.66 min) each. Now running a LIMIT 10 to see if it takes 50 min -- although I'm pretty sure it is caching results, so it might be off a bit for the counts it's already done. – mikedotonline Aug 13 '14 at 19:43
  • first ten records = 63.36min – mikedotonline Aug 13 '14 at 20:27
  • So, no improvement then. I'm curious about the bitmap heap scan, see http://stackoverflow.com/questions/6592626/bitmap-heap-scan for an explanation. I can't see any reason why that should be there, but it does suggest you have a table scan going on when the spatial index should be being used. Can you try doing set enable_bitmapscan=off and doing another explain? – John Powell Aug 14 '14 at 06:02
  • Updated answer with a similar query and explain and another suggestion. – John Powell Aug 14 '14 at 09:38
  • tried turning off the bitmapscan, explain and details above – mikedotonline Aug 14 '14 at 17:46
  • How odd. Is your table auto vacuuming or some such? What does select * from pg_stat_activity; show. – John Powell Aug 14 '14 at 19:06
  • updated above. only ongoing query i don't understand is the one starting: SELECT t.oid, t.xmin. I assume that updates the index? – mikedotonline Aug 14 '14 at 19:53
  • So, this is a live db with triggers and inserts going on? That might make getting coherent performance numbers harder. I am very surprised that the explain could improve and the performance deteriorate, all other things being equal. – John Powell Aug 14 '14 at 21:00
  • yeah, it's an 'always-on' system, pulling in data. The change in the explain was quite drastic though, from ~1200 to 0.06, so it makes me think it's not exactly a reliable estimate. – mikedotonline Aug 15 '14 at 18:11
  • No kidding; as I said, disk cost, not polygon complexity, is what you get from explain. You can also get some big improvements using ST_Simplify on your polygons if you are prepared to sacrifice a tiny bit of accuracy. – John Powell Aug 15 '14 at 18:37
  • Hm. I don't think your recommendation to switch to the comma syntax is equivalent. `LEFT JOIN` means that rows on the left table will not be removed if they have no match. A "full join" with the join condition in the `WHERE` clause is equivalent to an inner join, and PostgreSQL will optimize it as such. If you really need a `LEFT JOIN`, this won't give the same result set; rows from `us_urbanareas` that have no matching row in `twitter` will be removed. – jpmc26 Dec 12 '16 at 19:06
  • @jpmc26. Yes, I am aware that left join will return all rows to the left, irrespective of intersection. But, that is not what the OP asked for. When you are trying to find all those geometries from table a that intersect with all those in table b, the full join (comma syntax) is clearer, because the work is being done by the ST_Intersects function. – John Powell Dec 12 '16 at 19:24
1

I know this is quite old, but since I had a very similar problem I want to share my solution. What I did was to change the data model rather than the query: I added the ID of the containing polygon as an additional column (a foreign key) in the tweets table.

It's especially worth it if you will run this kind of query many times in your application; in that case it's easier to populate the column once at the beginning (or update it periodically) and then make the joins based on the id of us_urbanareas.
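A sketch of that approach, using the tables from the question (column and index names here are illustrative):

```sql
-- One-off: add a column pointing at the containing urban area
ALTER TABLE twitter ADD COLUMN urbanarea_gid integer;

-- Backfill it once (and re-run periodically for newly inserted rows)
UPDATE twitter t
   SET urbanarea_gid = us.gid
  FROM us_urbanareas us
 WHERE ST_Contains(us.geom, t.coords);

-- Index the new column so the counts become a plain aggregate
CREATE INDEX twitter_urbanarea_gid_idx ON twitter (urbanarea_gid);

-- The original query then needs no spatial predicate at all
SELECT us.name, us.gid, count(t.urbanarea_gid) AS total
  FROM us_urbanareas us
  LEFT JOIN twitter t ON t.urbanarea_gid = us.gid
 GROUP BY us.gid, us.name;
```

On a 500M-row table the backfill UPDATE will itself be expensive, so it may be worth doing it in batches, or only tagging new rows at insert time.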

user3484623