4

I'm working with a non-profit that is mapping out solar potential in the US. Needless to say, we have a ridiculously large PostgreSQL 9 database. Running a query like the one shown below is speedy until the order by line is uncommented, in which case the same query takes forever to run (185 ms without sorting compared to 25 minutes with). What steps should be taken to ensure this and other queries run in a more manageable and reasonable amount of time?

select  A.s_oid, A.s_id, A.area_acre, A.power_peak, A.nearby_city, A.solar_total 
from global_site A cross join na_utility_line B
where (A.power_peak between 1.0 AND  100.0)
and A.area_acre >= 500
and A.solar_avg >= 5.0
AND A.pc_num <= 1000
and (A.fips_level1 = '06'  AND A.fips_country = 'US' AND A.fips_level2 = '025')
and B.volt_mn_kv >= 69
and B.fips_code like '%US06%'
and B.status = 'active'
and ST_within(ST_Centroid(A.wkb_geometry), ST_Buffer((B.wkb_geometry), 1000))
--order by A.area_acre
offset 0 limit 11;
Chris Cashwell
  • How many rows are returned by the query? What is the output when you run the query through EXPLAIN ANALYSE in each case? – Dana the Sane May 03 '11 at 17:53
  • The first step in fixing the problem is taking a look at the EXPLAIN ANALYZE output for *both* versions of the query. If you can't read it (it's not very user-friendly), put it on http://explain.depesz.com/ and give us links to both plans. –  May 03 '11 at 18:10
  • The result of this specific query is only 11 rows. The output from the Explain Analyze WITHOUT the `order by` is basically identical to that of the one WITH, minus the 117117 ms being spent sorting the rows. – Chris Cashwell May 03 '11 at 18:21
  • Clearly there is something else going on, since it doesn't take 2 minutes to sort 11 rows! Again, please show us both explain analyze outputs. –  May 03 '11 at 19:45
  • Have the tables been analyzed? Use ANALYZE; or VACUUM ANALYZE; to do so. This will collect the statistics the planner needs to determine the best execution plan for your query. Also, have you tuned your postgresql.conf for your machine? – Eelke May 03 '11 at 20:20

7 Answers

5

The sort is not the problem. In fact, the CPU and memory cost of the sort is close to zero, since Postgres has a Top-N sort: the result set is scanned while maintaining a small sort buffer that holds only the top N rows.

select count(*) from (1 million row table)               -- 0.17 s
select * from (1 million row table) order by x limit 10; -- 0.18 s
select * from (1 million row table) order by x;          -- 1.80 s

So the Top-10 sort adds only about 10 ms over a plain count(*), versus a lot longer for a full sort. That's a very neat feature; I use it a lot.

OK, without EXPLAIN ANALYZE it's impossible to be sure, but my feeling is that the real problem is the cross join. Basically you're filtering the rows in both tables using:

where (A.power_peak between 1.0 AND  100.0)
and A.area_acre >= 500
and A.solar_avg >= 5.0
AND A.pc_num <= 1000
and (A.fips_level1 = '06'  AND A.fips_country = 'US' AND A.fips_level2 = '025')

and B.volt_mn_kv >= 69
and B.fips_code like '%US06%'
and B.status = 'active'

OK. I don't know how many rows are selected in both tables (only EXPLAIN ANALYZE would tell), but it's probably significant. Knowing those numbers would help.

Then we have the worst-case CROSS JOIN condition ever:

and ST_within(ST_Centroid(A.wkb_geometry), ST_Buffer((B.wkb_geometry), 1000))

This means every row of A is matched against every row of B (so this expression is going to be evaluated a huge number of times), using a bunch of fairly complex, slow, CPU-intensive functions.

Of course it's horribly slow!

When you remove the ORDER BY, Postgres just happens (by chance?) to come up with a bunch of matching rows right at the start, outputs those, and stops once the LIMIT is reached.

Here's a little example:

Tables a and b are identical; each contains 1000 rows and has a column of type BOX.

select * from a cross join b where (a.b && b.b)     --- 0.28 s

Here 1000000 box overlap (operator &&) tests are completed in 0.28s. The test data set is generated so that the result set contains only 1000 rows.

create index a_b on a using gist(b);
create index b_b on b using gist(b);
select * from a cross join b where (a.b && b.b)     --- 0.01 s

Here the index is used to optimize the cross join, and it is ridiculously fast.

You need to optimize that geometry matching.

  • add columns which will cache:
    • ST_Centroid(A.wkb_geometry)
    • ST_Buffer((B.wkb_geometry), 1000)

There is NO POINT in recomputing those slow functions a million times during your CROSS JOIN, so store the results in a column. Use a trigger to keep them up to date.

  • add columns of type BOX which will cache:

    • Bounding Box of ST_Centroid(A.wkb_geometry)
    • Bounding Box of ST_Buffer((B.wkb_geometry), 1000)
  • add gist indexes on the BOXes

  • add a Box overlap test (using the && operator) which will use the index

  • keep your ST_Within test, which will act as a final filter on the rows that pass (a sketch of this whole setup follows below)
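
A minimal sketch of that setup, assuming PostGIS is installed; the centroid_geom / buffer_geom columns and the function, trigger, and index names are made up for illustration. Instead of separate BOX columns, it relies on the fact that a GiST index on a geometry column already indexes its bounding box, so the && test is index-assisted:

-- Cached-geometry columns (centroid_geom / buffer_geom are illustrative names).
ALTER TABLE global_site     ADD COLUMN centroid_geom geometry;
ALTER TABLE na_utility_line ADD COLUMN buffer_geom   geometry;

-- One-off backfill of the cached values.
UPDATE global_site     SET centroid_geom = ST_Centroid(wkb_geometry);
UPDATE na_utility_line SET buffer_geom   = ST_Buffer(wkb_geometry, 1000);

-- Keep the cache current with a trigger (shown for global_site only).
CREATE OR REPLACE FUNCTION global_site_cache_centroid() RETURNS trigger AS $$
BEGIN
    NEW.centroid_geom := ST_Centroid(NEW.wkb_geometry);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER global_site_centroid_trg
    BEFORE INSERT OR UPDATE ON global_site
    FOR EACH ROW EXECUTE PROCEDURE global_site_cache_centroid();

-- GiST indexes so the bounding-box test can use an index scan.
CREATE INDEX global_site_centroid_gist   ON global_site     USING gist (centroid_geom);
CREATE INDEX na_utility_line_buffer_gist ON na_utility_line USING gist (buffer_geom);

-- The join then runs the cheap, indexed && (bounding boxes overlap) test first
-- and keeps ST_Within as the exact filter on the rows that pass.
SELECT A.s_oid, A.s_id, A.area_acre, A.power_peak, A.nearby_city, A.solar_total
FROM global_site A
JOIN na_utility_line B
  ON A.centroid_geom && B.buffer_geom              -- indexed bounding-box test
 AND ST_Within(A.centroid_geom, B.buffer_geom)     -- exact test on the survivors
WHERE A.area_acre >= 500                           -- ...plus the rest of the original filters
ORDER BY A.area_acre
LIMIT 11;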

Maybe you can just index the ST_Centroid and ST_Buffer columns... and use an (indexed) "contains" operator; see here:

http://www.postgresql.org/docs/8.2/static/functions-geometry.html
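
If cached columns like those in the sketch above exist, the bounding-box "contains" idea can also be expressed with PostGIS's own operators, which are index-aware; a small hedged example reusing the hypothetical centroid_geom / buffer_geom columns:

-- ~ tests "left geometry's bounding box contains the right one's";
-- like &&, it can use the GiST indexes created above.
SELECT count(*)
FROM global_site A
JOIN na_utility_line B
  ON B.buffer_geom ~ A.centroid_geom;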

bobflux
2

I would suggest creating an index on area_acre. You may want to take a look at the following: http://www.postgresql.org/docs/9.0/static/sql-createindex.html

I would recommend doing this outside of peak hours, though, because index creation can be fairly intensive with a large amount of data. You will also want to look at rebuilding indexes on a schedule to keep performance from degrading over time; again, that schedule should fall outside of peak hours.
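
For example, something along these lines (the index name is illustrative; CREATE INDEX CONCURRENTLY avoids locking out writes while the index builds, while REINDEX does take a lock, so schedule it off-peak):

-- Build during off-peak hours, without blocking writes to the table.
CREATE INDEX CONCURRENTLY global_site_area_acre_ix
    ON global_site (area_acre);

-- Scheduled rebuild to keep the index from bloating over time.
REINDEX INDEX global_site_area_acre_ix;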

You may want to take a look at this article from a fellow SO'er and his experience with database slowdowns over time with indexes: Why does PostgresQL query performance drop over time, but restored when rebuilding index

Cole W
  • The table in question is already indexed `CREATE INDEX global_site_area_acre_idx ON global_site USING btree (area_acre DESC);` -- is there a better way to do it? – Chris Cashwell May 03 '11 at 19:22
1

If the A.area_acre field is not indexed that may slow it down. You can run the query with EXPLAIN to see what it is doing during execution.
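
For example, prefixing the query (or a cut-down version of it) with EXPLAIN ANALYZE executes it and reports the actual plan, row counts, and timings; plain EXPLAIN shows only the estimated plan without running the query:

EXPLAIN ANALYZE
select A.s_oid, A.s_id, A.area_acre
from global_site A
where A.area_acre >= 500
order by A.area_acre
limit 11;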

pyasi
0

Have you considered creating expression-based indexes for the benefit of the hairier joins and WHERE conditions?
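
For instance, GiST expression indexes over the two geometry functions used in the join are possible because ST_Centroid and ST_Buffer are immutable in standard PostGIS builds; the index names below are made up, and the query has to repeat exactly the same expressions for the planner to consider them:

CREATE INDEX global_site_centroid_expr_idx
    ON global_site USING gist (ST_Centroid(wkb_geometry));

CREATE INDEX na_utility_line_buffer_expr_idx
    ON na_utility_line USING gist (ST_Buffer(wkb_geometry, 1000));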

Craig Tullis
0

First off, I would look at creating indexes, ensuring your db is being vacuumed regularly, and increasing the shared_buffers and work_mem settings for your install.
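
As a rough illustration only (the right values depend entirely on the machine's RAM and workload, so treat the numbers below as placeholders rather than recommendations), the relevant postgresql.conf settings are:

# postgresql.conf -- placeholder values, tune for your hardware
shared_buffers = 2GB            # often sized around 25% of RAM
work_mem = 64MB                 # per-sort / per-hash memory, helps large sorts and joins
maintenance_work_mem = 512MB    # speeds up VACUUM and CREATE INDEX
effective_cache_size = 6GB      # planner hint about how much the OS is caching
autovacuum = on                 # keep tables vacuumed and analyzed automatically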

grantk
0

The first thing to look at is whether you have an index on the field you're ordering by. If not, adding one will dramatically improve performance. I don't know PostgreSQL that well, but something similar to:

CREATE INDEX area_acre ON global_site(area_acre)

As noted in other replies, the indexing process is intensive when working with a large data set, so do this during off-peak.

AvatarKava
  • That field is indeed indexed. The indexing was done like `CREATE INDEX global_site_area_acre_idx ON global_site USING btree (area_acre DESC);`. Is there anything wrong with this, and/or can it be tweaked? – Chris Cashwell May 03 '11 at 18:23
0

I am not familiar with the PostgreSQL optimizations, but it sounds like what is happening when the query is run with the ORDER BY clause is that the entire result set is created, then it is sorted, and then the top 11 rows are taken from that sorted result. Without the ORDER BY, the query engine can just generate the first 11 rows in whatever order it pleases and then it's done.

Having an index on the area_acre field very possibly may not help for the sorting (ORDER BY) depending on how the result set is built. It could, in theory, be used to generate the result set by traversing the global_site table using an index on area_acre; in that case, the results would be generated in the desired order (and it could stop after generating 11 rows in the result). If it does not generate the results in that order (and it seems like it may not be), then that index will not help in sorting the results.

One thing you might try is to remove the "CROSS JOIN" from the query. I doubt that this will make a difference, but it's worth a test. Because the WHERE clause joins the two tables (via ST_Within), I believe the result is the same as an inner join. It is possible that the use of the CROSS JOIN syntax is causing the optimizer to make an undesirable choice.
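
An equivalent formulation with an explicit inner join (the filters are unchanged from the original query; only the join syntax differs) would look roughly like this:

select A.s_oid, A.s_id, A.area_acre, A.power_peak, A.nearby_city, A.solar_total
from global_site A
inner join na_utility_line B
    on ST_Within(ST_Centroid(A.wkb_geometry), ST_Buffer(B.wkb_geometry, 1000))
where (A.power_peak between 1.0 and 100.0)
  and A.area_acre >= 500
  and A.solar_avg >= 5.0
  and A.pc_num <= 1000
  and (A.fips_level1 = '06' and A.fips_country = 'US' and A.fips_level2 = '025')
  and B.volt_mn_kv >= 69
  and B.fips_code like '%US06%'
  and B.status = 'active'
order by A.area_acre
limit 11;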

Otherwise (aside from making sure indexes exist for fields that are being filtered), you could play a bit of a guessing game with the query. One condition that stands out is the area_acre >= 500. This means that the query engine is considering all rows that meet that condition. But then only the first 11 rows are taken. You could try changing it to area_acre >= 500 and area_acre <= somevalue. The somevalue is the guessing part that would need adjustment to make sure you get at least 11 rows. This, however, seems like a pretty cheesy thing to do, so I mention it with some reticence.
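
For example, the open-ended filter would become a bounded one (the 2000 here is a pure guess, not a value from the data, and would need widening until at least 11 rows come back):

and A.area_acre >= 500
and A.area_acre <= 2000   -- hypothetical upper bound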

Mark Wilkins