
I am comparing queries on PostgreSQL 8.3.14 which return the same result set.

I have used EXPLAIN on my queries to track the estimated total cost. I have also run each query a few times and recorded the total runtime. I understand that consecutive runs will cause more data to be cached and skew the actual no-cache runtime.

Still, I would expect the EXPLAIN cost to be roughly proportional to the total runtime (allowing for cache skew).
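For reference, the two numbers come from the first line of EXPLAIN output and from psql's `\timing` meta-command, roughly like this (the query text is a placeholder):

```sql
EXPLAIN SELECT ...;   -- the top line reads "cost=start..total"; I record the total

\timing on            -- psql meta-command: report wall-clock time per statement
SELECT ...;           -- run the query itself and note the elapsed time
```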

My data contradicts this. I compared four queries:

  1. Query A
    • Total Cost: 119 500
    • Average Runtime: 28.101 seconds
  2. Query B
    • Total Cost: 115 700
    • Average Runtime: 28.291 seconds
  3. Query C
    • Total Cost: 116 200
    • Average Runtime: 32.409 seconds
  4. Query D
    • Total Cost: 93 200
    • Average Runtime: 37.503 seconds

I ran Query D last, and if anything it should have been the fastest because of the caching effect. Running the queries without the cache seems to be difficult, based on this Q&A:

[SO]: See and clear Postgres caches/buffers?

How can I measure which query is the most efficient?

Oliver

2 Answers


The query cost shown by the planner is a function of the structure of your indexes and also the relative frequencies of certain values in the relevant tables. PostgreSQL keeps track of the most common values seen in all of the columns of all of your tables so that it can get an idea of how many rows each stage of each plan is likely to operate on.

This information can become out of date. If you are really trying to get an accurate idea of how costly a query is, make sure the statistics PostgreSQL is using are up to date by executing a VACUUM ANALYZE statement.
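For example (the table name here is a placeholder for whatever your queries touch):

```sql
-- Reclaim dead rows and refresh planner statistics in one pass:
VACUUM ANALYZE orders;

-- Or refresh the statistics alone, which is cheaper:
ANALYZE orders;
```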

Beyond that, the planner is forced to make some apples-to-oranges comparisons: somehow weighing the time it takes to seek on disk against the time it takes to run a tight loop over an in-memory relation. Since different hardware does these things at different relative speeds, PostgreSQL sometimes guesses wrong, especially for near ties. These relative costs can be tuned in your server's config file.
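As a sketch, these are the main knobs (shown with their stock defaults, not recommendations); they live in postgresql.conf but can also be set per session:

```sql
SET seq_page_cost = 1.0;             -- cost of fetching a page sequentially
SET random_page_cost = 4.0;          -- cost of fetching a page out of order (seeks)
SET cpu_tuple_cost = 0.01;           -- cost of processing one row
SET effective_cache_size = '128MB';  -- how much data the planner assumes is cached
```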

Edit: The statistics collected by PostgreSQL do not relate to "query performance" and are not updated by successive queries. They only describe the frequency and distribution of values in each column of each table (except where disabled). Having accurate statistics is important for accurate query planning, but it's on you, the operator, to tell PostgreSQL how often and at what level of detail those statistics should be gathered. The discrepancy you are observing is a sign that the statistics are out of date, or that you could benefit from tuning other planner parameters.
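For instance, you can raise the sampling detail for a specific column the planner keeps mis-estimating (table and column names are hypothetical), then re-gather:

```sql
-- Track up to 100 most-common values / histogram buckets for this column:
ALTER TABLE orders ALTER COLUMN status SET STATISTICS 100;
ANALYZE orders;

-- Or raise the default target used by ANALYZE for columns
-- that have no explicit per-column setting:
SET default_statistics_target = 100;
```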

SingleNegationElimination
  • The docs say `VACUUM ANALYZE [table]` will update statistics for all the tables I include. It sounds to me that this will skew the query comparison because each successive run will have better statistics. – Oliver Mar 05 '12 at 14:24

Try running them through EXPLAIN ANALYZE and posting the output to http://explain.depesz.com/
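Something along these lines, with your real query in place of the placeholder:

```sql
-- Unlike plain EXPLAIN, this executes the query and reports actual
-- time and row counts next to the planner's estimates for each node:
EXPLAIN ANALYZE
SELECT ...;
```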

Scott Marlowe
  • I've done that, however, that doesn't help me quantitatively determine which query is the most efficient. It makes it easier to find problem areas but not compare query efficiency. – Oliver Mar 05 '12 at 14:30