25

According to this page in the manual, indexes don't need to be maintained. However, we are running a PostgreSQL table with a continuous rate of updates, deletes and inserts that, over time (a few days), sees significant query degradation. If we delete and recreate the index, query performance is restored.

We are using out-of-the-box settings.
In our test, the table starts out empty and grows to half a million rows. Rows are fairly large (lots of text fields).

We are searching based on an index, not the primary key (I've confirmed the index is being used, at least under normal conditions).

The table is being used as a persistent store for a single process. We are using PostgreSQL on Windows with a Java client.

I'm willing to give up insert and update performance to keep query performance up.

We are considering rearchitecting the application so that data is spread across various dynamic tables in a manner that allows us to drop and rebuild indexes periodically without impacting the application. However, as always, there is a time crunch to get this to work and I suspect we are missing something basic in our configuration or usage.

We have considered forcing vacuuming and rebuilds to run at certain times, but I suspect the locking period for such an action would cause our queries to block. This may be an option, but there are some real-time implications (windows of 3-5 seconds) that require other changes in our code.
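For what it's worth, a plain VACUUM (unlike VACUUM FULL) does not take a lock that blocks reads or writes, so forcing it at quiet times may be viable without breaking the 3-5 second windows. A minimal sketch:

-- Plain VACUUM marks dead row versions for reuse and can run
-- alongside normal traffic; ANALYZE refreshes planner statistics.
VACUUM ANALYZE icl_contacts;

-- VACUUM FULL compacts the table but takes an exclusive lock and
-- blocks queries for its duration, so it belongs in a maintenance window.
-- VACUUM FULL icl_contacts;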

Additional information: Table and index

CREATE TABLE icl_contacts
(
  id bigint NOT NULL,
  campaignfqname character varying(255) NOT NULL,
  currentstate character(16) NOT NULL,
  xmlscheduledtime character(23) NOT NULL,
  -- ... 25 or so other fields, most of them fixed or varying character fields ...
  CONSTRAINT icl_contacts_pkey PRIMARY KEY (id)
)
WITH (OIDS=FALSE);
ALTER TABLE icl_contacts OWNER TO postgres;

CREATE INDEX icl_contacts_idx
  ON icl_contacts
  USING btree
  (xmlscheduledtime, currentstate, campaignfqname);

EXPLAIN ANALYZE:

Limit  (cost=0.00..3792.10 rows=750 width=32) (actual time=48.922..59.601 rows=750 loops=1)
  ->  Index Scan using icl_contacts_idx on icl_contacts  (cost=0.00..934580.47 rows=184841 width=32) (actual time=48.909..55.961 rows=750 loops=1)
        Index Cond: ((xmlscheduledtime < '2010-05-20T13:00:00.000'::bpchar) AND (currentstate = 'SCHEDULED'::bpchar) AND ((campaignfqname)::text = '.main.ee45692a-6113-43cb-9257-7b6bf65f0c3e'::text))

And, yes, I am aware that there are a variety of things we could do to normalize and improve the design of this table. Some of these options may be available to us.

My focus in this question is on understanding how PostgreSQL is managing the index and query over time (understanding why, not just fixing it). If it were to be done over or significantly refactored, there would be a lot of changes.

SkyWalker
Jim Rush
  • What version do you use? When you abuse VACUUM FULL, you're bloating your indexes and you have to do a REINDEX. And forget the out-of-the-box settings; those are not going to perform. You have to configure your database to tell it how to use memory, do autovacuum, etc. No configuration, no performance. – Frank Heikens Mar 06 '10 at 15:45
  • What version are you using? Do you have autovacuum running? Are you logging autovacuum calls so you can see how often it is actually cleaning up records for re-use? – Matthew Wood Mar 07 '10 at 17:36
  • And provide a proper execution plan (created using EXPLAIN ANALYZE) for those queries. – Szymon Lipiński Mar 07 '10 at 20:38
  • @Frank - I'll grab version info. @Matthew - I thought autovacuum was running, but given the results, let's assume not. I'll check logs. Is vacuum doing more than just cleaning up disk space? Should the performance drop this much just because of a table with a lot of dirty space? – Jim Rush Mar 07 '10 at 23:15
  • @Frank - Forgot version number above. It is 8.3.7. – Jim Rush Mar 08 '10 at 21:22

5 Answers

17

Autovacuum should do the trick, provided you configure it for your desired performance.

Notes:

VACUUM FULL: compacts the table and reclaims loads of disk space, but it locks the whole table while it runs.

VACUUM: marks dead row versions for reuse and reclaims some disk space. It can be run in parallel with a production system, but it generates a lot of I/O, which can impact performance.

ANALYZE: rebuilds query planner statistics. It can be combined with VACUUM (as VACUUM ANALYZE) or run on its own.

More detailed notes found here
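One way to check whether autovacuum is actually reaching the table is the pg_stat_user_tables view (the last_* columns exist from 8.2, n_live_tup/n_dead_tup from 8.3):

-- Shows when the table was last vacuumed/analyzed, manually or by
-- autovacuum, and how many dead row versions are waiting to be reclaimed.
SELECT relname, last_vacuum, last_autovacuum, last_analyze, last_autoanalyze,
       n_live_tup, n_dead_tup
  FROM pg_stat_user_tables
 WHERE relname = 'icl_contacts';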

Timothy
  • Any suggestions on why vacuuming would make such a difference? Given the docs at the link, I can only guess that "To update data statistics used by the PostgreSQL query planner" would have a performance impact. However, would a distribution inefficiency be capable of causing this much of a performance problem? Could such an inefficiency convince PostgreSQL to skip the index and perform a table scan (i.e. render the index useless)? – Jim Rush Mar 08 '10 at 17:09
  • Performance gain from VACUUM can come mainly from disk space recovery and query plan optimization. Your questions about distribution inefficiency would be related to disk space recovery (see section 23.1.2 in that link). And, yes, it can have a big impact with large enough data sets. – Timothy Mar 09 '10 at 08:07
4

This smells like index bloat to me. I'll refer you to this page:

http://www.postgresql.org/docs/8.3/static/routine-reindex.html

which says at the bottom:

Also, for B-tree indexes a freshly-constructed index is somewhat faster to access than one that has been updated many times, because logically adjacent pages are usually also physically adjacent in a newly built index. (This consideration does not currently apply to non-B-tree indexes.) It might be worthwhile to reindex periodically just to improve access speed.

This does seem to conflict with the page you referenced, which says that indexes "don't require maintenance or tuning".

Have you tried "create index concurrently"?
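In case it helps, a sketch of the usual swap pattern (the name icl_contacts_idx_new is made up for illustration). CREATE INDEX CONCURRENTLY builds the replacement without taking the write-blocking lock a plain CREATE INDEX needs:

-- Build a fresh copy of the index without blocking writers.
CREATE INDEX CONCURRENTLY icl_contacts_idx_new
  ON icl_contacts (xmlscheduledtime, currentstate, campaignfqname);

-- Swap it in; DROP INDEX takes only a brief exclusive lock.
DROP INDEX icl_contacts_idx;
ALTER INDEX icl_contacts_idx_new RENAME TO icl_contacts_idx;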

Chris Curvey
3

As for performance, using strings to store time and status information is quite a bottleneck. First of all, indexes on text are quite inefficient: comparing two times on the same day needs at least 11 character comparisons in the format you used, whereas with a native time type it reduces to a single comparison. This also affects the size of the index, and a large index is slower to search and won't be kept in memory by the database. The same considerations apply to the state column: if it represents a small set of states, you should map them to integers, which shrinks the index nodes, and the index size accordingly. Furthermore, even with these built-in types, the index will be useless if you don't specify the actual time in your query.
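A sketch of what that change might look like, assuming the application can absorb it (the state mapping below is invented for illustration):

-- Convert the ISO-8601 text column to a real timestamp.
ALTER TABLE icl_contacts
  ALTER COLUMN xmlscheduledtime TYPE timestamp
  USING xmlscheduledtime::timestamp;

-- Replace the padded char(16) state with a small integer code.
ALTER TABLE icl_contacts
  ALTER COLUMN currentstate TYPE smallint
  USING CASE trim(currentstate)
          WHEN 'SCHEDULED' THEN 1
          -- map the remaining states here
          ELSE 0
        END;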

Ákos Kiss
  • The project requirement was to be database agnostic, so we couldn't use the datetime types. But I agree it should have been reduced to an integer. We did so on another project, but for some reason or another this team chose not to. – Jim Rush Apr 17 '11 at 13:29
  • All databases support ISO dates and datetimes, so WTF? – bobflux May 03 '11 at 21:35
  • @peufeu +1 for pointing out that all databases have Date and DateTime data types. Also, I always see these "I'm using an ORM for database independence" or "I'm using text fields for database independence" or "I never use stored procedures, for database independence" statements, and so **very** often those statements are in the context of some horrific performance problem or other... :-) – Craig Tullis Mar 27 '14 at 14:27
2

Is the '2010-05-20T13:00:00.000' value that xmlscheduledtime is being compared to, part of the SQL, or supplied as a parameter?

When planning how to run the query, saying that a field must be less than a supplied parameter with an as yet unknown value doesn't give PostgreSQL much to go on. It doesn't know whether that'll match nearly all the rows, or hardly any of the rows.

Reading about how the planner uses statistics helps tremendously when trying to figure out why your database is using the plans it is.

You might get better select performance by changing the order of fields in that composite index, or by creating a new index with the fields ordered (campaignfqname, currentstate, xmlscheduledtime). The index then takes you straight to the campaign fq name and current state you are interested in, and the index scan over the xmlscheduledtime range covers only rows you're after.
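A sketch of the suggested index (the name is illustrative):

-- Equality columns first, the range column last, so the scan stays
-- within the matching campaign/state and walks only the time range.
CREATE INDEX icl_contacts_campaign_state_time_idx
  ON icl_contacts (campaignfqname, currentstate, xmlscheduledtime);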

Stephen Denne
  • It's supplied as a parameter. The logic involves scheduling and retrying of a significant number of work tasks. If I had been involved in the original design, that field would have been numeric instead of text (we have a strong desire to stay as database agnostic as possible and therefore wouldn't have used a timestamp field). I've wondered if the type of comparison on that field had any relation to our problem, but lack the resources to create a suitable number of test cases to better understand the problem. – Jim Rush Mar 08 '10 at 20:34
2

That's a textbook case. You should set up autovacuum to be a lot more aggressive.
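A sketch of what "more aggressive" could mean in postgresql.conf; these parameters exist in 8.3, but the values are illustrative starting points, not recommendations:

# Wake the autovacuum launcher more often and trigger vacuums sooner.
autovacuum = on
autovacuum_naptime = 15s                 # default 1min
autovacuum_vacuum_scale_factor = 0.05    # default 0.2 (vacuum at ~5% dead rows)
autovacuum_analyze_scale_factor = 0.02   # default 0.1
autovacuum_vacuum_cost_delay = 10ms      # default 20ms; lets vacuum work faster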

bobflux