8

I have the following UPDATE query:

UPDATE Indexer.Pages SET LastError=NULL where LastError is not null;

Right now, this query takes about 93 minutes to complete. I'd like to find ways to make this a bit faster.

The Indexer.Pages table has around 506,000 rows, and about 490,000 of them contain a value for LastError, so I doubt I can take advantage of any indexes here.

The table (when uncompressed) has about 46 gigs of data in it, however the majority of that data is in a text field called html. I believe simply loading and unloading that many pages is causing the slowdown. One idea would be to make a new table with just the Id and the html field, and keep Indexer.Pages as small as possible. However, testing this theory would be a decent amount of work since I actually don't have the hard disk space to create a copy of the table. I'd have to copy it over to another machine, drop the table, then copy the data back which would probably take all evening.

Ideas? I'm using Postgres 9.0.0.

UPDATE:

Here's the schema:

CREATE TABLE indexer.pages
(
  id uuid NOT NULL,
  url character varying(1024) NOT NULL,
  firstcrawled timestamp with time zone NOT NULL,
  lastcrawled timestamp with time zone NOT NULL,
  recipeid uuid,
  html text NOT NULL,
  lasterror character varying(1024),
  missingings smallint,
  CONSTRAINT pages_pkey PRIMARY KEY (id ),
  CONSTRAINT indexer_pages_uniqueurl UNIQUE (url )
);

I also have two indexes:

CREATE INDEX idx_indexer_pages_missingings
  ON indexer.pages
  USING btree
  (missingings )
  WHERE missingings > 0;

and

CREATE INDEX idx_indexer_pages_null
  ON indexer.pages
  USING btree
  (recipeid )
  WHERE NULL::boolean;

There are no triggers on this table, and there is one other table that has a FK constraint on Pages.PageId.

Mike Christensen
  • Updating 500,000 rows shouldn't take 93 minutes in the first place. I assume there is something else involved. Can you show us the table definition? Also, copying 500,000 rows should be done in a couple of minutes (if not seconds using `COPY`), not the "whole evening". –  Jun 18 '12 at 16:41
  • Unless the html is normally very small, it will automatically be stored in a separate TOAST table behind the scenes and will not be significant in this update. Any triggers or foreign key definitions could be very significant -- are there any? Are there any indexes which reference the LastError column? If so, they are likely to be an issue. If you can arrange to update in smaller batches (using key ranges, for example) and VACUUM in between batches, you will avoid table bloat. Finally, upgrade: http://www.postgresql.org/support/versioning/ A X.Y.0 release is not a good place to be. – kgrittn Jun 18 '12 at 16:41
  • I'll post more schema info later on today. However, right now I can say there's no triggers, and this schema was designed for fast inserts, so there's not much in the way of indexes. – Mike Christensen Jun 18 '12 at 17:03
  • Like @kgrittn suggested, try doing batched transactions instead of going at all the data in one transaction. Is there a primary key, and is it a sequence number whose min and max you could grab for batches of 10K to start? (You will want to play with the batch size to find the sweet spot.) No matter what the database (Oracle, SQL Server, Postgres), going after a huge amount of data in one transaction sometimes ends up slowing things down. – Kuberchaun Jun 18 '12 at 19:30
  • @a_horse_with_no_name - I've updated with the schema. – Mike Christensen Jun 19 '12 at 02:00
  • This does sound strange. For the update the HOT feature should kick in, so you shouldn't be affected by the indexes. When I run an update on a 750k-row table on my notebook that sets a single non-indexed column to NULL for 600k rows, it takes about 5 seconds. –  Jun 19 '12 at 06:39
  • @a_horse_with_no_name: As I wrote in my (updated) answer: HOT can only reuse dead tuples from previous transactions. So it could reuse existing table bloat. But it cannot reuse dead tuples inside the same transaction. At least that is my understanding. Splitting up the operation into a handful of logical slices (that are spread out physically) may prevent further table bloat and be faster, too. – Erwin Brandstetter Jun 20 '12 at 13:36

3 Answers

6

What @kgrittn posted as a comment is the best answer so far. I am merely filling in details.

Before you do anything else, you should upgrade PostgreSQL to a current version, or at least to the latest point release of your major version. See the project's versioning guidelines.

I also want to stress what Kevin mentioned about indexes involving the column LastError. Normally, HOT updates can recycle dead rows on a data page and make UPDATEs a lot faster - effectively removing (most of) the need for vacuuming.

If your column is used in any index in any way, HOT UPDATEs are disabled, because that would break the index(es). If that is the case, you should be able to speed up the query a lot by dropping all of these indexes before the UPDATE and recreating them afterwards.
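
A minimal sketch of what that would look like, assuming a hypothetical index on lasterror existed (the schema shown above has none, so this only applies if such an index is added):

DROP INDEX indexer.idx_pages_lasterror;   -- hypothetical index name

UPDATE indexer.pages SET lasterror = NULL WHERE lasterror IS NOT NULL;

CREATE INDEX idx_pages_lasterror ON indexer.pages (lasterror);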

In this context it would help to run multiple smaller UPDATEs if ...

... the updated column is not involved in any indexes (enabling HOT updates).
... the UPDATE is easily divided into multiple patches, run in multiple transactions.
... the rows in those patches are spread out over the table (physically, not logically).
... there are no other concurrent transactions keeping dead tuples from being reused.

Then you would not need to VACUUM in between multiple patches, because HOT updates can reuse dead tuples directly - only dead tuples from previous transactions, not from the same or concurrent ones. You may want to schedule a VACUUM at the end of the operation, or just let auto-vacuuming do its job.

The same could be done with any other index that is not needed for the UPDATE - and judging from your numbers the UPDATE is not going to use an index anyway. If you update large parts of your table, building new indexes from scratch is much faster than incrementally updating indexes with every changed row.

Also, your UPDATE is not likely to break any foreign key constraints, so you could try dropping & recreating those, too. This does open a time slot in which referential integrity is not enforced; if integrity is violated during the UPDATE, you get an error when trying to recreate the FK. If you do it all within one transaction, concurrent transactions never get to see the dropped FK, but you take a write lock on the table - the same as with dropping / recreating indexes or triggers.
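
The syntax for that looks roughly like this - the referencing table and constraint names below are hypothetical, since they are not shown in the question:

ALTER TABLE some_child_table DROP CONSTRAINT some_child_table_pageid_fkey;  -- hypothetical names
-- ... run the UPDATE ...
ALTER TABLE some_child_table
  ADD CONSTRAINT some_child_table_pageid_fkey
  FOREIGN KEY (pageid) REFERENCES indexer.pages (id);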

Lastly, disable & enable triggers that are not needed for the update.

Be sure to do all of this in one transaction. Maybe do it in a number of smaller patches, so it does not block concurrent operations for too long.

So:

BEGIN;
ALTER TABLE tbl DISABLE TRIGGER user; -- disable all self-made triggers
-- DROP indexes (& fk constraints ?)
-- UPDATE ...
-- RECREATE indexes (& fk constraints ?)
ALTER TABLE tbl ENABLE TRIGGER user;
COMMIT;

You cannot run VACUUM inside a transaction block. Per documentation:

VACUUM cannot be executed inside a transaction block.

You could split your operation into a few big chunks and run in between:

VACUUM ANALYZE tbl;

If you don't have to deal with concurrent transactions you could (even more effectively):

ALTER TABLE tbl DISABLE TRIGGER user; -- disable all self-made triggers
-- DROP indexes (& fk constraints ?)

-- Multiple UPDATEs with logical slices of the table
-- each slice in its own transaction.
-- VACUUM ANALYZE tbl;  -- optionally in between, or autovacuum kicks in

-- RECREATE indexes (& fk constraints ?)
ALTER TABLE tbl ENABLE TRIGGER user;
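
Since the primary key here is a uuid (no handy range of sequence numbers), one way to form such a slice is to update a limited batch of matching rows per transaction and repeat until nothing is left - a rough sketch, with an arbitrary batch size:

-- Run repeatedly, each run in its own transaction, until it reports UPDATE 0:
UPDATE indexer.pages
SET    lasterror = NULL
WHERE  id IN (
   SELECT id
   FROM   indexer.pages
   WHERE  lasterror IS NOT NULL
   LIMIT  10000
   );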
Erwin Brandstetter
  • Minor points: Incremental vacuums will free tuple line pointers, which HOT pruning during normal access can't. That *may* make it worthwhile between `UPDATE` steps. I recommend a `VACUUM FREEZE ANALYZE` on the table at the end of the process if these rows are expected to remain (without further `UPDATE` or `DELETE`) for a long time; otherwise the autovacuum jobs which kick in to freeze tuples to prevent transaction ID wraparound can be painful. You may be able to rewrite the tuples with both hint bit info and the frozen transaction IDs at this point, saving a little overhead. – kgrittn Jun 22 '12 at 19:50
  • BTW, people are currently addressing some of the long-standing performance issues with foreign keys -- including being a lot smarter about dodging overhead if the particular changes made can't cause a referential integrity problem. If this answer is still around after 9.3 is out, dropping foreign keys may be much less important. (And yes, I do mean 9.3, which will probably be released in the summer of 2013.) – kgrittn Jun 22 '12 at 19:55
1

UPDATE Indexer.Pages 
  SET LastError=NULL
  ;

The WHERE clause is not needed: the fields that are already NULL would simply be set to NULL again, which does no harm (I don't think this would affect performance significantly).

Given your row count of about 500K and your table size of 46 GB, your average row size works out to roughly 90 KB. That is huge. Maybe you could move the {unused, sparse} columns of your table to other tables?
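
To check whether those 46 GB actually sit in the main table or mostly in its TOAST table (which this UPDATE would not have to rewrite), you could compare the sizes directly - a quick sanity check:

SELECT pg_size_pretty(pg_relation_size('indexer.pages'))       AS main_heap_only,
       pg_size_pretty(pg_total_relation_size('indexer.pages')) AS incl_toast_and_indexes;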

wildplasser
  • I tend to be on the same page as you, however @kgrittn mentioned that since these values are TOASTed off the main table, it shouldn't affect update perf since I don't touch those columns. I'm curious if this is true. Unfortunately, I won't be able to try it until later this afternoon when I get home. – Mike Christensen Jun 18 '12 at 17:01
  • @MikeChristensen: Oh, this advice is not good! It would lead to empty UPDATEs slowing down your query. You should definitely include the WHERE clause. Also, as PostgreSQL uses TOAST tables for big values, it will probably not help much to split your table. – Erwin Brandstetter Jun 18 '12 at 19:44
  • Besides the raw cost of the unnecessary row updates mentioned by @ErwinBrandstetter, you would unnecessarily bloat your table -- causing all subsequent access to be slow until you do aggressive maintenance. Note that in PostgreSQL, an UPDATE creates a new version of the row at some new position, while leaving the old version of the row for later cleanup by a vacuum process. An UPDATE with no WHERE clause is going to double the size of your base table. This is why I was recommending multiple smaller UPDATEs with VACUUMs in between, so the space from the old rows can be reused. – kgrittn Jun 18 '12 at 20:53
  • As I said: since 10% of the rows are already NULL, it won't cause too much harm to NULL them again - that's just 10% extra row versions. The real problem is in the table structure and the row size. I'd like to see the schema; TOAST tables could (or could not) be involved here. – wildplasser Jun 18 '12 at 21:25
  • @kgrittn: ... or let HOT updates reuse the dead rows, removing the need for VACUUM in between. I am pretty sure, but I would like to run this by you: HOT UPDATEs cannot reuse dead tuples inside a single transaction, only from previous transactions, right? – Erwin Brandstetter Jun 18 '12 at 22:12
  • @wildplasser: 10% isn't too much harm, but there is really no good reason to take *any* harm here. In many `UPDATE` situations it is very good advice to use such a `WHERE` clause to prevent empty UPDATEs, which are normally just as expensive as any other UPDATE. – Erwin Brandstetter Jun 18 '12 at 22:18
  • @ErwinBrandstetter: I know. But I fear that the real "harm" is in the data model (which is not shown, and *should* be shown, IMHO). – wildplasser Jun 18 '12 at 22:27
  • @wildplasser: You may be on to something there. – Erwin Brandstetter Jun 18 '12 at 22:28
0

Your theory is probably correct. Reading the full table (and then doing anything with it) is probably what is causing the slowdown.

Why don't you just create another table that has PageId and LastError? Initialize this with the data in the table you have now (which should take less than 93 minutes). Then, use the LastError from the new table.

At your leisure, you can remove LastError from your existing table.

By the way, I don't normally recommend keeping two copies of a column in two separate tables. In this case, though, you sound like you are stuck and need a way to proceed.
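
A sketch of that approach - the new table name is hypothetical:

-- Narrow side table holding only id and the error state:
CREATE TABLE indexer.page_errors AS
SELECT id, lasterror
FROM   indexer.pages;

ALTER TABLE indexer.page_errors ADD PRIMARY KEY (id);

-- Later, at your leisure, once everything reads from the new table:
-- ALTER TABLE indexer.pages DROP COLUMN lasterror;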

Gordon Linoff
  • Great idea about creating a new table with the smaller amount of data, then dropping the other columns from the first. Then I could just rename the tables when I'm done. Also, the new table could just have `HtmlId` and `Html`, and `Indexer.Pages` could have a reference to `HtmlId` - that might be a bit more *normal*. – Mike Christensen Jun 18 '12 at 16:20