
I am considering various MVCC-capable databases for an upcoming project and PostgreSQL came onto my radar.

The requirements of my program involve a sequence roughly like the following:

  1. Read some information from the current version of the database, modify 80-90% of the data and write it back in one or more transactions (imagine something like updating the grid in Conway's Game of Life where both the old and new state of the grid are required).

  2. Wait 1-2 minutes after the commit. During this time, clients can issue reads against the new data.

  3. Repeat.

The databases will be limited to something like 2-4GB.

~90% of the changes are updates to existing objects, ~5% will be new objects and ~5% will be deleted objects.

So my question is: can I reasonably run a plain VACUUM command as step 1.5 once every 1-2 minutes, and will PostgreSQL be able to keep up with potentially 2-3+GB of changes being made each time?
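
For concreteness, step 1.5 would look roughly like this (`grid` is just a stand-in for my real table):

    -- step 1: rewrite most of the data in a transaction
    BEGIN;
    UPDATE grid SET alive = NOT alive;  -- stands in for the real 80-90% update
    COMMIT;

    -- step 1.5: reclaim the dead row versions left behind by the update
    VACUUM grid;            -- plain VACUUM; does not block concurrent readers
    -- or, to refresh planner statistics at the same time:
    VACUUM ANALYZE grid;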

MindJuice
  • You probably don't need to run it manually. Adjusting the autovacuum settings for that specific table should be enough. But vacuum is only really necessary when you delete or insert a large number of rows. The updates won't need such an aggressive vacuum. – Mar 13 '12 at 22:50
  • My understanding is that every update also generates a new record with a new XID, and since I will update 80-90% of the objects each cycle, I am expecting to have many "old" records to clean up. – MindJuice Mar 13 '12 at 22:56
  • It is perhaps also important to note that while step 1 is running, clients may also be issuing reads against the "old" state of the database from step "0", so those old records need to be available while the new ones are being generated. – MindJuice Mar 13 '12 at 22:59
  • @MindJuice: You are right about UPDATEs leaving dead tuples behind. But Postgres can reuse that space without vacuuming by means of [HOT ("Heap-Only Tuple") updates](http://pgsql.tapoueh.org/site/html/misc/hot.html). Some exceptions apply - notably if indexed columns are changed in the update. On a different note, and out of curiosity: which other MVCC databases did you have in mind? – Erwin Brandstetter Mar 14 '12 at 00:05
  • I took a quick look at HyperSQL (HSQLDB), CouchDB and H2. Any others worth considering? – MindJuice Mar 14 '12 at 00:15
  • Since you're doing a massive amount of updating on each pass, it may be worthwhile creating the table with a sparse [fillfactor](http://www.postgresql.org/docs/9.1/static/sql-createtable.html) to allow plenty of space in each heap page for HOT updates (a sketch of this and the per-table autovacuum settings follows these comments). – dbenhur Mar 14 '12 at 06:56
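
A sketch of the two comment suggestions above (placeholder table `grid`; the numbers are illustrative only, not recommendations):

    -- leave free space in each heap page so HOT updates can stay on the same page
    CREATE TABLE grid (id integer PRIMARY KEY, alive boolean NOT NULL)
    WITH (fillfactor = 50);

    -- make autovacuum more aggressive for just this table instead of vacuuming manually
    ALTER TABLE grid SET (
        autovacuum_vacuum_scale_factor = 0.05,  -- vacuum once ~5% of rows are dead
        autovacuum_vacuum_cost_limit   = 1000   -- let autovacuum work faster on this table
    );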

1 Answer


I believe Postgres should do fine in this scenario. The workload is uncommon enough that a manual VACUUM between the huge updates seems like a reasonable option.

Consider whether you can, instead of the huge updates, generate a new set of tables, ANALYZE them (necessary!), and then, with the power of transactional DDL, drop the old ones and rename the new ones into their place. This should alleviate your worries about VACUUM.
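
A rough sketch of that swap, assuming a hypothetical table grid(id, alive) standing in for the real data (all names are placeholders):

    -- build and populate the replacement table outside the critical section
    CREATE TABLE grid_new (LIKE grid INCLUDING INDEXES);
    INSERT INTO grid_new SELECT id, NOT alive FROM grid;  -- stands in for the real regeneration
    ANALYZE grid_new;                                     -- necessary: fresh planner statistics

    -- the swap itself is atomic thanks to transactional DDL
    BEGIN;
    ALTER TABLE grid     RENAME TO grid_old;
    ALTER TABLE grid_new RENAME TO grid;
    COMMIT;

    DROP TABLE grid_old;  -- or defer this a little, see the comments below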

In either scenario, you should do some serious tuning. In particular, look at shared_buffers, the checkpoint-related parameters, and the vacuum-related ones. Also, remember to benchmark with realistic workloads.
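
Illustrative starting points only (postgresql.conf for a 9.1-era server with a few GB of RAM dedicated to this workload; tune against your own benchmarks):

    shared_buffers = 1GB                 # a sizeable share of RAM for a 2-4GB data set
    checkpoint_segments = 32             # more WAL between checkpoints during bulk writes
    checkpoint_completion_target = 0.9   # spread checkpoint I/O over the interval
    maintenance_work_mem = 256MB         # speeds up VACUUM and index builds
    autovacuum_naptime = 30s             # wake autovacuum often enough for 1-2 minute cycles
    autovacuum_vacuum_cost_limit = 1000  # let autovacuum clean up faster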

maniek
  • Interesting suggestion about using two separate tables and renaming at the end. That just might work for me. I'll ponder this a bit. Thanks! – MindJuice Mar 14 '12 at 00:12
  • To rename a table, the database first has to lock the table. This is much slower than an ordinary row lock for an update. – Frank Heikens Mar 14 '12 at 07:46
  • @FrankHeikens: This is a tradeoff; the OP wants to update almost the WHOLE table, so a brief exclusive lock could easily be better than dealing with VACUUM and the like. This is especially true if the readers issue only short queries. Alternatively, one can imagine doing this in the client - a small search_path dance could mean that the "old" readers use the tables in the old schema, new readers use the ones in the new schema, while in the background you are preparing another version. Then you drop the schemas not used by anyone any more (sketched after these comments). – maniek Mar 14 '12 at 12:24
  • If you go with inserting into a new table for each cycle, be sure to have a single transaction rename the table-in-use to an "old" name and rename the new table to the table-in-use. Leave some time between the commit of this transaction and dropping the old table, since there is a small window in which transactions planned against the old table's OID may still be executing after the commit. You might want to start the "move the new table into place" transaction with a DROP TABLE IF EXISTS statement for the "old" table name. – kgrittn Apr 09 '12 at 05:18
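
A sketch of the search_path variant maniek describes in the comments (schema names and table definition are placeholders):

    -- prepare the next generation in its own schema
    CREATE SCHEMA gen_42;
    CREATE TABLE gen_42.grid (id integer PRIMARY KEY, alive boolean NOT NULL);
    -- ... populate and ANALYZE gen_42.grid here ...

    -- new reader sessions point at the new schema; existing sessions keep their old search_path
    SET search_path = gen_42, public;

    -- once no session is using the previous generation any more, drop it wholesale
    DROP SCHEMA gen_41 CASCADE;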