
Our Postgres DB (hosted on Google Cloud SQL with 1 CPU and 3.7 GB of RAM, see below) consists mostly of one big ~90GB table with about ~60 million rows. The usage pattern is almost exclusively appends plus a few indexed reads near the end of the table. From time to time, a few users get deleted, which removes a small percentage of rows scattered across the table.

This all works fine, but every few months an autovacuum gets triggered on that table, which significantly impacts our service's performance for ~8 hours:

  • Storage usage increases by ~1GB for the duration of the autovacuum (several hours), then slowly returns to the previous value (might eventually drop below it, due to the autovacuum freeing pages)
  • Database CPU utilization jumps from <10% to ~20%
  • Disk read/write ops increase from near zero to ~50/second
  • Database Memory increases slightly, but stays below 2GB
  • Transactions/sec and ingress/egress bytes are largely unaffected, as expected

This increases our service's 95th percentile latency from ~100ms to ~0.5-1s during the autovacuum, which in turn triggers our monitoring. The service handles around ten requests per second, with each request consisting of a few simple DB reads/writes that normally take 2-3ms each.

Here are some monitoring screenshots illustrating the issue:

[Screenshots: CPU usage, storage usage, memory usage, read/write operations, latency]

The DB configuration is fairly vanilla:

[Screenshot: DB configuration]

The log entry documenting this autovacuum process reads as follows:

automatic vacuum of table "XXX": index scans: 1
pages: 0 removed, 6482261 remain, 0 skipped due to pins, 0 skipped frozen
tuples: 5959839 removed, 57732135 remain, 4574 are dead but not yet removable
buffer usage: 8480213 hits, 12117505 misses, 10930449 dirtied
avg read rate: 2.491 MB/s, avg write rate: 2.247 MB/s
system usage: CPU 470.10s/358.74u sec elapsed 38004.58 sec
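
For reference, the amount of pending dead-tuple cleanup and the time of the last autovacuum can be checked with a query along these lines (`big_table` is a placeholder for our table name):

    SELECT relname,
           n_live_tup,
           n_dead_tup,          -- dead tuples waiting to be vacuumed
           last_autovacuum,     -- when autovacuum last processed the table
           autovacuum_count
    FROM   pg_stat_user_tables
    WHERE  relname = 'big_table';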

Any suggestions as to what we could tune to reduce the impact of future autovacuums on our service? Or are we doing something wrong?

MrMage

1 Answer


If you increase `autovacuum_vacuum_cost_delay`, autovacuum will run slower and be less invasive.

However, the better solution is usually to make autovacuum faster by setting `autovacuum_vacuum_cost_limit` to 2000 or so, so that it finishes sooner.
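
As a rough sketch (untested against your setup), both parameters can also be set per table as storage parameters; `big_table` stands in for your table, and the values are only examples. On Cloud SQL, the global equivalents would typically be set as database flags rather than with `ALTER SYSTEM`.

    -- Speed autovacuum up for this table so it finishes sooner:
    ALTER TABLE big_table SET (autovacuum_vacuum_cost_limit = 2000);

    -- ...or slow it down instead (less I/O per second, but it runs longer;
    -- the value is in milliseconds, default 20 on PostgreSQL 11 and older):
    ALTER TABLE big_table SET (autovacuum_vacuum_cost_delay = 40);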

You could also try to schedule VACUUMs of the table yourself at times when it hurts least.
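
A manual run is just an ordinary VACUUM statement; a sketch (`big_table` is again a placeholder) that could be fired from cron via `psql` or `vacuumdb` during off-peak hours:

    -- e.g. from cron:  psql -c 'VACUUM (VERBOSE) big_table;'
    VACUUM (VERBOSE) big_table;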

But frankly, if a single innocuous autovacuum is enough to disturb your operation, you need more I/O bandwidth.

Laurenz Albe
  • Thank you, increasing `autovacuum_vacuum_cost_delay` and *decreasing* `autovacuum_vacuum_cost_limit` helped improve service performance, but of course makes the `autovacuum` take longer (but that's fine). My suspicion is that Google's ~3k IOPS and 50 MB/s throughput limits (see https://cloud.google.com/compute/docs/disks/performance) for a 100 GB persistent disk are at fault here. – MrMage Feb 25 '19 at 14:11
  • Delaying autovacuum even more is a dangerous path to go down. You have been warned! I'd shell out the bucks for more storage bandwidth. – Laurenz Albe Feb 25 '19 at 14:41
  • I have actually *increased* the frequency of autovacuums (reduced the scale factor from 0.2 to 0.01) while reducing their speed, so each individual autovacuum should have less work to do (see the sketch after these comments). – MrMage Feb 25 '19 at 18:04
  • That will make autovacuum run all the time. You should monitor that it actually gets done and that bloat does not increase. – Laurenz Albe Feb 25 '19 at 18:37
  • How can VACUUMs be scheduled on GCP @LaurenzAlbe? – Tobias Dec 16 '19 at 09:48
  • No idea. Normally you let autovacuum do the job - it schedules the operations automatically. – Laurenz Albe Dec 16 '19 at 10:17
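
A rough sketch of the per-table settings discussed in the comments above (the values are the ones mentioned there, not a recommendation; `big_table` is a placeholder):

    ALTER TABLE big_table SET (
        autovacuum_vacuum_scale_factor = 0.01,  -- vacuum after ~1% of rows are dead (default 0.2)
        autovacuum_vacuum_cost_delay   = 40     -- ms; slows each autovacuum run down
    );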