
I am using 3 RDS Postgres instances (1 master, 2 slaves, version 13), and my Postgres configuration uses the default parameter group:

max_standby_archive_delay = 30s
max_standby_streaming_delay = 30s
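
For reference, the values actually in effect on the replica can be checked with something like this (run on the standby):

SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('max_standby_archive_delay', 'max_standby_streaming_delay');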

Recently, the following error has been occurring frequently:

ERROR: canceling statement due to conflict with recovery
Detail: User query might have needed to see row versions that must be removed

According to the Postgres documentation, this error can occur when cleanup (VACUUM) on the master removes row versions that a query still running on the slave needs to see in its snapshot.

However, I did not run VACUUM manually, and when I look up last_autovacuum on the master, it shows a time long before the error occurred.
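
This is roughly the query I used to check it on the master (pg_stat_all_tables also exposes last_vacuum for manual runs):

SELECT schemaname, relname, last_vacuum, last_autovacuum
FROM pg_catalog.pg_stat_all_tables
ORDER BY last_autovacuum DESC NULLS LAST;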

A lot of data was being modified and inserted by API requests at that time, but none of my queries actually takes more than 3 seconds.
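
As a rough sanity check for that claim, a query along these lines on the slave should catch anything running longer than 3 seconds (the threshold is only illustrative):

SELECT pid, now() - query_start AS runtime, state, query
FROM pg_stat_activity
WHERE state <> 'idle'
  AND now() - query_start > interval '3 seconds'
ORDER BY runtime DESC;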

  1. Why does the error occur? VACUUM was never run manually, and autovacuum was not running at that time.
  2. Does the max_standby_streaming_delay setting mean the maximum total waiting time counted from when replay was first delayed?

For example, suppose WAL replay on the slave is delayed because a query that takes 3 seconds is running. If another 3-second query starts just before the first one ends, and this keeps repeating, will the queries still running 30 seconds after the first delay started be canceled with the above error?

00:00:00 -> running Query1 in slave (run time 3s)
00:00:01 -> slave wal_receiver receives WAL to apply (but replay is delayed because of Query1)
00:00:02 -> Query2 request in slave (run time 3s)
00:00:03 -> Query1 end and Query3 request in slave (run time 3s)
00:00:04 -> Query4 request in slave (run time 3s)
....
00:00:30 -> Query30 request in slave (run time 3s)
00:00:31 -> the delayed WAL is finally replayed. (Are Query29 and Query30 canceled with the above error message?)
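
For reference, the current replay lag and the conflict counters can be checked on the standby with something like this (in my case only confl_snapshot increases):

SELECT now() - pg_last_xact_replay_timestamp() AS replay_lag;

SELECT datname, confl_snapshot, confl_lock, confl_bufferpin, confl_deadlock
FROM pg_stat_database_conflicts;
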
  • What is "last_auto_vacuum"? – jjanes Nov 02 '22 at 13:17
  • You have a block quote there, but what is it you are quoting? – jjanes Nov 02 '22 at 13:19
  • You can get replication conflicts even if you don't run `VACUUM` manually. See [this article](https://www.cybertec-postgresql.com/en/streaming-replication-conflicts-in-postgresql/) for details. – Laurenz Albe Nov 02 '22 at 13:31
  • @LaurenzAlbe I also read that article. I tried to consider the possibility of other conflicts, but only confl_snapshot is increasing in the output of `select * from pg_stat_database_conflicts`. – prospective developer Nov 02 '22 at 16:01
  • There is also "page pruning" that removes table rows without `VACUUM`. That can also lead to a snapshot conflict. Hint: you will never totally get rid of replication conflicts. – Laurenz Albe Nov 02 '22 at 16:05
  • @jjanes last_autovacuum comes from the result of `select * from pg_catalog.pg_stat_all_tables`. Sorry, using quote blocks was a mistake; the content in the quote block is an example for question 2. – prospective developer Nov 02 '22 at 16:05
  • @LaurenzAlbe Thank you for the answer. I understand that not all replication conflicts can be prevented, but over 10% of requests within 3 hours failed due to conflicts, so an accurate root-cause analysis is needed. Is there any documentation showing that snapshot conflicts can arise for reasons other than `VACUUM`? – prospective developer Nov 02 '22 at 16:35
  • pg_stat_all_tables.last_autovacuum doesn't update until the vacuum finishes, so there could still be an ongoing autovacuum. Also, things other than vacuum, such as HOT-chain pruning, can cause this issue. – jjanes Nov 02 '22 at 16:41
  • That's all documented in the source. Start with `src/backend/access/heap/README.HOT`. – Laurenz Albe Nov 02 '22 at 20:08

0 Answers