Long running queries on a postgres aurora read replica have triggered a query conflict with the ERROR: canceling statement due to conflict with recovery
- similar to what is described in this question. I was planning to increase the max_standby_archive_delay
and max_standby_archive_delay
mentioned in the top solution (they are set to the defaults of 14s and 30s). However, I am concerned about what the risks are of increasing this value - particularly the risk of the replica data becoming stale or out-of-date.
My question is: does this error indicate that the replica is querying data that is 14s or more behind the primary? Will increasing these values risk querying data is that is even more out of date?
When I checked the ReplicaLag
metric on AWS, the max during this time was still very low - around 150ms - which is way under the existing default timeouts of 14 and 30s. Is this just the lag to writing to the log but not applying the WAL on the replica? Is there another metric that tells me how "stale" the data is on the replica?
I found this excellent article on replication conflicts and used pg_stat_database_conflicts
to determine that the conflicts were all due to "Buffer pin replication conflicts", which are apparently not straightforward to resolve other than with those settings.