
I'm getting the following error when running a query against a PostgreSQL database in standby mode. The query works fine when it covers up to one month of data, but querying a range longer than one month produces the error.

ERROR: canceling statement due to conflict with recovery
Detail: User query might have needed to see row versions that must be removed

Any suggestions on how to resolve? Thanks

Joe Shaw
AnApprentice
  • Please find the AWS doc which mentions this error; it also has the solution: https://aws.amazon.com/blogs/database/best-practices-for-amazon-rds-postgresql-replication/ – arunjos007 Dec 04 '19 at 10:15

8 Answers


No need to touch hot_standby_feedback. As others have mentioned, setting it to on can bloat the master: imagine opening a transaction on a slave and never closing it.

Instead, set max_standby_archive_delay and max_standby_streaming_delay to sane values:

# /etc/postgresql/10/main/postgresql.conf on a slave
max_standby_archive_delay = 900s
max_standby_streaming_delay = 900s

This way queries on slaves with a duration less than 900 seconds won't be cancelled. If your workload requires longer queries, just set these options to a higher value.
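If editing postgresql.conf by hand is inconvenient, the same values can be set from a superuser session on the slave. This is a minimal sketch, assuming PostgreSQL 9.4+ (for ALTER SYSTEM); both settings take effect on a reload, no restart needed:

ALTER SYSTEM SET max_standby_archive_delay = '900s';
ALTER SYSTEM SET max_standby_streaming_delay = '900s';
SELECT pg_reload_conf();  -- reload configuration without restarting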

The postgres docs discuss this at some length. Key advice from there is:

if the standby server is meant for executing long-running queries, then a high or even infinite delay value [in max_standby_archive_delay and max_standby_streaming_delay] may be preferable

and

Users should be clear that tables that are regularly and heavily updated on the primary server will quickly cause cancellation of longer running queries on the standby. In such cases the setting of a finite value for max_standby_archive_delay or max_standby_streaming_delay can be considered similar to setting statement_timeout.

You can also consider setting vacuum_defer_cleanup_age (on the primary) in combination with the max standby delays. Note that this setting was removed in PostgreSQL 16. As the docs say:

Another option is to increase vacuum_defer_cleanup_age on the primary server, so that dead rows will not be cleaned up as quickly as they normally would be. This will allow more time for queries to execute before they are canceled on the standby, without having to set a high max_standby_streaming_delay

daphtdazz
Max Malysh
  • This is the solution we ended up using. Seems like the best compromise between all the options presented here. – mohit6up Oct 11 '18 at 14:04
  • This is the best answer. Note, as per the docs, these delays are cumulative: if multiple queries on the replica are holding up replication, you can reach 899 seconds and then have another 2-second query cancelled. It's best to implement some exponential back-off in your code. Also, the streaming delay applies while replication is streaming; if replication can't keep up with streaming, it moves to replaying from the archive. If you are replaying from the archive you should probably let it catch up, so `max_standby_archive_delay` might need to be smaller than the other. – Davos Nov 13 '18 at 06:06
  • This is still the best solution here. Take note that in Redshift, you can set this via parameter group settings, only that it should be in `ms`, i.e. 900s = 15 minutes = 900000ms. – WGS May 07 '19 at 07:21
  • To update this on GCP, also done in `ms`: https://cloud.google.com/sql/docs/postgres/flags?hl=en_US&_ga=2.48390694.-1061973720.1536093593 – howMuchCheeseIsTooMuchCheese Jun 19 '19 at 13:37
  • Provided the purpose of the standby is e.g. reporting, and it's not a hot standby that needs to be ready to handle failover, this is absolutely the best answer. – soupdog Jul 18 '19 at 14:12
  • We have a configuration (master and 2 hot standby nodes) and all 3 nodes have the following settings: **max_standby_archive_delay = 30s**, **max_standby_streaming_delay = 30s**, **hot_standby_feedback = off**, but we are facing `canceling statement due to conflict with recovery` errors even for queries which take 100ms-2000ms to complete (throughput during load test ~120 r/s). Why are short-running queries cancelled on standby nodes even though the timeout is set to 30s? – Erikas Neverdauskas Apr 01 '20 at 10:46
  • I was using RDS with a read replica, where my queries were being canceled. To get this to work: create a new parameter group, change the values of `max_standby_archive_delay` and `max_standby_streaming_delay` in the new parameter group, modify the instance or replica to use it, and reboot the instance to apply. Link: https://forums.aws.amazon.com/thread.jspa?messageID=943196&tstart=0 – Danstan Sep 17 '20 at 10:20
  • We used this solution too. It started to happen when we moved from Heroku Postgres (leader and follower) to AWS RDS databases. – lucianokrebs Dec 16 '20 at 18:53
  • This saved my day. I was struggling with the exact same problem for about a day. This solution was perfect. – nagylzs Feb 01 '21 at 06:05
  • @ErikasNeverdauskas Maybe this answers your question: "Note that max_standby_archive_delay is not the same as the maximum length of time a query can run before cancellation; rather it is the maximum total time allowed to apply any one WAL segment's data. Thus, if one query has resulted in significant delay earlier in the WAL segment, subsequent conflicting queries will have much less grace time." Extracted from the [pg docs](https://www.postgresql.org/docs/12/runtime-config-replication.html) – mateusppereira Mar 22 '21 at 16:29
  • Another reason this is the right answer: https://dba.stackexchange.com/questions/329293/postgres-hot-standby-feedback-supersedes-or-uses-vacuum-defer-cleanup-age/329299?noredirect=1#comment641634_329299 – janst Jul 17 '23 at 20:01
  • Isn't this a trade-off? Higher values can lead to more data loss if the master fails and a switchover to the replica occurs. – Adam from WALCZAK.IT Aug 03 '23 at 10:43

Running queries on a hot-standby server is somewhat tricky: a query can fail because rows it needs may be updated or deleted on the primary while it runs. Since the primary does not know that a query has started on the secondary, it thinks it can clean up (vacuum) old versions of those rows. The secondary then has to replay this cleanup, and must forcibly cancel any queries that might still use those rows.

Longer queries will be canceled more often.

You can work around this by starting a repeatable read transaction on the primary that runs a dummy query and then sits idle while the real query runs on the secondary. Its open snapshot will prevent vacuuming of old row versions on the primary.
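A minimal sketch of that workaround, run in a session on the primary that is kept open for the duration of the query on the secondary:

BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT 1;  -- any query: it takes the snapshot that holds back vacuum
-- ... run the long query on the secondary while this session stays open ...
COMMIT;    -- release the snapshot so vacuum can catch up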

This and other workarounds are explained in the Hot Standby — Handling Query Conflicts section of the documentation.

Tometzky
  • To users of PostgreSQL 9.1+: see **eradman**'s answer below for a practical solution. – Zoltán Jul 08 '14 at 16:21
  • To users of PostgreSQL 9.1+: max-malysh's answer is much saner. Don't follow eradman's suggestion unless you understand the risks. – Davos Nov 13 '18 at 06:12

There's no need to start idle transactions on the master. In PostgreSQL 9.1+, the most direct way to solve this problem is by setting

hot_standby_feedback = on

This will make the master aware of long-running queries. From the docs:

The first option is to set the parameter hot_standby_feedback, which prevents VACUUM from removing recently-dead rows and so cleanup conflicts do not occur.

Why isn't this the default? This parameter was added after the initial implementation and it's the only way that a standby can affect a master.
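For reference, a minimal sketch of the change. The setting goes on the standby (the master picks the feedback up over the replication connection), and a reload is enough:

# postgresql.conf on the standby
hot_standby_feedback = on

Then reload the standby with SELECT pg_reload_conf(); no restart is required.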

eradman
  • This param should be set on the standby. – Steve Kehlet Aug 20 '14 at 23:51
  • There are some disadvantages for the master in this case: [Hot-Standby-Feedback](http://postgresql.nabble.com/Hot-Standby-Feedback-should-default-to-on-in-9-3-td5734379.html) – Evgeny Liskovets Sep 30 '15 at 09:03
  • [Web archive link](http://web.archive.org/web/20150324004227/http://postgresql.nabble.com/Hot-Standby-Feedback-should-default-to-on-in-9-3-td5734379.html) for the link in @EvgenyLiskovets's comment. – D Malan May 19 '22 at 12:39

As stated here about hot_standby_feedback = on:

Well, the disadvantage of it is that the standby can bloat the master, which might be surprising to some people, too

And here:

With what setting of max_standby_streaming_delay? I would rather default that to -1 than default hot_standby_feedback on. That way what you do on the standby only affects the standby


So I added

max_standby_streaming_delay = -1

And no more pg_dump error for us, nor master bloat :)

For AWS RDS instance, check http://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Appendix.PostgreSQL.CommonDBATasks.html

Gilles Quénot

The table data on the hot-standby slave server is modified while a long-running query is running there. A solution (PostgreSQL 9.1+) to make sure the table data is not modified is to suspend replication replay and resume it after the query:

select pg_xlog_replay_pause(); -- suspend replay (pg_wal_replay_pause() in PostgreSQL 10+)
select * from foo; -- your query
select pg_xlog_replay_resume(); -- resume replay (pg_wal_replay_resume() in PostgreSQL 10+)

Note that while replay is paused the standby falls further behind the primary, so let it catch up afterwards.
David Jaspers

I'm going to add some updated info and references to @max-malysh's excellent answer above.

In short, if you do something on the master, it needs to be replicated on the slave. Postgres uses WAL records for this, which are sent after every logged action on the master to the slave. The slave then executes the action and the two are again in sync. In one of several scenarios, you can be in conflict on the slave with what's coming in from the master in a WAL action. In most of them, there's a transaction happening on the slave which conflicts with what the WAL action wants to change. In that case, you have two options:

  1. Delay the application of the WAL action for a bit, allowing the slave to finish its conflicting transaction, then apply the action.
  2. Cancel the conflicting query on the slave.

We're concerned with #1, and two values:

  • max_standby_archive_delay - the delay used after a long disconnection between master and slave, when WAL data is being read from an archive rather than streamed (i.e., it is not current data).
  • max_standby_streaming_delay - the delay used for cancelling queries when WAL entries are received via streaming replication.

Generally, if your server is meant for high-availability replication, you want to keep these numbers short. The default setting of 30000 (milliseconds if no units are given) is sufficient for this. If, however, you want to set up something like an archive, reporting, or read replica that might run very long queries, then you'll want to set this to something higher to avoid cancelled queries. The recommended 900s setting above seems like a good starting point. I disagree with the official docs on setting an infinite value of -1 being a good idea; that could mask buggy code and cause lots of issues.

The one caveat about long-running queries and setting these values higher: other queries running on the slave in parallel with the long-running one that is delaying the WAL action will see old data until the long query has completed. Developers need to understand this and serialize queries which shouldn't run simultaneously.

For the full explanation of how max_standby_archive_delay and max_standby_streaming_delay work and why, see the PostgreSQL documentation on standby server settings.

Artif3x

It might be too late for an answer, but we faced the same kind of issue in production. Earlier we had only one RDS instance, and as the number of users increased on the app side, we decided to add a read replica. The read replica worked properly on staging, but once we moved to production we started getting the same error.

We solved this by enabling the hot_standby_feedback property in the Postgres settings. We referred to the following link:

https://aws.amazon.com/blogs/database/best-practices-for-amazon-rds-postgresql-replication/

I hope it will help.

Tushar.k

Likewise, here's a second caveat to @Artif3x's elaboration of @max-malysh's excellent answer, both above.

With any delayed application of transactions from the master, the follower(s) will have an older, stale view of the data. So while giving queries on the follower time to finish by setting max_standby_archive_delay and max_standby_streaming_delay makes sense, keep both of these caveats in mind:

  • the value of the follower as a standby / backup diminishes
  • any other queries running on the follower may return stale data.

If the value of the follower for backup ends up being too much in conflict with hosting queries, one solution would be multiple followers, each optimized for one or the other.

Also, note that several queries in a row can cause the application of WAL entries to keep being delayed. So when choosing new values, consider not just the time for a single query, but a moving window that starts whenever a conflicting query starts and ends when the WAL entry is finally applied.

bob