Replication on Postgresql pauses when Querying and replication are happening simultaneously

Question

Postgress follows MVCC rules. So any query that is run on a table doesn't conflict with the writes that happen on the table. The query returns the result based on the snapshot at the point of running the query.

Now i have a master and slave. The slave is used by analysts to run queries and to perform analysis. When the slave is replicating and when analyst are running their queries simultaneously, i can see the replication lag for a long time.If the queries are long running, the replication lags a long duration and if the number of writes on the master happens to be pretty high, then i end up losing the WAL files and replication can longer proceed. I just have to spin up another slave. Why does this happen ? How do i allow queries and replication to happen simultaneously on postures ? Is there any parameter setting that i can apply to make this happen ?

To add, postgress replication is single threaded...So if i am querying table A on slave, i guess table A gets locks for replication. Now if table B has inserts/updates happening on the master, they are waiting on the WAL files behind table A...So unless querying on table A completes, replication cannot proceed...If it were multithreaded, table B insert/updates would know it has nothing to do with table A and can proceed without delays...Am i right in understanding postgres replication mechanism ? — Ramanan, May 06 '16 at 01:23
" the replication lags a long duration and if the number of writes on the master happens to be pretty high, then i end up losing the WAL files and replication can longer proceed" that defintely indicates that replication hasn't been set up properly. Why don't you update the question with more details. Extracts from your config file would also help. — e4c5, May 06 '16 at 01:42
Also this question might be better suited for dba.stackexchange.com — e4c5, May 06 '16 at 01:43

score 2 · Answer 1 · edited May 23 '17 at 10:28

The replica can't apply more WAL from the master because the master might've overwritten data blocks still needed by queries running on the replica that're older than any still running on the master. The replica needs older row versions than the master. It's exactly because of MVCC that this pause is necessary.

You probably set a high max_standby_streaming_delay to avoid "canceling statement due to conflict with recovery" errors.

If you turn hot_standby_feedback on, the replica can instead tell the master to keep those rows. But the master can't clean up free space as efficiently then, and it might run out of space in pg_xlog if the standby gets way too far behind.

See PostgreSQL manual: Handling Query Conflicts.

As for the WAL retention part: enable WAL archiving and a restore_command for your standbys. You should really be using it anyway, for point-in-time recovery. PgBarman now makes this easy with the barman get-wal command. If you don't want WAL archiving you can instead set your replica servers up to use a replication slot to connect to the master, so the master knows to retain the WAL they need indefinitely. Of course, that can cause the master to run out of space in pg_xlog and stop running so you need to monitor more closely if you do that.

Thanks for the response. I will try out hot_standby_feedback on with our setup. My understanding from your explanation is that dead rows will still be kept on the master server, hence bloating the tables for some time...So when does the cleanup on the primary happen ? Does the cleanup happen when the query finishes executing? — Ramanan, May 06 '16 at 14:27

Replication on Postgresql pauses when Querying and replication are happening simultaneously

1 Answers1