The PostgreSQL service on standby node starts and stops again and again after Patroni started up

Question

The PostgreSQL service on the standby node of my PostgreSQL cluster starts and stops again and again after I started the Patroni service.

I want to build a PostgreSQL HA cluster on two computers using Patroni 1.6.4 and etcd 3.3. Firstly, I built an etcd cluster, which is healthy.

member 230e646882873b50 is healthy: got healthy result from http://10.19.170.119:2379
member afcefe35d67a646c is healthy: got healthy result from http://10.19.170.155:2379
cluster is healthy

Next, I built a stream replication PostgreSQL cluster on two computers (running on port 5433), which worked fine.

Then I stopped then PostgreSQL cluster and started Patroni on the primary server and the standby server.

The PostgreSQL and Patroni service on the primary server seemed working normally.

2020-04-02 18:17:22,402 INFO: Lock owner: pgsql_119; I am pgsql_119
2020-04-02 18:17:22,430 INFO: no action.  i am the leader with the lock
2020-04-02 18:17:26,402 INFO: Lock owner: pgsql_119; I am pgsql_119
2020-04-02 18:17:26,430 INFO: no action.  i am the leader with the lock

But a problem occurred on the standby server. Patroni on the standby server printed messages as follows:

2020-04-02 18:45:25,995 INFO: no action.  i am a secondary and i am following aleader
2020-04-02 18:45:27,285 INFO: closed patroni connection to the postgresql cluster
2020-04-02 18:45:27,722 INFO: postmaster pid=7448
2020-04-02 18:45:27.832 HKT [7448] LOG:  listening on IPv4 address "0.0.0.0", port 5433
2020-04-02 18:45:27.994 HKT [7448] LOG:  redirecting log output to logging collector process
2020-04-02 18:45:27.994 HKT [7448] HINT:  Future log output will appear in directory "log".
2020-04-02 18:45:30,058 INFO: Lock owner: pgsql_node119; I am pgsql_node155
2020-04-02 18:45:30,058 INFO: does not have lock
2020-04-02 18:45:30,058 INFO: establishing a new patroni connection to the postgres cluster
2020-04-02 18:45:31,162 INFO: no action.  i am a secondary and i am following aleader
2020-04-02 18:45:32,460 INFO: closed patroni connection to the postgresql cluster
2020-04-02 18:45:32,875 INFO: postmaster pid=8820
2020-04-02 18:45:32.996 HKT [8820] LOG:  listening on IPv4 address "0.0.0.0", port 5433
2020-04-02 18:45:33.161 HKT [8820] LOG:  redirecting log output to logging collector process
2020-04-02 18:45:33.161 HKT [8820] HINT:  Future log output will appear in directory "log".
2020-04-02 18:45:35,211 INFO: Lock owner: pgsql_node119; I am pgsql_node155
2020-04-02 18:45:35,211 INFO: does not have lock
2020-04-02 18:45:35,211 INFO: establishing a new patroni connection to the postgres cluster
2020-04-02 18:45:37,215 INFO: establishing a new patroni connection to the postgres cluster

The content of postgresql logs repeated as follows:

FATAL:  the database system is starting up
LOG:  redo starts at 0/3A000060
LOG:  consistent recovery state reached at 0/3A000140
LOG:  invalid record length at 0/3A000140: wanted 24, got 0
LOG:  database system is ready to accept read only connections
LOG:  started streaming WAL from primary at 0/3A000000 on timeline 40
LOG:  received fast shutdown request
LOG:  aborting any active transactions
FATAL:  terminating connection due to administrator command
FATAL:  terminating walreceiver process due to administrator command
LOG:  shutting down
LOG:  database system is shut down
LOG:  database system was shut down in recovery at 2020-04-02 18:50:20 CST

That meant PostgreSQL on the standby server restarted every 5 seconds!

Here is one of my patroni.yml. Another one is the same except the ip address.

scope: pgsql
namespace: /pgsql/
name: pgsql_node119

restapi:
  listen: 10.19.170.119:8008
  connect_address: 10.19.170.119:8008
 
etcd:
  host: 10.19.170.119:2379
 
bootstrap:
  # this section will be written into Etcd:/<namespace>/<scope>/config after initializing new cluster
  # and all other cluster members will use it as a `global configuration`
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
    master_start_timeout: 300
    synchronous_mode: false
    # check_timeline: true
    postgresql:
      use_pg_rewind: true
      use_slots: true
 
postgresql:
  listen: 0.0.0.0:5433
  connect_address: 10.19.170.119:5433
  data_dir: "/opt/postgresql-11/data"
  bin_dir: "/opt/postgresql-11/bin"
#  config_dir: /etc/postgresql/9.6/main
  authentication:
    replication:
      username: repuser
      password: repuserpwd
    superuser:
      username: postgres
      password: postgrespwd
 
#watchdog:
#  mode: automatic # Allowed values: off, automatic, required
#  device: /dev/watchdog
#  safety_margin: 5
 
tags:
    nofailover: false
    noloadbalance: false
    clonefrom: false
    nosync: false

Do you know why this happened or how to solve this problem? Thank you.

Is there any message in the Windows logs for the postgreSQL database service ? — pifor, Apr 02 '20 at 13:35

score 0 · Answer 1 · answered May 20 '21 at 12:31

I found the similar problem in this issue of Patroni on github.

The immediate cause of the problem that Patroni on the standby server retart PostgreSQL so frequently is that the password file required by Patroni is modified by another process. There was an important parameter of postgresql, pgpass, that was not configured in my patroni.yml. The default value is "$HOME/.pgpass" ($HOME is the home directory of user "postgres").

However, the file "$HOME/.pgpass" may be modified by other applications. This will make Patroni service abnormal.

The solution is to set pgpass to the path of a .pgpass only accessed by Patroni. For example,

pgpass: /etc/patroni/.pgpass

The PostgreSQL service on standby node starts and stops again and again after Patroni started up

1 Answers1