Configuring repmgr on PostgreSQL for Replication with Automatic Failover

PostgreSQL does not include automatic failover out of the box.
When the primary goes down, someone has to promote the standby manually — which means downtime.
repmgr adds an automatic failover daemon (repmgrd) that monitors the cluster and promotes the standby once the primary is confirmed down, about a minute after the outage with the default settings.
This guide walks through setting up a two-node PostgreSQL 18 cluster with streaming replication and automatic failover on Ubuntu 24.04, using repmgr 5.x.
Every step has been run live on a real cluster and the output verified.


A streaming standby that requires manual promotion is not a high-availability setup — it is a disaster recovery setup.

The difference matters during an incident at 3 AM.

PostgreSQL ships with the building blocks for replication, but the automatic failover logic lives in a separate tool.

repmgr is the tool most PostgreSQL DBAs reach for first: it is lightweight, well-documented, and integrates cleanly with systemd.

This guide builds the full stack: streaming replication from scratch, repmgrd running as a daemon, and a tested automatic failover that recovers the cluster without human intervention.



The Environment

This guide uses two Ubuntu 24.04 servers on the same subnet.

Host      IP               Initial role
server1   192.168.0.181    Primary
server2   192.168.0.182    Standby

PostgreSQL 18 and repmgr 5.5.0 are installed from the PGDG repository on both servers.


Step 1 — Install PostgreSQL 18 and repmgr

PostgreSQL 18 is not in the Ubuntu 24.04 default repository.
The postgresql-common package includes an official script that adds the PGDG APT repository and signing key automatically.

On both servers:

sudo apt install -y postgresql-common
sudo /usr/share/postgresql-common/pgdg/apt.postgresql.org.sh

sudo apt update
sudo apt install -y postgresql-18 postgresql-18-repmgr

Verify the installation:

psql --version
# Expected: psql (PostgreSQL) 18.x

repmgr --version
# Expected: repmgr 5.x

On server2, stop PostgreSQL and remove the default data directory.
repmgr will clone the primary’s data directory to server2 in a later step — if a data directory already exists, repmgr standby clone will refuse to proceed.

# On server2
sudo systemctl stop postgresql
sudo rm -rf /var/lib/postgresql/18/main

Step 2 — Configure PostgreSQL for Streaming Replication

These parameters must be set in /etc/postgresql/18/main/postgresql.conf on both servers before setting up replication.
repmgr requires wal_log_hints and shared_preload_libraries: without them, the node rejoin process (pg_rewind) and the repmgrd daemon will not work.

On both servers, edit /etc/postgresql/18/main/postgresql.conf:

listen_addresses = '*'
wal_level = replica
max_wal_senders = 10
max_replication_slots = 10
hot_standby = on
wal_log_hints = on
shared_preload_libraries = 'repmgr'

These parameters require a PostgreSQL restart to take effect.
Do not restart yet — add the pg_hba.conf entries first so that you do not restart twice.
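Before restarting, it can help to confirm that the file really contains the values listed above. The following is a minimal sketch, not part of PostgreSQL or repmgr: check_repl_settings is a hypothetical helper that reads one postgresql.conf-style file and reports the settings that are missing or different. It assumes the simple key = value form used in this guide and does not follow include files such as conf.d, so treat it as a quick sanity check rather than a definitive answer.

```shell
# check_repl_settings FILE
# Hypothetical helper: compares a few settings from Step 2 against a
# postgresql.conf-style file. Returns the number of mismatches (0 = all good).
check_repl_settings() {
  local conf="$1" missing=0 entry key want got
  for entry in "wal_level=replica" "hot_standby=on" "wal_log_hints=on" \
               "shared_preload_libraries=repmgr"; do
    key="${entry%%=*}"
    want="${entry#*=}"
    # Last uncommented assignment wins; strip trailing comments, quotes, spaces.
    got="$(sed -n "s/^[[:space:]]*${key}[[:space:]]*=[[:space:]]*//p" "$conf" \
           | tail -n 1 \
           | sed "s/[[:space:]]*#.*\$//; s/'//g; s/[[:space:]]*\$//")"
    if [ "$got" != "$want" ]; then
      echo "MISMATCH: ${key} is '${got:-unset}', expected '${want}'"
      missing=$((missing + 1))
    fi
  done
  [ "$missing" -eq 0 ] && echo "OK: all required settings present"
  return "$missing"
}
```

A typical call would be check_repl_settings /etc/postgresql/18/main/postgresql.conf on each server. The exact-match comparison is a simplification: a shared_preload_libraries value that lists several libraries would be reported as a mismatch even though it is valid.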


Step 3 — Allow Replication Connections in pg_hba.conf

On both servers, append these lines to /etc/postgresql/18/main/pg_hba.conf.
These allow the repmgr user to connect for both management queries and WAL streaming from either node.

sudo tee -a /etc/postgresql/18/main/pg_hba.conf > /dev/null << 'EOF'

host    repmgr          repmgr          192.168.0.181/32        scram-sha-256
host    repmgr          repmgr          192.168.0.182/32        scram-sha-256
host    replication     repmgr          192.168.0.181/32        scram-sha-256
host    replication     repmgr          192.168.0.182/32        scram-sha-256
EOF

Now restart PostgreSQL on server1 (server2 has no data directory yet):

# On server1
sudo systemctl restart postgresql

Step 4 — Create the repmgr User and Database

Run these commands on server1 (the primary) only.
The standby will receive the repmgr schema through replication in a later step.

sudo -u postgres psql -c "CREATE USER repmgr WITH SUPERUSER REPLICATION LOGIN PASSWORD 'repmgr';"
sudo -u postgres psql -c "CREATE DATABASE repmgr OWNER repmgr;"

Add a .pgpass file for the postgres OS user on both servers so that repmgr can connect without a password prompt.

On server1:

sudo -u postgres bash -c 'cat > /var/lib/postgresql/.pgpass <<EOF
192.168.0.181:5432:repmgr:repmgr:repmgr
192.168.0.182:5432:repmgr:repmgr:repmgr
192.168.0.181:5432:replication:repmgr:repmgr
192.168.0.182:5432:replication:repmgr:repmgr
EOF'
sudo chmod 600 /var/lib/postgresql/.pgpass

On server2:

sudo -u postgres bash -c 'cat > /var/lib/postgresql/.pgpass <<EOF
192.168.0.181:5432:repmgr:repmgr:repmgr
192.168.0.182:5432:repmgr:repmgr:repmgr
192.168.0.181:5432:replication:repmgr:repmgr
192.168.0.182:5432:replication:repmgr:repmgr
EOF'
sudo chmod 600 /var/lib/postgresql/.pgpass
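The chmod 600 matters because libpq ignores a password file that group or others can access. As a small sketch of how that could be checked on each node, the hypothetical helper check_pgpass below inspects the file mode; it relies on GNU stat as shipped with Ubuntu and is an illustration, not something repmgr provides.

```shell
# check_pgpass FILE
# Hypothetical helper: succeeds only if FILE exists and grants no
# permissions to group or others (for example mode 600 or 400).
check_pgpass() {
  local file="$1" mode
  [ -f "$file" ] || { echo "MISSING: $file"; return 1; }
  mode="$(stat -c '%a' "$file")"   # GNU stat prints the octal mode, e.g. 600
  case "$mode" in
    [0-7]00) echo "OK: $file has mode $mode" ;;
    *)       echo "BAD: $file has mode $mode, expected 600"; return 1 ;;
  esac
}
```

On the servers in this guide it would be run as sudo -u postgres bash with the function defined, then check_pgpass /var/lib/postgresql/.pgpass.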

Step 5 — Configure repmgr on Both Nodes

Create /etc/repmgr.conf on each server.
The pg_bindir parameter is required on Ubuntu: the pg_rewind binary is not in the default PATH for the postgres OS user, and repmgr needs it during node rejoin after failover.
The service_start_command and service_stop_command parameters are also required. Without service_stop_command, planned switchovers fail with "primary shutdown could not be confirmed"; without service_start_command, a node cannot rejoin the cluster automatically after being rewound.

On server1:

sudo tee /etc/repmgr.conf > /dev/null << 'EOF'
node_id=1
node_name='server1'
conninfo='host=192.168.0.181 user=repmgr dbname=repmgr connect_timeout=2'
data_directory='/var/lib/postgresql/18/main'

failover=automatic
promote_command='repmgr standby promote -f /etc/repmgr.conf --log-to-file'
follow_command='repmgr standby follow -f /etc/repmgr.conf --upstream-node-id=%n --log-to-file'

use_replication_slots=yes
monitoring_history=yes
log_file='/var/log/repmgr/repmgr.log'

pg_bindir='/usr/lib/postgresql/18/bin'
service_start_command='sudo systemctl start postgresql'
service_stop_command='sudo systemctl stop postgresql'

node_rejoin_timeout=120
standby_reconnect_timeout=120
EOF

sudo chown postgres:postgres /etc/repmgr.conf
sudo chmod 640 /etc/repmgr.conf

On server2, only node_id, node_name, and the host in conninfo differ:

sudo tee /etc/repmgr.conf > /dev/null << 'EOF'
node_id=2
node_name='server2'
conninfo='host=192.168.0.182 user=repmgr dbname=repmgr connect_timeout=2'
data_directory='/var/lib/postgresql/18/main'

failover=automatic
promote_command='repmgr standby promote -f /etc/repmgr.conf --log-to-file'
follow_command='repmgr standby follow -f /etc/repmgr.conf --upstream-node-id=%n --log-to-file'

use_replication_slots=yes
monitoring_history=yes
log_file='/var/log/repmgr/repmgr.log'

pg_bindir='/usr/lib/postgresql/18/bin'
service_start_command='sudo systemctl start postgresql'
service_stop_command='sudo systemctl stop postgresql'

node_rejoin_timeout=120
standby_reconnect_timeout=120
EOF

sudo chown postgres:postgres /etc/repmgr.conf
sudo chmod 640 /etc/repmgr.conf
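Because the two files are identical apart from the node-specific values, one way to avoid copy-and-paste drift is to generate them from a single template. The sketch below assumes the settings shown above; render_repmgr_conf is a hypothetical helper name, and the function only prints the file contents so that they can be reviewed before being written anywhere.

```shell
# render_repmgr_conf NODE_ID NODE_NAME NODE_IP
# Hypothetical helper: prints a repmgr.conf in which only node_id,
# node_name and the conninfo host vary between nodes.
render_repmgr_conf() {
  local node_id="$1" node_name="$2" node_ip="$3"
  cat <<EOF
node_id=${node_id}
node_name='${node_name}'
conninfo='host=${node_ip} user=repmgr dbname=repmgr connect_timeout=2'
data_directory='/var/lib/postgresql/18/main'

failover=automatic
promote_command='repmgr standby promote -f /etc/repmgr.conf --log-to-file'
follow_command='repmgr standby follow -f /etc/repmgr.conf --upstream-node-id=%n --log-to-file'

use_replication_slots=yes
monitoring_history=yes
log_file='/var/log/repmgr/repmgr.log'

pg_bindir='/usr/lib/postgresql/18/bin'
service_start_command='sudo systemctl start postgresql'
service_stop_command='sudo systemctl stop postgresql'

node_rejoin_timeout=120
standby_reconnect_timeout=120
EOF
}
```

For example, render_repmgr_conf 2 server2 192.168.0.182 | sudo tee /etc/repmgr.conf > /dev/null would write the server2 file, after which the chown and chmod commands above still apply.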

Create the log directory on both servers:

sudo mkdir -p /var/log/repmgr
sudo chown postgres:postgres /var/log/repmgr

Step 6 — Set Up SSH Keys and Passwordless sudo

Planned switchovers require repmgr to SSH from one node to the other as the postgres OS user and run systemctl stop postgresql.
This step is mandatory — without it, switchovers fail at the point where repmgr tries to stop the current primary remotely.

SSH key setup:

On server1 — generate a key and display the public key:

# On server1
sudo -u postgres ssh-keygen -t ed25519 -N '' -f /var/lib/postgresql/.ssh/id_ed25519
sudo -u postgres cat /var/lib/postgresql/.ssh/id_ed25519.pub

On server2 — authorise server1’s public key:

# On server2
sudo -u postgres mkdir -p /var/lib/postgresql/.ssh
# Paste the public key from server1 in place of <server1-public-key>
sudo -u postgres bash -c 'echo "<server1-public-key>" >> /var/lib/postgresql/.ssh/authorized_keys'
sudo chmod 700 /var/lib/postgresql/.ssh
sudo chmod 600 /var/lib/postgresql/.ssh/authorized_keys
sudo chown -R postgres:postgres /var/lib/postgresql/.ssh

Repeat in the other direction — on server2 generate a key, then authorise it on server1:

# On server2
sudo -u postgres ssh-keygen -t ed25519 -N '' -f /var/lib/postgresql/.ssh/id_ed25519
sudo -u postgres cat /var/lib/postgresql/.ssh/id_ed25519.pub
# On server1
sudo -u postgres mkdir -p /var/lib/postgresql/.ssh
sudo -u postgres bash -c 'echo "<server2-public-key>" >> /var/lib/postgresql/.ssh/authorized_keys'
sudo chmod 700 /var/lib/postgresql/.ssh
sudo chmod 600 /var/lib/postgresql/.ssh/authorized_keys
sudo chown -R postgres:postgres /var/lib/postgresql/.ssh

Verify from each node (type yes if prompted to accept the host key — only the first time):

# On server1
sudo -u postgres ssh postgres@192.168.0.182 "echo OK"
# Expected: OK
# On server2
sudo -u postgres ssh postgres@192.168.0.181 "echo OK"
# Expected: OK

Passwordless sudo — on both servers:

Create /etc/sudoers.d/postgres-repmgr to allow the postgres user to start and stop PostgreSQL without a password.
The path must be /usr/bin/systemctl: on Ubuntu 24.04 this is where systemctl lives, and sudo matches the exact path.
Set the file's permissions to 440, the standard mode for sudoers files; sudo does not load a sudoers file whose ownership or write permissions it considers unsafe.

sudo tee /etc/sudoers.d/postgres-repmgr > /dev/null << 'EOF'
postgres ALL=(ALL) NOPASSWD: /usr/bin/systemctl start postgresql, /usr/bin/systemctl stop postgresql, /usr/bin/systemctl restart postgresql
EOF
sudo chmod 440 /etc/sudoers.d/postgres-repmgr

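Verify that the rule is active (no password prompt expected):

sudo -u postgres sudo -n -l /usr/bin/systemctl stop postgresql
# Expected: the permitted command is printed back; an error means the rule did not match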

Step 7 — Register the Primary and Clone the Standby

On server1, register the running PostgreSQL instance as the primary node:

# On server1
sudo -u postgres repmgr -f /etc/repmgr.conf primary register

On server2, run a dry run first to confirm connectivity:

# On server2
sudo -u postgres repmgr -h 192.168.0.181 -U repmgr -d repmgr \
  -f /etc/repmgr.conf standby clone --dry-run
# Expected: "STANDBY CLONE (target node \"server2\") would complete successfully"

Then run the actual clone:

# On server2
sudo -u postgres repmgr -h 192.168.0.181 -U repmgr -d repmgr \
  -f /etc/repmgr.conf standby clone

On Ubuntu, postgresql.conf and pg_hba.conf live in /etc/postgresql/18/main/, outside the data directory.
pg_basebackup only clones the data directory, so these config files are not copied automatically.
Copy them from server1 to server2:

# On server1 — stage the files for transfer
sudo cp /etc/postgresql/18/main/postgresql.conf /tmp/postgresql.conf
sudo cp /etc/postgresql/18/main/pg_hba.conf /tmp/pg_hba.conf
sudo chmod 644 /tmp/postgresql.conf /tmp/pg_hba.conf
# On server2: copy and install (replace fernando with your own SSH login on server1)
scp fernando@192.168.0.181:/tmp/postgresql.conf /tmp/postgresql.conf
scp fernando@192.168.0.181:/tmp/pg_hba.conf /tmp/pg_hba.conf
sudo cp /tmp/postgresql.conf /etc/postgresql/18/main/postgresql.conf
sudo cp /tmp/pg_hba.conf /etc/postgresql/18/main/pg_hba.conf

Start PostgreSQL on server2 and register it as a standby:

# On server2
sudo systemctl start postgresql
sudo -u postgres repmgr -f /etc/repmgr.conf standby register

Verify the cluster from either node:

sudo -u postgres repmgr -f /etc/repmgr.conf cluster show

Expected output:

 ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
----+---------+---------+-----------+----------+----------+----------+----------+----------------------------------------------------------
 1  | server1 | primary | * running |          | default  | 100      | 1        | host=192.168.0.181 user=repmgr dbname=repmgr connect_timeout=2
 2  | server2 | standby |   running | server1  | default  | 100      | 1        | host=192.168.0.182 user=repmgr dbname=repmgr connect_timeout=2
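For scripted checks, the same table can be reduced to a single number: how many nodes report themselves as a running primary. The sketch below assumes the column layout shown above (role in the third column, status in the fourth); count_running_primaries is a hypothetical helper, not a repmgr command. A healthy cluster should yield 1, and any other value deserves a closer look.

```shell
# count_running_primaries
# Hypothetical helper: reads "repmgr cluster show" table output on stdin and
# prints how many rows have role "primary" and status "running".
count_running_primaries() {
  awk -F'|' '
    {
      role = $3; status = $4
      gsub(/[ \t]/, "", role)      # drop padding around the role
      gsub(/[ \t*]/, "", status)   # drop padding and the "*" primary marker
    }
    role == "primary" && status == "running" { n++ }
    END { print n + 0 }
  '
}
```

It would be used as sudo -u postgres repmgr -f /etc/repmgr.conf cluster show | count_running_primaries.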

Step 8 — Start repmgrd for Automatic Failover

repmgrd is the monitoring daemon that triggers automatic failover.
On Ubuntu 24.04, repmgr does not ship with a systemd unit file for repmgrd — start it manually using --daemonize.

On both servers:

sudo -u postgres repmgrd -f /etc/repmgr.conf --daemonize

Verify the daemon is running and not paused:

sudo -u postgres repmgr daemon status

Expected output:

 ID | Name    | Role    | Status    | repmgrd Active | PID  | Paused? | Upstream
----+---------+---------+-----------+----------------+------+---------+---------
 1  | server1 | primary | * running | yes            | XXXX | no      | n/a
 2  | server2 | standby |   running | yes            | XXXX | no      | server1

The Paused? column is critical: repmgrd pauses itself after a failed failover attempt and will not attempt another failover while paused.
Before any failover test, verify that both nodes show no in the Paused? column.
If a node shows yes, run:

sudo -u postgres repmgr daemon unpause
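The paused check is easy to forget before a test, so it may be worth scripting. The sketch below is a hypothetical helper, paused_nodes, that reads the status table on stdin, finds the column whose header is Paused?, and prints the name of every node marked yes. It assumes a pipe-separated table with the node name in the second column, as in the output above.

```shell
# paused_nodes
# Hypothetical helper: reads "repmgr daemon status" table output on stdin and
# prints the names of nodes whose "Paused?" column says "yes".
paused_nodes() {
  awk -F'|' '
    function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
    # Until the header row is found, look for the "Paused?" column index.
    col == 0 { for (i = 1; i <= NF; i++) if (trim($i) == "Paused?") col = i; next }
    # Data rows: print the node name when that column reads "yes".
    NF > 1 && trim($col) == "yes" { print trim($2) }
  '
}
```

Used as sudo -u postgres repmgr daemon status | paused_nodes, empty output means no node is paused. If the header row is not found at all, the function also prints nothing, so it is best paired with a look at the raw table the first time.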

Step 9 — Test Automatic Failover

Stop PostgreSQL on server1 to simulate a primary failure:

# On server1
sudo systemctl stop postgresql

On server2, watch repmgrd react:

# On server2
sudo tail -f -n 50 /var/log/repmgr/repmgr.log

repmgrd waits through a configurable number of reconnect attempts (reconnect_attempts × reconnect_interval in repmgr.conf; by default 6 attempts × 10 seconds, about 60 seconds) before promoting the standby.
When promotion completes, the log shows:

NOTICE: promoting standby to primary
NOTICE: STANDBY PROMOTE successful
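The wait before promotion follows directly from two repmgr.conf settings, reconnect_attempts and reconnect_interval. As a rough sketch of the arithmetic, the hypothetical helper below multiplies the two values; the result is only an approximation of the detection window, since connection timeouts and the promotion itself add to it.

```shell
# failover_window [ATTEMPTS] [INTERVAL_SECONDS]
# Hypothetical helper: approximate seconds repmgrd spends retrying the
# primary before it starts a failover. Defaults mirror repmgr's defaults.
failover_window() {
  local attempts="${1:-6}" interval="${2:-10}"
  echo $((attempts * interval))
}

failover_window        # defaults: 6 attempts x 10 s, prints 60
failover_window 3 5    # a more aggressive setting, prints 15
```

Lower values shorten the outage but make a brief network glitch more likely to trigger an unnecessary promotion, so any change should be tested against the network the cluster actually runs on.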

Verify the new topology:

# On server2
sudo -u postgres repmgr -f /etc/repmgr.conf cluster show

Expected: server2 is now primary, server1 shows as unavailable.


Step 10 — Rejoin the Failed Node as a Standby

After the failover, server1 needs to be reintegrated.
server2 may have accepted writes while server1 was down, so the two data directories may have diverged.
repmgr uses pg_rewind to sync server1 back to the new primary's timeline before starting replication.

On server1, verify PostgreSQL is stopped, then run:

# On server1
sudo systemctl status postgresql@18-main
# Expected: inactive (dead) — if active, stop it first: sudo systemctl stop postgresql

sudo -u postgres repmgr node rejoin \
  -d 'host=192.168.0.182 user=repmgr dbname=repmgr' \
  -f /etc/repmgr.conf \
  --force-rewind \
  --verbose

--force-rewind tells repmgr to run pg_rewind regardless of the timeline divergence check.
On Ubuntu 24.04, the rejoin sometimes times out before PostgreSQL finishes starting as a standby.
If the command exits with a timeout message but the log shows pg_rewind completed successfully, start PostgreSQL manually:

sudo systemctl start postgresql

Register server1 in the repmgr metadata and start the repmgrd daemon.
The --force flag is required because server1 was previously registered as the primary:

sudo -u postgres repmgr -f /etc/repmgr.conf standby register --force
sudo -u postgres repmgrd -f /etc/repmgr.conf --daemonize

Verify both nodes are running:

sudo -u postgres repmgr -f /etc/repmgr.conf cluster show

Expected: server1 listed as standby, server2 as primary.


Frequently Asked Questions

Does repmgr work with PostgreSQL 18?

Yes.
repmgr 5.5.0 supports PostgreSQL 18.
Install from the PGDG repository using the postgresql-18-repmgr package.

Why does my planned switchover fail with “primary shutdown could not be confirmed”?

The service_stop_command parameter is not set in /etc/repmgr.conf.
repmgr needs to stop the current primary via SSH during a switchover and requires an explicit command to do so.
Add service_stop_command='sudo systemctl stop postgresql' to /etc/repmgr.conf on both nodes.

Why does repmgrd not trigger failover even though the primary is down?

repmgrd is paused.
The daemon pauses itself after a failed failover attempt to prevent cascading promotion in split-brain scenarios.
Run repmgr daemon status to check the Paused? column, and repmgr daemon unpause to resume monitoring.

Does the demoted node restart automatically after a switchover?

No.
repmgr stops the old primary during switchover but does not restart it.
The official repmgr documentation states: The original primary will be shut down in any case, and will need to be manually reintegrated into the replication cluster.
After the switchover completes, run sudo systemctl start postgresql on the demoted node to bring it back as a standby.

What is pg_rewind and why is it needed for node rejoin?

pg_rewind resynchronises a PostgreSQL data directory that has diverged from the primary timeline.
This happens after failover — the old primary may have committed transactions that were not replicated before it failed.
pg_rewind replaces those diverged blocks with the correct data from the new primary, allowing the node to resume replication without a full base backup.
pg_rewind requires either wal_log_hints = on in postgresql.conf or data checksums enabled on the cluster; this guide sets wal_log_hints.


Summary

PostgreSQL streaming replication is built in; automatic failover requires an external tool such as repmgr.
The setup involves six moving parts that must all be correct: PostgreSQL configuration, repmgr configuration, SSH keys between the postgres users, passwordless sudo for service management, replication slots, and repmgrd running and unpaused.
The most common failure points are the sudoers file (wrong systemctl path or unsafe permissions; use 440) and the missing service_stop_command for switchovers.

If you are building a PostgreSQL high-availability cluster for a production environment and want a second opinion on the architecture before you commit, get in touch.
