TL;DR: A repmgr cluster handles automatic failover, but applications still need to know which node is the current primary.
Keepalived solves this with a floating Virtual IP (VIP) that moves automatically to the node holding the primary role.
This guide adds a VIP to an existing PostgreSQL 18 + repmgr cluster on Ubuntu 24.04 using Keepalived 2.x.
Every step was run live on a real cluster and the output was verified.
Your repmgr cluster fails over in 60 seconds.
Your application still points at the old primary’s IP.
A floating VIP solves this: one stable address that always connects to the current primary, regardless of which physical node that is.
Keepalived implements this using VRRP — a standard protocol designed for exactly this purpose.
This guide adds a VIP to a working two-node PostgreSQL + repmgr cluster.
If you do not have that cluster yet, follow the repmgr setup guide first, then come back here.
How It Works
Keepalived runs a health check script on each node every 2 seconds.
The script connects to the local PostgreSQL instance via Unix socket and queries pg_is_in_recovery().
If the node is the primary (the function returns f), the script exits 0 — Keepalived keeps or claims the VIP.
If the node is a standby or PostgreSQL is unreachable (the function returns t or the connection fails), the script exits 1 — Keepalived releases the VIP.
The node with the highest effective priority holds the VIP.
Server1 has a base priority of 100, server2 has a base priority of 90.
The health check script is configured with a weight of -50.
When a node’s script fails, its effective priority drops by 50: server1 drops from 100 to 50, server2 from 90 to 40.
The primary node always wins — its script succeeds and its effective priority stays at its base value.
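The priority arithmetic above can be checked with a few lines of shell. This is only an illustrative sketch, not part of the Keepalived setup: the function names `effective_priority` and `vip_holder` are hypothetical, and the numbers are the base priorities (100 and 90) and the weight (-50) used in this guide.

```shell
#!/bin/bash
# Illustrative only: models the priority arithmetic described in this guide.
# effective_priority and vip_holder are hypothetical helper names.

# Usage: effective_priority BASE WEIGHT SCRIPT_OK   (SCRIPT_OK: 1 = check passes, 0 = check fails)
effective_priority() {
  local base=$1 weight=$2 script_ok=$3
  if [ "$script_ok" -eq 1 ]; then
    echo "$base"                  # passing check: base priority is unchanged
  else
    echo $(( base + weight ))     # failing check: the negative weight is added to the base
  fi
}

# Usage: vip_holder SERVER1_OK SERVER2_OK
vip_holder() {
  local p1 p2
  p1=$(effective_priority 100 -50 "$1")
  p2=$(effective_priority 90 -50 "$2")
  # With these values the two results are never equal, so no tie-break is modelled.
  if [ "$p1" -gt "$p2" ]; then
    echo "server1 ($p1 vs $p2)"
  else
    echo "server2 ($p2 vs $p1)"
  fi
}

vip_holder 1 0   # server1 is primary, server2 is standby
vip_holder 0 1   # server2 is primary, server1 is standby or down
```

The first call prints `server1 (100 vs 40)` and the second prints `server2 (90 vs 50)`, which matches the rule that the node whose check passes keeps its base priority and outranks the other.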
The Environment
Prerequisites: a working two-node PostgreSQL 18 + repmgr cluster.
| Host | IP | Role |
|---|---|---|
| server1 (Ubuntu 24.04) | 192.168.0.181 | PostgreSQL node |
| server2 (Ubuntu 24.04) | 192.168.0.182 | PostgreSQL node |
| VIP | 192.168.0.180 | Floats to the current primary |
Step 1 — Install Keepalived on Both Servers
On server1:
# On server1
sudo apt update
sudo apt install -y keepalived
keepalived --version
# Expected: Keepalived v2.x.x
On server2:
# On server2
sudo apt update
sudo apt install -y keepalived
keepalived --version
Step 2 — Create the Health Check Script on Both Servers
The script connects to the local PostgreSQL instance via Unix socket and checks whether the node is the primary.
It exits 0 on the primary, 1 on a standby or if PostgreSQL is unreachable.
One pitfall matters here: Keepalived runs this script as the postgres OS user, configured via script_user postgres postgres in global_defs.
Because the script already runs as postgres, call psql directly; do not use runuser or sudo -u postgres.
runuser requires root and will fail silently when called by a non-root user, causing the script to always exit 1 on both nodes and the VIP to behave incorrectly.
On server1:
# On server1
sudo tee /usr/local/bin/check_postgres_primary.sh > /dev/null << 'EOF'
#!/bin/bash
result=$(psql -t -c "SELECT pg_is_in_recovery();" 2>/dev/null | tr -d '[:space:]')
[ "$result" = "f" ]
EOF
# The script must be executable — Keepalived will not run it otherwise
sudo chmod +x /usr/local/bin/check_postgres_primary.sh
Test the script manually as the postgres user:
# On server1
sudo -u postgres /usr/local/bin/check_postgres_primary.sh
echo $?
# Expected: 0 if server1 is currently the primary, 1 if it is a standby
# Always test as postgres — running as root will give a different result
On server2:
# On server2
sudo tee /usr/local/bin/check_postgres_primary.sh > /dev/null << 'EOF'
#!/bin/bash
result=$(psql -t -c "SELECT pg_is_in_recovery();" 2>/dev/null | tr -d '[:space:]')
[ "$result" = "f" ]
EOF
sudo chmod +x /usr/local/bin/check_postgres_primary.sh
sudo -u postgres /usr/local/bin/check_postgres_primary.sh
echo $?
# Expected: 1 — server2 is currently a standby
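The comparison at the heart of the health check can be tried without a database. The sketch below applies the same whitespace stripping and string test to sample strings instead of live psql output; `classify_recovery_output` is a hypothetical name used only for illustration.

```shell
#!/bin/bash
# Illustrative only: the same whitespace-stripping and comparison the health
# check performs, applied to a string instead of live psql output.
# classify_recovery_output is a hypothetical name.
classify_recovery_output() {
  local result
  result=$(printf '%s' "$1" | tr -d '[:space:]')
  [ "$result" = "f" ]    # exit status 0 only when the stripped output is exactly "f"
}

# ' f' and ' t' mimic padded psql output; '' mimics a failed connection.
for sample in ' f' ' t' ''; do
  if classify_recovery_output "$sample"; then
    echo "'$sample' -> exit 0 (primary)"
  else
    echo "'$sample' -> exit 1 (standby or unreachable)"
  fi
done
```

Only the exact value f produces exit status 0. An empty string, which is what the script sees when psql cannot connect, is treated the same way as a standby.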
Step 3 — Configure Keepalived on Both Servers
Back up the default config on both servers before overwriting:
sudo cp /etc/keepalived/keepalived.conf /etc/keepalived/keepalived.conf.20260519 2>/dev/null || true
server1 configuration (priority 100)
# On server1
sudo tee /etc/keepalived/keepalived.conf > /dev/null << 'EOF'
global_defs {
# Run health check scripts as the postgres OS user
# This allows the script to connect via Unix socket using peer authentication
script_user postgres postgres
enable_script_security
}
vrrp_script check_postgres {
script "/usr/local/bin/check_postgres_primary.sh"
# Run the check every 2 seconds
interval 2
# If the script fails, subtract 50 from this node's effective priority
# server1 base priority is 100 — on failure it drops to 50, losing to server2 (base 90)
weight -50
# Number of consecutive failures before declaring the script failed
fall 2
# Number of consecutive successes before declaring the script recovered
rise 2
}
vrrp_instance VI_POSTGRES {
state BACKUP
interface enp0s3
virtual_router_id 51
# Base priority — must differ between nodes; server1 is preferred primary candidate
priority 100
advert_int 1
authentication {
auth_type PASS
auth_pass pg_vip_2026
}
virtual_ipaddress {
192.168.0.180/24
}
track_script {
check_postgres
}
}
EOF
server2 configuration (priority 90)
# On server2
sudo tee /etc/keepalived/keepalived.conf > /dev/null << 'EOF'
global_defs {
script_user postgres postgres
enable_script_security
}
vrrp_script check_postgres {
script "/usr/local/bin/check_postgres_primary.sh"
interval 2
# server2 base priority is 90 — on failure it drops to 40
weight -50
fall 2
rise 2
}
vrrp_instance VI_POSTGRES {
state BACKUP
interface enp0s3
virtual_router_id 51
# Lower base priority than server1 — server1 holds the VIP when both are healthy
priority 90
advert_int 1
authentication {
auth_type PASS
auth_pass pg_vip_2026
}
virtual_ipaddress {
192.168.0.180/24
}
track_script {
check_postgres
}
}
EOF
Both nodes use state BACKUP.
Keepalived elects the master dynamically based on effective priority — there is no need to set one node to state MASTER.
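The two configuration files have to agree on virtual_router_id, advert_int, and auth_pass, and must differ in priority. A rough way to compare them is sketched below, assuming both files have been copied to one machine. `conf_value` and `compare_confs` are hypothetical helpers, not Keepalived tools, and the file paths in the example are placeholders.

```shell
#!/bin/bash
# Illustrative only: conf_value and compare_confs are hypothetical helpers,
# not part of Keepalived.

# Usage: conf_value FILE DIRECTIVE
# Prints the first value of a single-valued directive such as "priority 100".
conf_value() {
  awk -v key="$2" '$1 == key { print $2; exit }' "$1"
}

# Usage: compare_confs FILE_A FILE_B
# Returns 0 when the shared settings match and the priorities differ.
compare_confs() {
  local a=$1 b=$2 status=0 key
  for key in virtual_router_id advert_int auth_pass; do
    if [ "$(conf_value "$a" "$key")" != "$(conf_value "$b" "$key")" ]; then
      echo "MISMATCH: $key differs between the two files"
      status=1
    fi
  done
  if [ "$(conf_value "$a" priority)" = "$(conf_value "$b" priority)" ]; then
    echo "PROBLEM: both files use the same priority"
    status=1
  fi
  [ "$status" -eq 0 ] && echo "OK: shared settings match, priorities differ"
  return "$status"
}

# Example (paths are placeholders for wherever the two files were copied):
# compare_confs ./server1-keepalived.conf ./server2-keepalived.conf
```

The awk match compares the first word of each line, so comment lines that merely mention a directive name are skipped.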
Step 4 — Start Keepalived and Verify the VIP
On server1:
# On server1
sudo systemctl enable keepalived
sudo systemctl start keepalived
sudo systemctl status keepalived
# Expected: active (running)
# If failed: sudo journalctl -u keepalived -n 30
On server2:
# On server2
sudo systemctl enable keepalived
sudo systemctl start keepalived
sudo systemctl status keepalived
# Expected: active (running)
Allow 5–10 seconds after startup for the health checks to stabilise, then verify the VIP is on the current primary.
On server1:
# On server1
ip addr show enp0s3 | grep 192.168.0.180
# Expected: inet 192.168.0.180/24 — VIP is present if server1 is the current primary
# If not present and server1 IS the primary: wait 10 seconds and retry
On server2:
# On server2
ip addr show enp0s3 | grep 192.168.0.180
# Expected: no output — server2 is standby and does not hold the VIP
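A plain grep for the VIP treats the dots as wildcards and also matches longer addresses that start with the same digits. If the check is going to be scripted, a stricter match may be worth having. The sketch below is one possible form; `holds_vip` is a hypothetical helper that reads `ip -o -4 addr show` output on standard input.

```shell
#!/bin/bash
# Illustrative only: holds_vip is a hypothetical helper.
# It reads `ip -o -4 addr show` output on stdin and succeeds when the given
# address appears as a complete "inet ADDRESS/" entry.
holds_vip() {
  grep -qF "inet $1/"
}

# Example on a node (interface and VIP are the ones used in this guide):
# ip -o -4 addr show dev enp0s3 | holds_vip 192.168.0.180 && echo "VIP present"
```

The fixed-string match (`-F`) together with the trailing slash means 192.168.0.18 does not match a line that contains 192.168.0.180/24.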
Add the VIP to .pgpass on both servers so repmgr can connect through it without a password prompt:
# On server1
sudo -u postgres bash -c 'echo "192.168.0.180:5432:repmgr:repmgr:repmgr" >> /var/lib/postgresql/.pgpass'
sudo -u postgres bash -c 'echo "192.168.0.180:5432:replication:repmgr:repmgr" >> /var/lib/postgresql/.pgpass'
# On server2
sudo -u postgres bash -c 'echo "192.168.0.180:5432:repmgr:repmgr:repmgr" >> /var/lib/postgresql/.pgpass'
sudo -u postgres bash -c 'echo "192.168.0.180:5432:replication:repmgr:repmgr" >> /var/lib/postgresql/.pgpass'
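Appending with `>>` adds a duplicate line each time the command is re-run, and libpq ignores a .pgpass file whose permissions allow group or world access. A small idempotent variant is sketched below. `add_pgpass_entry` is a hypothetical helper, and the example call assumes it is run as the postgres OS user.

```shell
#!/bin/bash
# Illustrative only: add_pgpass_entry is a hypothetical helper.
# Usage: add_pgpass_entry FILE ENTRY
# Appends ENTRY only if that exact line is not already present, then restricts
# the file to its owner.
add_pgpass_entry() {
  local file=$1 entry=$2
  touch "$file"
  grep -qxF -- "$entry" "$file" || printf '%s\n' "$entry" >> "$file"
  chmod 600 "$file"
}

# Example, run as the postgres OS user (path and entry are the ones from this guide):
# add_pgpass_entry /var/lib/postgresql/.pgpass "192.168.0.180:5432:repmgr:repmgr:repmgr"
```

Running the function twice with the same entry leaves a single copy of the line in the file.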
Verify the VIP is reachable and connects to the primary:
# On server1 or server2
ping -c 3 192.168.0.180
# Expected: replies from 192.168.0.180
# Connect to PostgreSQL via VIP — must run as postgres OS user to use .pgpass
sudo -u postgres psql -h 192.168.0.180 -U repmgr -d repmgr -c "SELECT pg_is_in_recovery(), inet_server_addr();"
# Expected: f (false) | 192.168.0.180 — connected to the primary through the VIP
Step 5 — Test VIP Failover
Note which node currently holds the VIP:
# On server1
ip addr show enp0s3 | grep 192.168.0.180
# Record which node holds the VIP before triggering the failover
Stop PostgreSQL on the primary to trigger automatic failover:
# On server1
sudo systemctl stop postgresql
# repmgrd on server2 will detect this and promote server2 after ~60 seconds
Watch Keepalived on server2 pick up the VIP:
# On server2
sudo journalctl -u keepalived -f
# Expected sequence:
# Script check_postgres_primary.sh succeeded — server2 now primary after promotion
# VRRP_Instance(VI_POSTGRES) Entering MASTER STATE
Verify the VIP has moved:
# On server2
ip addr show enp0s3 | grep 192.168.0.180
# Expected: inet 192.168.0.180/24 — VIP is now on server2
sudo -u postgres psql -h 192.168.0.180 -U repmgr -d repmgr -c "SELECT pg_is_in_recovery(), inet_server_addr();"
# Expected: f | 192.168.0.180 (inet_server_addr() reports the address the connection arrived on, which is the VIP); the VIP now connects to server2, the new primary
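To put a number on how long the VIP takes to move during this test, a generic polling helper can time any check command. The sketch below is one way to do it; `wait_until` is a hypothetical helper, and the timeout and polling interval in the example are arbitrary choices.

```shell
#!/bin/bash
# Illustrative only: wait_until is a hypothetical helper.
# Usage: wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it succeeds or the timeout passes, then reports the
# elapsed time in whole seconds.
wait_until() {
  local timeout=$1 interval=$2 start now
  shift 2
  start=$(date +%s)
  until "$@" >/dev/null 2>&1; do
    now=$(date +%s)
    if [ $(( now - start )) -ge "$timeout" ]; then
      echo "gave up after $(( now - start ))s"
      return 1
    fi
    sleep "$interval"
  done
  now=$(date +%s)
  echo "succeeded after $(( now - start ))s"
}

# Example, started on server2 right after stopping PostgreSQL on server1
# (interface and VIP are the ones used in this guide):
# wait_until 120 1 sh -c 'ip -o -4 addr show dev enp0s3 | grep -qF "inet 192.168.0.180/"'
```

The reported time is measured from when the helper starts, so start it as close as possible to the moment PostgreSQL is stopped.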
Step 6 — Test VIP Switchover
A clean switchover also moves the VIP.
The old primary becomes a standby, its health check script starts failing, and the VIP moves to the new primary.
Before running a switchover, re-integrate the failed node from the previous test as a standby.
On the standby (the node that will become the new primary):
# On server1 (assuming server1 is currently standby)
sudo -u postgres repmgr standby switchover
During switchover, repmgr stops the old primary via SSH but does not restart it automatically.
Start PostgreSQL manually on the demoted node when the switchover output shows “waiting for node X to connect”:
# On the demoted node (the old primary)
sudo systemctl start postgresql
Verify the VIP moved back to server1:
# On server1
ip addr show enp0s3 | grep 192.168.0.180
# Expected: inet 192.168.0.180/24 — VIP is back on server1
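sudo -u postgres psql -h 192.168.0.180 -U repmgr -d repmgr -c "SELECT pg_is_in_recovery(), inet_server_addr();"
# Expected: f | 192.168.0.180 (inet_server_addr() reports the address the connection arrived on, which is the VIP); the VIP connects to server1, now the primary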
Frequently Asked Questions
Why does the health check script use psql directly instead of sudo -u postgres psql?
Keepalived runs the script as the postgres OS user via script_user postgres postgres in global_defs.
The script already runs as postgres, so calling psql directly connects via Unix socket using peer authentication.
Using runuser or sudo -u postgres inside the script will fail silently: runuser requires root, which the script does not have, and the postgres account typically has no sudo privileges.
The result is that both nodes always exit 1, and neither holds the VIP correctly.
Why do both nodes use state BACKUP instead of one using state MASTER?
With a weight-based health check, the VRRP master is determined dynamically by effective priority, not by the static state directive.
If one node is set to state MASTER, it will claim the VIP at startup regardless of the health check result, causing a race condition during the first few seconds.
Using state BACKUP on both nodes lets the health check script determine which node should hold the VIP from the start.
How long does it take for the VIP to move after a failover?
The VIP moves once two conditions are met: repmgrd has promoted the standby to primary (approximately 60 seconds with default settings), and the new primary's health check has passed often enough for Keepalived to count it as recovered (up to rise 2 × interval 2 = 4 seconds).
Total time from primary failure to VIP moving is approximately 60–70 seconds with the configuration in this guide.
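The estimate can be reproduced with simple arithmetic. The sketch below assumes the roughly 60-second promotion delay comes from repmgrd's reconnect_attempts and reconnect_interval settings, taken here as 6 and 10. Those two values are an assumption and should be checked against repmgr.conf on your own cluster. The rise and interval values are the ones from this guide.

```shell
#!/bin/bash
# Illustrative only: a back-of-the-envelope estimate, not a measurement.

# Assumed repmgrd values (check repmgr.conf on your own cluster)
reconnect_attempts=6
reconnect_interval=10

# Keepalived values from this guide
rise=2
interval=2

promotion_delay=$(( reconnect_attempts * reconnect_interval ))
vip_delay=$(( rise * interval ))

echo "promotion after ~${promotion_delay}s, VIP claimed up to ${vip_delay}s later, total ~$(( promotion_delay + vip_delay ))s"
```

With these inputs the script reports a total of about 64 seconds. The promotion itself and the VRRP election add a little on top, which is consistent with the 60–70 second range quoted above.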
Does the VIP move during a planned switchover?
Yes.
When repmgr standby switchover demotes the old primary, that node's health check script starts returning 1 (because the node is now a standby).
Keepalived detects the change within fall 2 × interval 2 = 4 seconds and moves the VIP to the new primary.
What is VRRP and why is it used here?
VRRP (Virtual Router Redundancy Protocol) is a standard network protocol (RFC 5798) designed to provide automatic assignment of IP routers to participating hosts.
Keepalived implements VRRP to manage the floating VIP — multiple nodes participate in a VRRP group and elect a master based on priority.
The master holds the VIP; if the master fails or its priority drops below another node's, a new master is elected and the VIP moves.
Summary
Keepalived adds a floating VIP to an existing PostgreSQL + repmgr cluster with minimal configuration.
The health check script is the critical component: it must run as the postgres OS user and call psql directly, not through runuser or sudo.
With fall 2, rise 2, and interval 2, the VIP moves within about 4 seconds of the role change, which happens automatically after repmgrd promotes the standby.
If you are designing a PostgreSQL high-availability architecture and want a second opinion before going to production, get in touch.
