Recovering from PostgreSQL-HA Failure on Kubernetes

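The deployment consists of these pods:

postgresql-ha-pgpool-74c8574dbf-skdwx
postgresql-ha-postgresql-0
postgresql-ha-postgresql-1

and these services:

postgresql-ha-pgpool
postgresql-ha-postgresql
postgresql-ha-postgresql-headless

The log of the failing PostgreSQL pod shows repmgrd refusing to start because the node is marked as inactive:
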
postgresql-repmgr 18:20:05.54 INFO  ==> Starting PostgreSQL in background...
waiting for server to start....2020-09-16 18:20:05.612 GMT [120] LOG: listening on IPv4 address "0.0.0.0", port 5432
2020-09-16 18:20:05.613 GMT [120] LOG: listening on IPv6 address "::", port 5432
2020-09-16 18:20:05.616 GMT [120] LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
2020-09-16 18:20:05.640 GMT [120] LOG: redirecting log output to logging collector process
2020-09-16 18:20:05.640 GMT [120] HINT: Future log output will appear in directory "/opt/bitnami/postgresql/logs".
2020-09-16 18:20:05.643 GMT [122] LOG: database system was interrupted while in recovery at log time 2020-09-16 15:04:55 GMT
2020-09-16 18:20:05.643 GMT [122] HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
2020-09-16 18:20:05.769 GMT [122] LOG: entering standby mode
2020-09-16 18:20:05.782 GMT [122] LOG: redo starts at 0/24000028
2020-09-16 18:20:05.782 GMT [122] LOG: consistent recovery state reached at 0/25000000
2020-09-16 18:20:05.783 GMT [120] LOG: database system is ready to accept read only connections
2020-09-16 18:20:05.793 GMT [126] LOG: started streaming WAL from primary at 0/25000000 on timeline 4
done
server started
postgresql-repmgr 18:20:05.87 INFO ==> ** Starting repmgrd **
[2020-09-16 18:20:05] [NOTICE] repmgrd (repmgrd 5.1.0) starting up
[2020-09-16 18:20:05] [ERROR] this node is marked as inactive and cannot be used as a failover target
[2020-09-16 18:20:05] [HINT] Check that "repmgr (primary|standby) register" was executed for this node
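
To confirm that a pod is failing for this reason rather than another, one can search its log for the repmgrd error shown above. This is a minimal sketch: the function name is made up for illustration, and the kubectl call in the comment assumes you have access to the cluster.

```shell
# Hypothetical helper: succeeds if the log text on stdin contains the
# repmgrd "inactive node" error from the log above.
has_inactive_node_error() {
  grep -q 'this node is marked as inactive and cannot be used as a failover target'
}

# Example (needs cluster access):
#   kubectl logs postgresql-ha-postgresql-0 | has_inactive_node_error \
#     && echo "node is marked inactive"
```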

1) Keep the pod out of CrashLoopBackOff

Edit the StatefulSet:

kubectl edit statefulsets.apps postgresql-ha-postgresql

In the pod template, add a command that keeps the container alive without starting PostgreSQL:

spec:
  containers:
  - command: ['sh', '-c', 'echo The app is running! && sleep 3600']
    env:
    - name: BITNAMI_DEBUG
      value: "true"
    - name: POSTGRESQL_VOLUME_DIR
      value: /bitnami/postgresql
    - name: PGDATA
      value: /bitnami/postgresql/data
    - name: POSTGRES_USER
      value: postgres
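
The same override can probably be applied without an interactive editor by sending a JSON patch. The sketch below assumes the PostgreSQL container is the first container in the pod template (index 0). The function names and the backup file name are hypothetical. It also saves a copy of the current spec first, as a reference for undoing the change.

```shell
STS=postgresql-ha-postgresql

# Print a JSON patch that sets the container command to a long sleep.
# Assumption: the PostgreSQL container is containers[0] in the pod template.
sleep_patch() {
  cat <<'EOF'
[{"op": "add",
  "path": "/spec/template/spec/containers/0/command",
  "value": ["sh", "-c", "echo The app is running! && sleep 3600"]}]
EOF
}

# Save the current spec for reference, then apply the patch.
apply_sleep_patch() {
  kubectl get statefulset "$STS" -o yaml > "${STS}-backup.yaml" || return 1
  kubectl patch statefulset "$STS" --type=json -p "$(sleep_patch)"
}
```

Running apply_sleep_patch from a scratch directory leaves postgresql-ha-postgresql-backup.yaml next to you.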

2) Remove Health Checks

While still editing that StatefulSet, remove the livenessProbe and the readinessProbe, so that Kubernetes neither restarts the sleeping container nor marks it unready:

livenessProbe:
  exec:
    command:
    - sh
    - -c
    - PGPASSWORD=$POSTGRES_PASSWORD psql -w -U "postgres" -d "postgres" -h
      127.0.0.1 -c "SELECT 1"
  failureThreshold: 6
  initialDelaySeconds: 30
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 5
readinessProbe:
  exec:
    command:
    - sh
    - -c
    - PGPASSWORD=$POSTGRES_PASSWORD psql -w -U "postgres" -d "postgres" -h
      127.0.0.1 -c "SELECT 1"
  failureThreshold: 6
  initialDelaySeconds: 5
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 5
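
As a non-interactive alternative, both probes can likely be dropped with one JSON patch. This is a sketch under the assumption that the PostgreSQL container is containers[0] in the pod template. The function names are hypothetical.

```shell
STS=postgresql-ha-postgresql

# Print a JSON patch that removes both probes from containers[0].
# Note: a JSON patch "remove" operation fails if the path does not exist.
remove_probes_patch() {
  cat <<'EOF'
[{"op": "remove", "path": "/spec/template/spec/containers/0/livenessProbe"},
 {"op": "remove", "path": "/spec/template/spec/containers/0/readinessProbe"}]
EOF
}

remove_probes() {
  kubectl patch statefulset "$STS" --type=json -p "$(remove_probes_patch)"
}
```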

3) Become the Pod

With those changes made, you will need to delete the pod manually so that it is recreated from the updated spec. Once it is back, open a shell inside it:

kubectl delete po postgresql-ha-postgresql-0
kubectl exec postgresql-ha-postgresql-0 -it -- /bin/bash
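
Running exec immediately after the delete can fail, because the replacement pod may not exist or be ready yet. This is a hedged sketch of a wait step. The function name, retry count, and timeouts are arbitrary choices, not values from the chart.

```shell
# Hypothetical helper: delete a pod, wait for the StatefulSet controller to
# recreate it, then wait until it reports Ready.
recreate_and_wait() {
  pod="$1"
  kubectl delete pod "$pod" || return 1
  tries=0
  # Poll until the replacement pod object exists (up to about 60 seconds).
  until kubectl get pod "$pod" >/dev/null 2>&1; do
    tries=$((tries + 1))
    if [ "$tries" -ge 30 ]; then return 1; fi
    sleep 2
  done
  kubectl wait --for=condition=Ready "pod/$pod" --timeout=120s
}

# Example:
#   recreate_and_wait postgresql-ha-postgresql-0 \
#     && kubectl exec postgresql-ha-postgresql-0 -it -- /bin/bash
```

With the probes removed, the pod should report Ready as soon as its sleeping container is running.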

4) Fix what you broke

Inside the pod, run the container's startup script, then ask repmgr for the cluster state:

/opt/bitnami/scripts/postgresql-repmgr/run.sh
/opt/bitnami/scripts/postgresql-repmgr/entrypoint.sh repmgr -f /opt/bitnami/repmgr/conf/repmgr.conf cluster show
postgresql-repmgr 18:37:20.91 
postgresql-repmgr 18:37:20.91 Welcome to the Bitnami postgresql-repmgr container
postgresql-repmgr 18:37:20.92 Subscribe to project updates by watching https://github.com/bitnami/bitnami-docker-postgresql-repmgr
postgresql-repmgr 18:37:20.92 Submit issues and feature requests at https://github.com/bitnami/bitnami-docker-postgresql-repmgr/issues
postgresql-repmgr 18:37:20.92
postgresql-repmgr 18:37:20.93 DEBUG ==> Configuring libnss_wrapper...
 ID   | Name                       | Role    | Status               | Upstream | Location | Priority | Timeline | Connection string
------+----------------------------+---------+----------------------+----------+----------+----------+----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 1000 | postgresql-ha-postgresql-0 | primary | ! running as standby |          | default  | 100      | 4        | user=repmgr password=sMOIs9AZow host=postgresql-ha-postgresql-0.postgresql-ha-postgresql-headless.lsd-automate.svc.cluster.local dbname=repmgr port=5432 connect_timeout=5
 1001 | postgresql-ha-postgresql-1 | primary | ? unreachable        | ?        | default  | 100      |          | user=repmgr password=sMOIs9AZow host=postgresql-ha-postgresql-1.postgresql-ha-postgresql-headless.lsd-automate.svc.cluster.local dbname=repmgr port=5432 connect_timeout=5
WARNING: following issues were detected
- node "postgresql-ha-postgresql-0" (ID: 1000) is registered as an inactive primary but running as standby
- unable to connect to node "postgresql-ha-postgresql-1" (ID: 1001)
- node "postgresql-ha-postgresql-1" (ID: 1001) is registered as an active primary but is unreachable
HINT: execute with --verbose option to see connection error messages
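
If you want to pull the problem nodes out of that table in a script, the pipe-separated layout is easy to split with awk. This is a minimal sketch. unhealthy_nodes is a hypothetical name, and it assumes a healthy node shows the status "* running" (primary) or "running" (standby).

```shell
# Hypothetical helper: read "repmgr cluster show" output on stdin and print
# "name: status" for every node whose status is not a plain running state.
unhealthy_nodes() {
  awk -F'|' '
    {
      id = $1
      gsub(/^[ \t]+|[ \t]+$/, "", id)
    }
    id ~ /^[0-9]+$/ {
      # Only data rows start with a numeric node ID.
      name = $2
      status = $4
      gsub(/^[ \t]+|[ \t]+$/, "", name)
      gsub(/^[ \t]+|[ \t]+$/, "", status)
      if (status != "* running" && status != "running")
        print name ": " status
    }'
}

# Example (inside the pod):
#   /opt/bitnami/scripts/postgresql-repmgr/entrypoint.sh repmgr \
#     -f /opt/bitnami/repmgr/conf/repmgr.conf cluster show | unhealthy_nodes
```

Banner, header, and separator lines are skipped because their first field is not a number.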

Next, check the node itself:

/opt/bitnami/scripts/postgresql-repmgr/entrypoint.sh repmgr -f /opt/bitnami/repmgr/conf/repmgr.conf node check
postgresql-repmgr 18:41:05.32 
postgresql-repmgr 18:41:05.32 Welcome to the Bitnami postgresql-repmgr container
postgresql-repmgr 18:41:05.32 Subscribe to project updates by watching https://github.com/bitnami/bitnami-docker-postgresql-repmgr
postgresql-repmgr 18:41:05.32 Submit issues and feature requests at https://github.com/bitnami/bitnami-docker-postgresql-repmgr/issues
postgresql-repmgr 18:41:05.33
postgresql-repmgr 18:41:05.33 DEBUG ==> Configuring libnss_wrapper...
Node "postgresql-ha-postgresql-0":
Server role: CRITICAL (node is registered as primary but running as standby)
Replication lag: OK (0 seconds)
WAL archiving: OK (0 pending archive ready files)
Upstream connection: CRITICAL (node "postgresql-ha-postgresql-0" (ID: 1000) is a standby but no upstream record found)
Downstream servers: OK (this node has no downstream nodes)
Replication slots: OK (node has no physical replication slots)
Missing physical replication slots: OK (node has no missing physical replication slots)
Configured data directory: OK (configured "data_directory" is "/bitnami/postgresql/data")

repmgr has the node registered as primary while PostgreSQL is running as a standby, so promote it:

/opt/bitnami/scripts/postgresql-repmgr/entrypoint.sh repmgr -f /opt/bitnami/repmgr/conf/repmgr.conf standby promote
postgresql-repmgr 18:42:13.44 
postgresql-repmgr 18:42:13.44 Welcome to the Bitnami postgresql-repmgr container
postgresql-repmgr 18:42:13.44 Subscribe to project updates by watching https://github.com/bitnami/bitnami-docker-postgresql-repmgr
postgresql-repmgr 18:42:13.45 Submit issues and feature requests at https://github.com/bitnami/bitnami-docker-postgresql-repmgr/issues
postgresql-repmgr 18:42:13.45
postgresql-repmgr 18:42:13.46 DEBUG ==> Configuring libnss_wrapper...
NOTICE: promoting standby to primary
DETAIL: promoting server "postgresql-ha-postgresql-0" (ID: 1000) using "/opt/bitnami/postgresql/bin/pg_ctl -w -D '/bitnami/postgresql/data' promote"
waiting for server to promote..... done
server promoted
NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
NOTICE: STANDBY PROMOTE successful
DETAIL: server "postgresql-ha-postgresql-0" (ID: 1000) was successfully promoted to primary
[REPMGR EVENT] Node id: 1000; Event type: standby_promote; Success [1|0]: 1; Time: 2020-09-16 18:42:14.771401+00; Details: server "postgresql-ha-postgresql-0" (ID: 1000) was successfully promoted to primary
Looking for the script: /opt/bitnami/repmgr/events/execs/standby_promote.sh
[REPMGR EVENT] will execute script '/opt/bitnami/repmgr/events/execs/standby_promote.sh' for the event
[REPMGR EVENT::standby_promote] Node id: 1000; Event type: standby_promote; Success [1|0]: 1; Time: 2020-09-16 18:42:14.771401+00; Details: server "postgresql-ha-postgresql-0" (ID: 1000) was successfully promoted to primary
[REPMGR EVENT::standby_promote] Locking primary...
[REPMGR EVENT::standby_promote] Unlocking standby...
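
After the promotion, PostgreSQL itself should report that it has left recovery. This is a minimal sketch that asks the server directly, reusing the connection flags from the chart's probes. is_primary is a hypothetical name, and the function is meant to run inside the pod, where POSTGRES_PASSWORD is set.

```shell
# Hypothetical helper: succeeds if the local server is not in recovery,
# i.e. it is acting as a primary. Returns 2 if psql itself fails.
is_primary() {
  in_recovery=$(PGPASSWORD="${POSTGRES_PASSWORD:-}" psql -w -U postgres -d postgres \
    -h 127.0.0.1 -t -A -c 'SELECT pg_is_in_recovery()') || return 2
  # pg_is_in_recovery() prints "f" on a primary and "t" on a standby.
  [ "$in_recovery" = "f" ]
}

# Example:
#   is_primary && echo "this node is a primary"
```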

Check the cluster again:

/opt/bitnami/scripts/postgresql-repmgr/entrypoint.sh repmgr -f /opt/bitnami/repmgr/conf/repmgr.conf cluster show
postgresql-repmgr 18:42:28.78 
postgresql-repmgr 18:42:28.78 Welcome to the Bitnami postgresql-repmgr container
postgresql-repmgr 18:42:28.79 Subscribe to project updates by watching https://github.com/bitnami/bitnami-docker-postgresql-repmgr
postgresql-repmgr 18:42:28.79 Submit issues and feature requests at https://github.com/bitnami/bitnami-docker-postgresql-repmgr/issues
postgresql-repmgr 18:42:28.79
postgresql-repmgr 18:42:28.80 DEBUG ==> Configuring libnss_wrapper...
 ID   | Name                       | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
------+----------------------------+---------+-----------+----------+----------+----------+----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 1000 | postgresql-ha-postgresql-0 | primary | * running |          | default  | 100      | 5        | user=repmgr password=sMOIs9AZow host=postgresql-ha-postgresql-0.postgresql-ha-postgresql-headless.lsd-automate.svc.cluster.local dbname=repmgr port=5432 connect_timeout=5
 1001 | postgresql-ha-postgresql-1 | primary | - failed  | ?        | default  | 100      |          | user=repmgr password=sMOIs9AZow host=postgresql-ha-postgresql-1.postgresql-ha-postgresql-headless.lsd-automate.svc.cluster.local dbname=repmgr port=5432 connect_timeout=5
WARNING: following issues were detected
- unable to connect to node "postgresql-ha-postgresql-1" (ID: 1001)
HINT: execute with --verbose option to see connection error messages

5) Return everything back the way you found it

Edit your StatefulSet again:

kubectl edit statefulsets.apps postgresql-ha-postgresql

Remove the command override:

command: ['sh', '-c', 'echo The app is running! && sleep 3600']

Put the livenessProbe and the readinessProbe back:

livenessProbe:
  exec:
    command:
    - sh
    - -c
    - PGPASSWORD=$POSTGRES_PASSWORD psql -w -U "postgres" -d "postgres" -h
      127.0.0.1 -c "SELECT 1"
  failureThreshold: 6
  initialDelaySeconds: 30
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 5
readinessProbe:
  exec:
    command:
    - sh
    - -c
    - PGPASSWORD=$POSTGRES_PASSWORD psql -w -U "postgres" -d "postgres" -h
      127.0.0.1 -c "SELECT 1"
  failureThreshold: 6
  initialDelaySeconds: 5
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 5

Then delete the pods so that they restart with the original configuration:

kubectl delete po postgresql-ha-postgresql-1 postgresql-ha-postgresql-0 postgresql-ha-pgpool-74c8574dbf-vnjtb
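
The Pgpool pod name ends in a generated suffix that changes whenever the pod is recreated (compare the name at the top of this article with the one in the command above), so copying it by hand is error-prone. This is a sketch of a variant that avoids the suffix. It assumes Pgpool runs as a Deployment called postgresql-ha-pgpool, which is inferred from the pod name, and that your kubectl supports rollout restart (1.15 or later). The function name and timeout are hypothetical.

```shell
# Hypothetical helper: restart both PostgreSQL pods by their stable names,
# then restart Pgpool through its Deployment and wait for it to come back.
restart_all() {
  kubectl delete pod postgresql-ha-postgresql-1 postgresql-ha-postgresql-0 &&
  kubectl rollout restart deployment/postgresql-ha-pgpool &&
  kubectl rollout status deployment/postgresql-ha-pgpool --timeout=300s
}
```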

What is LSD?

If you saw the word “LSD” and were curious... well...

Neil White


Mech Warrior Overlord @ LSD. I spend my days killing Kubernetes, operating Openshift, hollering at Helm, vanquishing Vaults and conquering Clouds!