Recovering from PostgreSQL-HA Failure on Kubernetes
I am writing this article so that others do not have to go through the pain I just experienced after I rebooted a couple of Kubernetes nodes because the time had drifted substantially on all of them.
We have a project running on Kubernetes that is composed of many services, all sharing a central, highly available PostgreSQL database. We first detected the time drift issue when gRPC on GitLab wasn't working. Once we corrected the NTP settings, I issued a reboot on the nodes, and that is where our story begins…
Disclaimer!
I am in no way a PostgreSQL expert. If there is a better way of solving this problem, please share, as I couldn’t find the answer.
Because we have a central PostgreSQL that multiple services use, we opted to deploy the Bitnami PostgreSQL-HA Helm chart. It was quick and simple and I didn't take too much time digging into it. Once deployed it provides the following pods
postgresql-ha-pgpool-74c8574dbf-skdwx
postgresql-ha-postgresql-0
postgresql-ha-postgresql-1
With the following Kubernetes services
postgresql-ha-pgpool
postgresql-ha-postgresql
postgresql-ha-postgresql-headless
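For reference, getting to this point is a single chart install. A minimal sketch (the release name postgresql-ha is an assumption; adjust it and any values to your environment):
helm repo add bitnami https://charts.bitnami.com/bitnami
helm install postgresql-ha bitnami/postgresql-ha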
After I rebooted the nodes, all three pods were going into CrashLoopBackOff. When I looked at the logs, I saw the following error
postgresql-repmgr 18:20:05.54 INFO ==> Starting PostgreSQL in background...
waiting for server to start....2020-09-16 18:20:05.612 GMT [120] LOG: listening on IPv4 address "0.0.0.0", port 5432
2020-09-16 18:20:05.613 GMT [120] LOG: listening on IPv6 address "::", port 5432
2020-09-16 18:20:05.616 GMT [120] LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
2020-09-16 18:20:05.640 GMT [120] LOG: redirecting log output to logging collector process
2020-09-16 18:20:05.640 GMT [120] HINT: Future log output will appear in directory "/opt/bitnami/postgresql/logs".
2020-09-16 18:20:05.643 GMT [122] LOG: database system was interrupted while in recovery at log time 2020-09-16 15:04:55 GMT
2020-09-16 18:20:05.643 GMT [122] HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
2020-09-16 18:20:05.769 GMT [122] LOG: entering standby mode
2020-09-16 18:20:05.782 GMT [122] LOG: redo starts at 0/24000028
2020-09-16 18:20:05.782 GMT [122] LOG: consistent recovery state reached at 0/25000000
2020-09-16 18:20:05.783 GMT [120] LOG: database system is ready to accept read only connections
2020-09-16 18:20:05.793 GMT [126] LOG: started streaming WAL from primary at 0/25000000 on timeline 4
done
server started
postgresql-repmgr 18:20:05.87 INFO ==> ** Starting repmgrd **
[2020-09-16 18:20:05] [NOTICE] repmgrd (repmgrd 5.1.0) starting up
[2020-09-16 18:20:05] [ERROR] this node is marked as inactive and cannot be used as a failover target
[2020-09-16 18:20:05] [HINT] Check that "repmgr (primary|standby) register" was executed for this node
So the primary node has entered standby mode. Not a problem! (Or so I thought.) I will quickly repair a PostgreSQL database inside a crashing pod and run whatever command I need to make it cool again. How difficult could that be?
I will not bore you with what I tried, I will simply tell you what I ended up doing.
1) Prevent the pod from CrashLoopBackOff
Edit the StatefulSet
kubectl edit statefulsets.apps postgresql-ha-postgresql
And put in a sleep command
spec:
  containers:
  - command: ['sh', '-c', 'echo The app is running! && sleep 3600']
    env:
    - name: BITNAMI_DEBUG
      value: "true"
    - name: POSTGRESQL_VOLUME_DIR
      value: /bitnami/postgresql
    - name: PGDATA
      value: /bitnami/postgresql/data
    - name: POSTGRES_USER
      value: postgres
This will make the pod sleep for 3600 seconds and override the CMD and ENTRYPOINT from the container image.
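If you would rather not hand-edit the YAML, the same override can be applied as a JSON patch. A sketch, assuming the postgresql container is the first (index 0) in the pod spec:
# Override the image's entrypoint so the pod just sleeps
kubectl patch statefulset postgresql-ha-postgresql --type='json' \
  -p='[{"op":"add","path":"/spec/template/spec/containers/0/command","value":["sh","-c","echo The app is running! && sleep 3600"]}]'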
2) Remove Health Checks
Remaining in that StatefulSet, remove the livenessProbe
livenessProbe:
  exec:
    command:
    - sh
    - -c
    - PGPASSWORD=$POSTGRES_PASSWORD psql -w -U "postgres" -d "postgres" -h 127.0.0.1 -c "SELECT 1"
  failureThreshold: 6
  initialDelaySeconds: 30
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 5
And the readinessProbe
readinessProbe:
  exec:
    command:
    - sh
    - -c
    - PGPASSWORD=$POSTGRES_PASSWORD psql -w -U "postgres" -d "postgres" -h 127.0.0.1 -c "SELECT 1"
  failureThreshold: 6
  initialDelaySeconds: 5
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 5
If you do not do this, the pod will still be restarted by the livenessProbe even though it is only sleeping, and it will never be marked Ready because of the readinessProbe.
Save and quit
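Alternatively, both probes can be dropped with a JSON patch as well (again assuming the postgresql container sits at index 0):
# Remove both probes so Kubernetes leaves the sleeping pod alone
kubectl patch statefulset postgresql-ha-postgresql --type='json' \
  -p='[{"op":"remove","path":"/spec/template/spec/containers/0/livenessProbe"},{"op":"remove","path":"/spec/template/spec/containers/0/readinessProbe"}]'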
3) Become the Pod
With those changes made, you will need to manually delete the pod
kubectl delete po postgresql-ha-postgresql-0
And once it has started up, you can shell into it
kubectl exec postgresql-ha-postgresql-0 -it -- /bin/bash
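If you want to script this instead of watching the pod come up, kubectl can block until it is ready (the 120 second timeout is an arbitrary choice; with the probes gone, the container counts as Ready as soon as it is running):
# Block until the sleeping pod is Ready, then shell in
kubectl wait --for=condition=Ready pod/postgresql-ha-postgresql-0 --timeout=120s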
4) Fix what you broke
Once you have shelled into the pod, start PostgreSQL
/opt/bitnami/scripts/postgresql-repmgr/run.sh
With PostgreSQL running, we can actually see what is wrong. Check the status of repmgr
/opt/bitnami/scripts/postgresql-repmgr/entrypoint.sh repmgr -f /opt/bitnami/repmgr/conf/repmgr.conf cluster show
Here is the error I had
postgresql-repmgr 18:37:20.91
postgresql-repmgr 18:37:20.91 Welcome to the Bitnami postgresql-repmgr container
postgresql-repmgr 18:37:20.92 Subscribe to project updates by watching https://github.com/bitnami/bitnami-docker-postgresql-repmgr
postgresql-repmgr 18:37:20.92 Submit issues and feature requests at https://github.com/bitnami/bitnami-docker-postgresql-repmgr/issues
postgresql-repmgr 18:37:20.92
postgresql-repmgr 18:37:20.93 DEBUG ==> Configuring libnss_wrapper...
 ID   | Name                       | Role    | Status               | Upstream | Location | Priority | Timeline | Connection string
------+----------------------------+---------+----------------------+----------+----------+----------+----------+--------------------
 1000 | postgresql-ha-postgresql-0 | primary | ! running as standby |          | default  | 100      | 4        | user=repmgr password=sMOIs9AZow host=postgresql-ha-postgresql-0.postgresql-ha-postgresql-headless.lsd-automate.svc.cluster.local dbname=repmgr port=5432 connect_timeout=5
 1001 | postgresql-ha-postgresql-1 | primary | ? unreachable        | ?        | default  | 100      |          | user=repmgr password=sMOIs9AZow host=postgresql-ha-postgresql-1.postgresql-ha-postgresql-headless.lsd-automate.svc.cluster.local dbname=repmgr port=5432 connect_timeout=5

WARNING: following issues were detected
- node "postgresql-ha-postgresql-0" (ID: 1000) is registered as an inactive primary but running as standby
- unable to connect to node "postgresql-ha-postgresql-1" (ID: 1001)
- node "postgresql-ha-postgresql-1" (ID: 1001) is registered as an active primary but is unreachable

HINT: execute with --verbose option to see connection error messages
And here is the chicken-and-egg / dumb-and-dumber part: the first pod will not become Ready because it is running as a standby, and because this is a StatefulSet, the second pod never starts up since it is waiting for the first pod to be Ready.
“Insert Nordic curse words here”
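The deadlock exists because a StatefulSet's default podManagementPolicy is OrderedReady: pods are created strictly in order, and each must be Ready before the next one starts. You can confirm what yours is set to:
# Prints the pod management policy (the API server defaults it to OrderedReady)
kubectl get statefulset postgresql-ha-postgresql -o jsonpath='{.spec.podManagementPolicy}'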
We can keep digging while we are here and see how the node is doing
/opt/bitnami/scripts/postgresql-repmgr/entrypoint.sh repmgr -f /opt/bitnami/repmgr/conf/repmgr.conf node check
And that returned the following
postgresql-repmgr 18:41:05.32
postgresql-repmgr 18:41:05.32 Welcome to the Bitnami postgresql-repmgr container
postgresql-repmgr 18:41:05.32 Subscribe to project updates by watching https://github.com/bitnami/bitnami-docker-postgresql-repmgr
postgresql-repmgr 18:41:05.32 Submit issues and feature requests at https://github.com/bitnami/bitnami-docker-postgresql-repmgr/issues
postgresql-repmgr 18:41:05.33
postgresql-repmgr 18:41:05.33 DEBUG ==> Configuring libnss_wrapper...
Node "postgresql-ha-postgresql-0":
Server role: CRITICAL (node is registered as primary but running as standby)
Replication lag: OK (0 seconds)
WAL archiving: OK (0 pending archive ready files)
Upstream connection: CRITICAL (node "postgresql-ha-postgresql-0" (ID: 1000) is a standby but no upstream record found)
Downstream servers: OK (this node has no downstream nodes)
Replication slots: OK (node has no physical replication slots)
Missing physical replication slots: OK (node has no missing physical replication slots)
Configured data directory: OK (configured "data_directory" is "/bitnami/postgresql/data")
OK, confirmed, we have issues. Stop messing about and run the repair command: let's promote this server from standby to primary.
/opt/bitnami/scripts/postgresql-repmgr/entrypoint.sh repmgr -f /opt/bitnami/repmgr/conf/repmgr.conf standby promote
Which gives the output
postgresql-repmgr 18:42:13.44
postgresql-repmgr 18:42:13.44 Welcome to the Bitnami postgresql-repmgr container
postgresql-repmgr 18:42:13.44 Subscribe to project updates by watching https://github.com/bitnami/bitnami-docker-postgresql-repmgr
postgresql-repmgr 18:42:13.45 Submit issues and feature requests at https://github.com/bitnami/bitnami-docker-postgresql-repmgr/issues
postgresql-repmgr 18:42:13.45
postgresql-repmgr 18:42:13.46 DEBUG ==> Configuring libnss_wrapper...
NOTICE: promoting standby to primary
DETAIL: promoting server "postgresql-ha-postgresql-0" (ID: 1000) using "/opt/bitnami/postgresql/bin/pg_ctl -w -D '/bitnami/postgresql/data' promote"
waiting for server to promote..... done
server promoted
NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
NOTICE: STANDBY PROMOTE successful
DETAIL: server "postgresql-ha-postgresql-0" (ID: 1000) was successfully promoted to primary
[REPMGR EVENT] Node id: 1000; Event type: standby_promote; Success [1|0]: 1; Time: 2020-09-16 18:42:14.771401+00; Details: server "postgresql-ha-postgresql-0" (ID: 1000) was successfully promoted to primary
Looking for the script: /opt/bitnami/repmgr/events/execs/standby_promote.sh
[REPMGR EVENT] will execute script '/opt/bitnami/repmgr/events/execs/standby_promote.sh' for the event
[REPMGR EVENT::standby_promote] Node id: 1000; Event type: standby_promote; Success [1|0]: 1; Time: 2020-09-16 18:42:14.771401+00; Details: server "postgresql-ha-postgresql-0" (ID: 1000) was successfully promoted to primary
[REPMGR EVENT::standby_promote] Locking primary...
[REPMGR EVENT::standby_promote] Unlocking standby...
And now if we look at the cluster it looks a lot better
/opt/bitnami/scripts/postgresql-repmgr/entrypoint.sh repmgr -f /opt/bitnami/repmgr/conf/repmgr.conf cluster show
Which shows us
postgresql-repmgr 18:42:28.78
postgresql-repmgr 18:42:28.78 Welcome to the Bitnami postgresql-repmgr container
postgresql-repmgr 18:42:28.79 Subscribe to project updates by watching https://github.com/bitnami/bitnami-docker-postgresql-repmgr
postgresql-repmgr 18:42:28.79 Submit issues and feature requests at https://github.com/bitnami/bitnami-docker-postgresql-repmgr/issues
postgresql-repmgr 18:42:28.79
postgresql-repmgr 18:42:28.80 DEBUG ==> Configuring libnss_wrapper...
 ID   | Name                       | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
------+----------------------------+---------+-----------+----------+----------+----------+----------+--------------------
 1000 | postgresql-ha-postgresql-0 | primary | * running |          | default  | 100      | 5        | user=repmgr password=sMOIs9AZow host=postgresql-ha-postgresql-0.postgresql-ha-postgresql-headless.lsd-automate.svc.cluster.local dbname=repmgr port=5432 connect_timeout=5
 1001 | postgresql-ha-postgresql-1 | primary | - failed  | ?        | default  | 100      |          | user=repmgr password=sMOIs9AZow host=postgresql-ha-postgresql-1.postgresql-ha-postgresql-headless.lsd-automate.svc.cluster.local dbname=repmgr port=5432 connect_timeout=5

WARNING: following issues were detected
- unable to connect to node "postgresql-ha-postgresql-1" (ID: 1001)

HINT: execute with --verbose option to see connection error messages
5) Put everything back the way you found it
Edit your StatefulSet again
kubectl edit statefulsets.apps postgresql-ha-postgresql
And remove
command: ['sh', '-c', 'echo The app is running! && sleep 3600']
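Or, if you went the kubectl patch route earlier, drop the override the same way (again assuming container index 0):
# Remove the temporary sleep command so the image's own entrypoint runs again
kubectl patch statefulset postgresql-ha-postgresql --type='json' \
  -p='[{"op":"remove","path":"/spec/template/spec/containers/0/command"}]'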
And put back the livenessProbe
livenessProbe:
  exec:
    command:
    - sh
    - -c
    - PGPASSWORD=$POSTGRES_PASSWORD psql -w -U "postgres" -d "postgres" -h 127.0.0.1 -c "SELECT 1"
  failureThreshold: 6
  initialDelaySeconds: 30
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 5
And the readinessProbe
readinessProbe:
  exec:
    command:
    - sh
    - -c
    - PGPASSWORD=$POSTGRES_PASSWORD psql -w -U "postgres" -d "postgres" -h 127.0.0.1 -c "SELECT 1"
  failureThreshold: 6
  initialDelaySeconds: 5
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 5
Finally, restart your pods
kubectl delete po postgresql-ha-postgresql-1 postgresql-ha-postgresql-0 postgresql-ha-pgpool-74c8574dbf-vnjtb
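Once the pods are back, it is worth confirming that repmgr also agrees everything is healthy, reusing the cluster show command from earlier:
# Confirm the cluster state from the restarted primary
kubectl exec postgresql-ha-postgresql-0 -it -- \
  /opt/bitnami/scripts/postgresql-repmgr/entrypoint.sh repmgr -f /opt/bitnami/repmgr/conf/repmgr.conf cluster show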
I hope this helps someone else in the future so that you don't have to go through what I did. And as always, if there is a better way to do this, add a comment.
What is LSD?
If you saw the words “LSD” and were curious….well….
LSD is an open source focused company that helps companies along their journey into the cloud, or what we like to call the LSD Trip.