Recovering from PostgreSQL-HA Failure on Kubernetes

Neil White
8 min read · Sep 16, 2020

I am writing this article so that others do not have to go through the pain I just experienced, after I rebooted a couple of Kubernetes nodes because the time had drifted substantially on all of them.

We have a project running on Kubernetes that is composed of many services all sharing a central, highly available PostgreSQL database. We first detected the time drift issue when gRPC on Gitlab wasn’t working. Once we corrected the NTP settings, I issued a reboot on the nodes and that is where our story begins…
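
(For what it is worth, confirming the drift itself is quick. A sketch, assuming the nodes run systemd and use chrony as the NTP client:)

# Check whether the node's clock is synchronised (systemd-based nodes)
timedatectl status

# If chrony is the NTP client, this reports the measured offset from the time source
chronyc tracking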

Disclaimer!
I am in no way a PostgreSQL expert. If there is a better way of solving this problem, please share, as I couldn’t find the answer.

Because we have a central PostgreSQL that multiple services use, we opted to deploy the Bitnami PostgreSQL-HA Helm chart. It was quick and simple and I didn’t take too much time digging into it. Once deployed, it provides the following pods:

postgresql-ha-pgpool-74c8574dbf-skdwx
postgresql-ha-postgresql-0
postgresql-ha-postgresql-1

With the following Kubernetes services:

postgresql-ha-pgpool
postgresql-ha-postgresql
postgresql-ha-postgresql-headless
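
For context, deploying the chart looks roughly like this. The release name and namespace here are assumptions based on the names that appear in this post, not copied from our actual install:

helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update
helm install postgresql-ha bitnami/postgresql-ha --namespace lsd-automate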

After I rebooted the nodes, all 3 PostgreSQL pods were going into CrashLoopBackOff. When I looked at the logs, I saw the following error:

postgresql-repmgr 18:20:05.54 INFO  ==> Starting PostgreSQL in background...
waiting for server to start....2020-09-16 18:20:05.612 GMT [120] LOG: listening on IPv4 address "0.0.0.0", port 5432
2020-09-16 18:20:05.613 GMT [120] LOG: listening on IPv6 address "::", port 5432
2020-09-16 18:20:05.616 GMT [120] LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
2020-09-16 18:20:05.640 GMT [120] LOG: redirecting log output to logging collector process
2020-09-16 18:20:05.640 GMT [120] HINT: Future log output will appear in directory "/opt/bitnami/postgresql/logs".
2020-09-16 18:20:05.643 GMT [122] LOG: database system was interrupted while in recovery at log time 2020-09-16 15:04:55 GMT
2020-09-16 18:20:05.643 GMT [122] HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
2020-09-16 18:20:05.769 GMT [122] LOG: entering standby mode
2020-09-16 18:20:05.782 GMT [122] LOG: redo starts at 0/24000028
2020-09-16 18:20:05.782 GMT [122] LOG: consistent recovery state reached at 0/25000000
2020-09-16 18:20:05.783 GMT [120] LOG: database system is ready to accept read only connections
2020-09-16 18:20:05.793 GMT [126] LOG: started streaming WAL from primary at 0/25000000 on timeline 4
done
server started
postgresql-repmgr 18:20:05.87 INFO ==> ** Starting repmgrd **
[2020-09-16 18:20:05] [NOTICE] repmgrd (repmgrd 5.1.0) starting up
[2020-09-16 18:20:05] [ERROR] this node is marked as inactive and cannot be used as a failover target
[2020-09-16 18:20:05] [HINT] Check that "repmgr (primary|standby) register" was executed for this node
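
(To pull logs like those from a crash-looping pod, the --previous flag is handy because it shows the output of the last terminated attempt:)

kubectl get pods
kubectl logs postgresql-ha-postgresql-0 --previous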

So the primary node has entered standby mode. Not a problem (or so I thought): I will quickly repair a PostgreSQL database, inside a pod that is crashing, and run whatever command I need to run to make it cool again. How difficult could that be?

I will not bore you with everything I tried; I will simply tell you what I ended up doing.

1) Prevent the pod from CrashLoopBackOff

Edit the statefulset

kubectl edit statefulsets.apps postgresql-ha-postgresql

And put in a sleep command

spec:
  containers:
  - command: ['sh', '-c', 'echo The app is running! && sleep 3600']
    env:
    - name: BITNAMI_DEBUG
      value: "true"
    - name: POSTGRESQL_VOLUME_DIR
      value: /bitnami/postgresql
    - name: PGDATA
      value: /bitnami/postgresql/data
    - name: POSTGRES_USER
      value: postgres

This overrides the image's entrypoint and command so that, instead of starting PostgreSQL, the container just sleeps for 3600 seconds (an hour).
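
If you prefer a one-liner over an interactive edit, the same override can be applied as a JSON patch. This is only an equivalent sketch, and it assumes the PostgreSQL container is the first (index 0) container in the pod template:

kubectl patch statefulset postgresql-ha-postgresql --type=json -p='[
  {"op": "add",
   "path": "/spec/template/spec/containers/0/command",
   "value": ["sh", "-c", "echo The app is running! && sleep 3600"]}
]'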

2) Remove Health Checks

Remaining in that statefulset, remove the livenessProbe

livenessProbe:
  exec:
    command:
    - sh
    - -c
    - PGPASSWORD=$POSTGRES_PASSWORD psql -w -U "postgres" -d "postgres" -h 127.0.0.1 -c "SELECT 1"
  failureThreshold: 6
  initialDelaySeconds: 30
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 5

And readinessProbe

readinessProbe:
  exec:
    command:
    - sh
    - -c
    - PGPASSWORD=$POSTGRES_PASSWORD psql -w -U "postgres" -d "postgres" -h 127.0.0.1 -c "SELECT 1"
  failureThreshold: 6
  initialDelaySeconds: 5
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 5

If you do not do this, the pod will keep getting restarted by the livenessProbe (even though it is only sleeping) and will never be marked Ready because of the readinessProbe.

Save and quit
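
(The probe removal can likewise be scripted as a JSON patch instead of an interactive edit; again just a sketch, assuming container index 0:)

kubectl patch statefulset postgresql-ha-postgresql --type=json -p='[
  {"op": "remove", "path": "/spec/template/spec/containers/0/livenessProbe"},
  {"op": "remove", "path": "/spec/template/spec/containers/0/readinessProbe"}
]'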

3) Become the Pod

With those changes made, you will need to delete the pod manually so that it is recreated with the new spec

kubectl delete po postgresql-ha-postgresql-0

And once it has started up, you can shell into it

kubectl exec postgresql-ha-postgresql-0 -it -- /bin/bash

4) Fix what you broke

Once you have shelled into the pod, start PostgreSQL

/opt/bitnami/scripts/postgresql-repmgr/run.sh
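
Before touching repmgr, it is worth confirming what PostgreSQL itself thinks it is: pg_is_in_recovery() returns t on a standby. The connection details below simply mirror the probe commands from the statefulset (run this from a second shell into the pod if run.sh stays in the foreground):

PGPASSWORD=$POSTGRES_PASSWORD psql -w -U postgres -d postgres -h 127.0.0.1 \
  -c "SELECT pg_is_in_recovery();"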

With PostgreSQL running, we can actually see what is wrong. Check the status of repmgr

/opt/bitnami/scripts/postgresql-repmgr/entrypoint.sh repmgr -f /opt/bitnami/repmgr/conf/repmgr.conf cluster show

Here is the error I had

postgresql-repmgr 18:37:20.91 
postgresql-repmgr 18:37:20.91 Welcome to the Bitnami postgresql-repmgr container
postgresql-repmgr 18:37:20.92 Subscribe to project updates by watching https://github.com/bitnami/bitnami-docker-postgresql-repmgr
postgresql-repmgr 18:37:20.92 Submit issues and feature requests at https://github.com/bitnami/bitnami-docker-postgresql-repmgr/issues
postgresql-repmgr 18:37:20.92
postgresql-repmgr 18:37:20.93 DEBUG ==> Configuring libnss_wrapper...
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
------+----------------------------+---------+----------------------+----------+----------+----------+----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1000 | postgresql-ha-postgresql-0 | primary | ! running as standby | | default | 100 | 4 | user=repmgr password=sMOIs9AZow host=postgresql-ha-postgresql-0.postgresql-ha-postgresql-headless.lsd-automate.svc.cluster.local dbname=repmgr port=5432 connect_timeout=5
1001 | postgresql-ha-postgresql-1 | primary | ? unreachable | ? | default | 100 | | user=repmgr password=sMOIs9AZow host=postgresql-ha-postgresql-1.postgresql-ha-postgresql-headless.lsd-automate.svc.cluster.local dbname=repmgr port=5432 connect_timeout=5
WARNING: following issues were detected
- node "postgresql-ha-postgresql-0" (ID: 1000) is registered as an inactive primary but running as standby
- unable to connect to node "postgresql-ha-postgresql-1" (ID: 1001)
- node "postgresql-ha-postgresql-1" (ID: 1001) is registered as an active primary but is unreachable
HINT: execute with --verbose option to see connection error messages


And here is the chicken and egg / dumb and dumber part: the first pod will not become ready, because it is running as a standby with no primary to follow, and because it is a statefulset with ordered startup, the second pod never starts because it is waiting for the first pod to be ready.

“Insert Nordic curse words here”
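
As an aside for future deployments: that ordered startup is the StatefulSet's default OrderedReady behaviour. Setting podManagementPolicy: Parallel lets the pods start independently, but as far as I know it cannot be changed on an existing StatefulSet, so it was no help to me here. For reference, it sits at the top level of the spec (values below are illustrative):

spec:
  podManagementPolicy: Parallel   # default is OrderedReady
  serviceName: postgresql-ha-postgresql-headless
  replicas: 2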

We can keep digging while we are here and see how the node is doing

/opt/bitnami/scripts/postgresql-repmgr/entrypoint.sh repmgr -f /opt/bitnami/repmgr/conf/repmgr.conf node check

And that returned the following

postgresql-repmgr 18:41:05.32 
postgresql-repmgr 18:41:05.32 Welcome to the Bitnami postgresql-repmgr container
postgresql-repmgr 18:41:05.32 Subscribe to project updates by watching https://github.com/bitnami/bitnami-docker-postgresql-repmgr
postgresql-repmgr 18:41:05.32 Submit issues and feature requests at https://github.com/bitnami/bitnami-docker-postgresql-repmgr/issues
postgresql-repmgr 18:41:05.33
postgresql-repmgr 18:41:05.33 DEBUG ==> Configuring libnss_wrapper...
Node "postgresql-ha-postgresql-0":
Server role: CRITICAL (node is registered as primary but running as standby)
Replication lag: OK (0 seconds)
WAL archiving: OK (0 pending archive ready files)
Upstream connection: CRITICAL (node "postgresql-ha-postgresql-0" (ID: 1000) is a standby but no upstream record found)
Downstream servers: OK (this node has no downstream nodes)
Replication slots: OK (node has no physical replication slots)
Missing physical replication slots: OK (node has no missing physical replication slots)
Configured data directory: OK (configured "data_directory" is "/bitnami/postgresql/data")

OK, confirmed: we have issues. Enough messing about, run the repair command. Let’s promote this server from standby to primary.

/opt/bitnami/scripts/postgresql-repmgr/entrypoint.sh repmgr -f /opt/bitnami/repmgr/conf/repmgr.conf standby promote

Which gives the output

postgresql-repmgr 18:42:13.44 
postgresql-repmgr 18:42:13.44 Welcome to the Bitnami postgresql-repmgr container
postgresql-repmgr 18:42:13.44 Subscribe to project updates by watching https://github.com/bitnami/bitnami-docker-postgresql-repmgr
postgresql-repmgr 18:42:13.45 Submit issues and feature requests at https://github.com/bitnami/bitnami-docker-postgresql-repmgr/issues
postgresql-repmgr 18:42:13.45
postgresql-repmgr 18:42:13.46 DEBUG ==> Configuring libnss_wrapper...
NOTICE: promoting standby to primary
DETAIL: promoting server "postgresql-ha-postgresql-0" (ID: 1000) using "/opt/bitnami/postgresql/bin/pg_ctl -w -D '/bitnami/postgresql/data' promote"
waiting for server to promote..... done
server promoted
NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
NOTICE: STANDBY PROMOTE successful
DETAIL: server "postgresql-ha-postgresql-0" (ID: 1000) was successfully promoted to primary
[REPMGR EVENT] Node id: 1000; Event type: standby_promote; Success [1|0]: 1; Time: 2020-09-16 18:42:14.771401+00; Details: server "postgresql-ha-postgresql-0" (ID: 1000) was successfully promoted to primary
Looking for the script: /opt/bitnami/repmgr/events/execs/standby_promote.sh
[REPMGR EVENT] will execute script '/opt/bitnami/repmgr/events/execs/standby_promote.sh' for the event
[REPMGR EVENT::standby_promote] Node id: 1000; Event type: standby_promote; Success [1|0]: 1; Time: 2020-09-16 18:42:14.771401+00; Details: server "postgresql-ha-postgresql-0" (ID: 1000) was successfully promoted to primary
[REPMGR EVENT::standby_promote] Locking primary...
[REPMGR EVENT::standby_promote] Unlocking standby...

And now if we look at the cluster it looks a lot better

/opt/bitnami/scripts/postgresql-repmgr/entrypoint.sh repmgr -f /opt/bitnami/repmgr/conf/repmgr.conf cluster show

Which shows us

postgresql-repmgr 18:42:28.78 
postgresql-repmgr 18:42:28.78 Welcome to the Bitnami postgresql-repmgr container
postgresql-repmgr 18:42:28.79 Subscribe to project updates by watching https://github.com/bitnami/bitnami-docker-postgresql-repmgr
postgresql-repmgr 18:42:28.79 Submit issues and feature requests at https://github.com/bitnami/bitnami-docker-postgresql-repmgr/issues
postgresql-repmgr 18:42:28.79
postgresql-repmgr 18:42:28.80 DEBUG ==> Configuring libnss_wrapper...
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
------+----------------------------+---------+-----------+----------+----------+----------+----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1000 | postgresql-ha-postgresql-0 | primary | * running | | default | 100 | 5 | user=repmgr password=sMOIs9AZow host=postgresql-ha-postgresql-0.postgresql-ha-postgresql-headless.lsd-automate.svc.cluster.local dbname=repmgr port=5432 connect_timeout=5
1001 | postgresql-ha-postgresql-1 | primary | - failed | ? | default | 100 | | user=repmgr password=sMOIs9AZow host=postgresql-ha-postgresql-1.postgresql-ha-postgresql-headless.lsd-automate.svc.cluster.local dbname=repmgr port=5432 connect_timeout=5
WARNING: following issues were detected
- unable to connect to node "postgresql-ha-postgresql-1" (ID: 1001)
HINT: execute with --verbose option to see connection error messages

5) Put everything back the way you found it

Edit your statefulset again

kubectl edit statefulsets.apps postgresql-ha-postgresql

And remove

command: ['sh', '-c', 'echo The app is running! && sleep 3600']

And restore the livenessProbe

livenessProbe:
  exec:
    command:
    - sh
    - -c
    - PGPASSWORD=$POSTGRES_PASSWORD psql -w -U "postgres" -d "postgres" -h 127.0.0.1 -c "SELECT 1"
  failureThreshold: 6
  initialDelaySeconds: 30
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 5

And restore the readinessProbe

readinessProbe:
  exec:
    command:
    - sh
    - -c
    - PGPASSWORD=$POSTGRES_PASSWORD psql -w -U "postgres" -d "postgres" -h 127.0.0.1 -c "SELECT 1"
  failureThreshold: 6
  initialDelaySeconds: 5
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 5

Finally, restart your pods

kubectl delete po postgresql-ha-postgresql-1 postgresql-ha-postgresql-0 postgresql-ha-pgpool-74c8574dbf-vnjtb
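
Once the pods are back, it is worth re-running cluster show from one of them to confirm that both nodes are registered and running; node 1 should come back as a standby once it re-syncs with the new primary (if it does not, its data volume may need more attention):

kubectl exec postgresql-ha-postgresql-0 -it -- \
  /opt/bitnami/scripts/postgresql-repmgr/entrypoint.sh repmgr -f /opt/bitnami/repmgr/conf/repmgr.conf cluster show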

I hope this helps someone else in the future so you don’t have to go through what I did. And, as always, if there is a better way to do this, add a comment.

What is LSD?

If you saw the words “LSD” and were curious… well…

LSD is an open source focused company that helps companies along their journey into the cloud, or what we like to call the LSD Trip.


Neil White

Mech Warrior Overlord @ LSD. I spend my days killing Kubernetes, operating Openshift, hollering at Helm, vanquishing Vaults and conquering Clouds!