Recovering from PostgreSQL-HA Failure on Kubernetes

I am writing this article so that others do not have to go through the pain I just experienced after rebooting a couple of Kubernetes nodes because their clocks had drifted substantially.

We have a project running on Kubernetes that is composed of many services, all sharing a central, highly available PostgreSQL database. We first detected the time drift issue when gRPC on GitLab wasn’t working. Once we corrected the NTP settings, I rebooted the nodes, and that is where our story begins…

Disclaimer!
I am in no way a PostgreSQL expert. If there is a better way of solving this problem, please share, as I couldn’t find the answer.

Because we have a central PostgreSQL instance that multiple services use, we opted to deploy the Bitnami PostgreSQL-HA Helm chart. It was quick and simple and I didn’t take too much time digging into it. Once deployed, it provides the following pods
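Roughly the following, give or take the hash suffixes (pod names here assume the chart release is called postgresql, so adjust for yours):

```shell
kubectl get pods
# postgresql-postgresql-ha-pgpool-xxxxxxxxxx-xxxxx   pgpool, which clients connect through
# postgresql-postgresql-ha-postgresql-0              the repmgr-managed PostgreSQL nodes
# postgresql-postgresql-ha-postgresql-1
# postgresql-postgresql-ha-postgresql-2
```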

With the following Kubernetes services
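Again assuming a release called postgresql, these look something like this; applications talk to the pgpool service, which routes writes to the current primary:

```shell
kubectl get svc
# postgresql-postgresql-ha-pgpool                  what applications point at
# postgresql-postgresql-ha-postgresql              the PostgreSQL nodes behind pgpool
# postgresql-postgresql-ha-postgresql-headless     used internally by the statefulset
```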

After I rebooted the nodes, all 3 PostgreSQL pods were going into CrashLoopBackOff. When I looked at the logs, I saw the following error
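To pull those logs out of a crash-looping pod yourself (pod name assumed as before):

```shell
kubectl get pods                                                # confirms the CrashLoopBackOff
kubectl logs postgresql-postgresql-ha-postgresql-0 --previous   # logs from the last failed run
```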

So the Primary node has entered Standby Mode. Not a problem! (Or so I thought.) I would just quickly repair a PostgreSQL database, inside a pod that is crashing, and run whatever command I needed to make it cool again. How difficult could that be?

I will not bore you with everything I tried; I will simply tell you what I ended up doing.

1) Prevent the pod from CrashLoopBackOff

Edit the statefulset
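With the assumed chart naming, that is something like:

```shell
kubectl edit statefulset postgresql-postgresql-ha-postgresql
```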

And put in a sleep command
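For me that meant adding something along these lines to the PostgreSQL container in the pod template (if the chart already sets command or args on that container, adjust what is there rather than duplicating it):

```yaml
# added under spec.template.spec.containers[0]
command:
  - sleep
  - "3600"
```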

This will make the pod sleep for 3600 seconds and override the CMD and ENTRYPOINT from the container image.

2) Remove Health Checks

Remaining in that statefulset, remove the livenessProbe

And readinessProbe

If you do not do this, the sleeping pod will still be restarted by the livenessProbe and will never be marked ready (or accept traffic) because of the readinessProbe.

Save and quit
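If you would rather script steps 1 and 2 than do them by hand in the editor, the same changes can be applied with kubectl patch (statefulset name assumed as before):

```shell
STS=postgresql-postgresql-ha-postgresql   # adjust to your release name

# Override the container entrypoint so the pod just sleeps instead of starting PostgreSQL
kubectl patch statefulset "$STS" --type json \
  -p '[{"op": "add", "path": "/spec/template/spec/containers/0/command", "value": ["sleep", "3600"]}]'

# Drop both probes so the kubelet leaves the sleeping pod alone
kubectl patch statefulset "$STS" --type json \
  -p '[{"op": "remove", "path": "/spec/template/spec/containers/0/livenessProbe"},
       {"op": "remove", "path": "/spec/template/spec/containers/0/readinessProbe"}]'
```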

3) Become the Pod

With those changes made, you will need to manually delete the pod
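In my case that was the first pod of the statefulset (name assumed as before):

```shell
kubectl delete pod postgresql-postgresql-ha-postgresql-0
```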

And once it has started up, you can shell into it
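Again with the assumed pod name:

```shell
kubectl exec -it postgresql-postgresql-ha-postgresql-0 -- bash
```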

4) Fix what you broke

Once you have shelled into the pod, start PostgreSQL
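Inside the Bitnami container that is roughly the following (the data directory path is an assumption based on the image layout, so double-check it against your volume mount):

```shell
# pg_ctl lives under /opt/bitnami/postgresql/bin if it is not already on the PATH
pg_ctl -D /bitnami/postgresql/data start
```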

With PostgreSQL running, we can actually see what is wrong. Check the status of repmgr
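Which, with the Bitnami layout, looks something like:

```shell
# The repmgr.conf path is an assumption based on the Bitnami image; adjust if yours differs
repmgr -f /opt/bitnami/repmgr/conf/repmgr.conf cluster show
```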

Here is the error I had

That wrapping above is terrible, so here is a screenshot

And here is the chicken-and-egg / dumb-and-dumber part: the first pod will not start because it is running as a standby, and because this is a statefulset, the second pod never starts because it is waiting for the first pod to be ready.

“Insert Nordic curse words here”

We can keep digging while we are here and see how the node is doing
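repmgr has a per-node health check for this; roughly (same assumed config path as before):

```shell
repmgr -f /opt/bitnami/repmgr/conf/repmgr.conf node check
# `repmgr ... node status` gives a similar, more descriptive view
```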

And that returned the following

OK, confirmed, we have issues. Enough messing about, run the repair command. Let’s promote this server from Standby to Primary.
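The promotion itself is a one-liner (again with the assumed config path), run while the server we just started is up:

```shell
repmgr -f /opt/bitnami/repmgr/conf/repmgr.conf standby promote
```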

Which gives the output

And now if we look at the cluster it looks a lot better
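That is the same cluster check as before:

```shell
repmgr -f /opt/bitnami/repmgr/conf/repmgr.conf cluster show
```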

Which shows us

5) Put everything back the way you found it

Edit your statefulset again

And remove the sleep command

Then restore the livenessProbe

And the readinessProbe

Finally, restart your pods
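Step 5 can also be scripted; here is a rough sketch, with the release and resource names assumed as before and the original probes copied back out of the chart-rendered manifest rather than retyped:

```shell
STS=postgresql-postgresql-ha-postgresql   # adjust to your release name

# Drop the sleep override added in step 1
kubectl patch statefulset "$STS" --type json \
  -p '[{"op": "remove", "path": "/spec/template/spec/containers/0/command"}]'

# Copy the original livenessProbe/readinessProbe blocks back from the chart output
helm get manifest postgresql | less      # release name assumed to be "postgresql"

# Once the probes are back in the statefulset, bounce the pod so it picks up the restored spec
kubectl delete pod postgresql-postgresql-ha-postgresql-0
```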

I hope this helps someone else in the future, so you don’t have to go through what I did. And, as always, if there is a better way to do this, add a comment.

What is LSD?

If you saw the words “LSD” and were curious….well….

LSD is an open source focused company that helps companies along their journey into the cloud, or what we like to call the LSD Trip.
