This is a story of how a single pod can bring a node and cluster to its knees!

I’ve been working with Kubernetes for a while, and one of the big issues I have to deal with is storage. Now, I hear your cries: storage on Kubernetes is a problem. But I am not talking about Persistent Volumes. I am talking about something else that might be a bigger pain in the butt: pod log storage!

It’s not the same!

What exactly am I talking about? I am talking about Node-pressure Eviction, and specifically DiskPressure.

https://kubernetes.io/docs/concepts/scheduling-eviction/node-pressure-eviction/#node-conditions
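
To make the DiskPressure condition a little more concrete, here is a minimal Python sketch that picks out the nodes currently reporting it. It assumes you already have the parsed output of `kubectl get nodes -o json` in hand; the function name `nodes_under_disk_pressure` and the sample node names are hypothetical, made up purely for illustration.

```python
def nodes_under_disk_pressure(node_list: dict) -> list:
    """Return the names of nodes whose DiskPressure condition is "True".

    `node_list` is expected to look like the parsed JSON from
    `kubectl get nodes -o json` (an object with an "items" array).
    """
    pressured = []
    for node in node_list.get("items", []):
        conditions = node.get("status", {}).get("conditions", [])
        for cond in conditions:
            # Condition status is the string "True", "False" or "Unknown"
            if cond.get("type") == "DiskPressure" and cond.get("status") == "True":
                pressured.append(node["metadata"]["name"])
                break
    return pressured


# Hypothetical sample, trimmed down to the fields the function reads
sample = {
    "items": [
        {
            "metadata": {"name": "worker-1"},
            "status": {"conditions": [
                {"type": "MemoryPressure", "status": "False"},
                {"type": "DiskPressure", "status": "True"},
            ]},
        },
        {
            "metadata": {"name": "worker-2"},
            "status": {"conditions": [
                {"type": "DiskPressure", "status": "False"},
            ]},
        },
    ]
}

print(nodes_under_disk_pressure(sample))  # prints ['worker-1']
```

Against a real cluster you would feed it something like `json.load(sys.stdin)` with the kubectl output piped in, rather than the hard-coded sample.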


Those of us playing in the land of Kubernetes / OpenShift know about etcd and how it’s the backend that stores the data for Kubernetes. We also know it’s important, which is why most Kubernetes installation docs will say “install 3 etcd replicas” so that if one fails you are fine.

etcd is also super lightweight, meaning you can back up clusters in a minute with a small snapshot file.

Here is an example of an etcd snapshot size of a cluster of 16 nodes running 662 pods.

Hostnames have been removed to protect the innocent.

About 300 MB for a daily backup and 2.3 GB for…


Disclaimer

Before I begin, I want to state that this is not an article to throw anybody under the bus, or to blame any company. We are partners with all the major cloud providers and all the major open source vendors, and we value those partnerships. This is more a dark tale of horror and fear that could happen to each and every one of you, so beware and heed our warning…

On with the show

At LSD, where I work, one of the core values is freedom (and openness, that one is pretty cool too). Because LSD is an open source technology company, with a…


I am writing this article so that others do not have to go through the pain I just experienced, after I rebooted a couple of Kubernetes nodes because the time had drifted substantially on all of them.

We have a project running on Kubernetes that is composed of many services all sharing a central, highly available PostgreSQL database. We first detected the time drift issue when gRPC on GitLab wasn’t working. Once we corrected the NTP settings, I issued a reboot on the nodes, and that is where our story begins…


Damn, that title is a mouthful. Oh yeah, I wrote this in July 2020. If you are currently in the year 2022, this might be outdated…

I recently received a request from a customer to migrate their HashiCorp Vault installation from a traditional virtual machine deployment onto a newly deployed OpenShift (think Kubernetes with more bells and whistles) platform. They had recently tasted the benefits of Kubernetes and wanted their Vault setup to benefit from it too. Their existing Vault installation suffered from the usual banes of IT, such as:

  • Load Balancing wasn’t completed 100%
  • Upgrades were not maintained
  • The…

Neil White

Mech Warrior Overlord @ LSD. I spend my days killing Kubernetes, operating OpenShift, hollering at Helm, vanquishing Vaults and conquering Clouds!
