This is a story of how a single pod can bring a node and cluster to it’s knees!

I’ve been working with Kubernetes for a while and one of the big issues I have to deal with is storage. Now I hear your cries, storage on Kubernetes is a problem, but I am not talking about Persistent Volumes, I am talking about something else that might be a bigger pain in the butt, pod log storage!

It’s not the same!

What exactly am I talking about? I am talking about Node-pressure Eviction, and specifically DiskPressure.

Top Tip. DiskPressure != OutOfDisk

Yeah, so who cares! Disk Pressure Disk Schmessure! Well sir (or lady however you identify) because disk is one of the things you cannot really control, you should be caring. Yes I have resource requests and constraints that I can set so that a single pod cannot use all the CPU or Memory, but AFAIK there is nothing stopping you from throttling disk IO or disk storage.

And what I mean by that is, what is stopping a tiny little pod from generating thousand upon thousands of log entries to stdout?

Nothing really. And this is a big big problem because those logs are written somewhere and are taking up storage somewhere? Yeah you never thought of that did you! Well thank Odin for me and what I am about to show you.

Pod logs have to go somewhere, right!

This is important because when the root filesystem or image filesystem hits it’s eviction threshold (standard is 85%) pods get evicted! Yes Kubernetes tries to mitigate this by trying to clear up some storage but if it cannot bring down the filesystem utilization it will evict all the pods on that node.

Where is the image filesystem? Good question. Unless you have asked for it to be in a different location, it is under /var/lib/docker (since most of us use docker)

Here is how you can find our your current utilization.

Remember if “Use%” hit’s 85% we are in for a bad time.

Let’s start at the beginning

Most of us use Docker, so I am going to work with that container runtime, however all container runtimes work on the same principal. Docker saves all logs from stdout to a flat file somewhere on a node. Want to find out where?

I am going to pick on a random pod to demonstrate how you find out exactly where logs are located “vitality-orchestration-service-13-hl4hx”

If I describe that pod it gives me some clues:

$ kubectl describe po vitality-orchestration-service-13-hl4hx | egrep "Node:|Container ID:"Node:
vgqengapps01/172.26.X.X
Container ID: docker://d28698dcd8830663d2980c632d3fc36786b6b57d0841092da3f2e6f4d2552177

That tells us that this pod is running on a node called “vgqengapps01” and on that node the pod has the docker container ID of “d28698dcd8830663d2980c632d3fc36786b6b57d0841092da3f2e6f4d2552177”

Let’s jump onto that node and get some more detail (you need to become root, it’s simply easier)

$ ssh vgqengapps01
$ sudo -s
$ docker inspect d28698dcd8830663d2980c632d3fc36786b6b57d0841092da3f2e6f4d2552177 | grep LogPath
"LogPath": "/var/lib/docker/containers/d28698dcd8830663d2980c632d3fc36786b6b57d0841092da3f2e6f4d2552177/d28698dcd8830663d2980c632d3fc36786b6b57d0841092da3f2e6f4d2552177-json.log",

That shows us the local location of that log file. That means that if I were to run:

$ cat /var/lib/docker/containers/d28698dcd8830663d2980c632d3fc36786b6b57d0841092da3f2e6f4d2552177/d28698dcd8830663d2980c632d3fc36786b6b57d0841092da3f2e6f4d2552177-json.log

or

$ kubectl logs vitality-programme-business-service-16-psm84

I will get the same result! Amazing!

Oh, and what is the size of all those logs?

$ ls -1sh /var/lib/docker/containers/d28698dcd8830663d2980c632d3fc36786b6b57d0841092da3f2e6f4d2552177/d28698dcd8830663d2980c632d3fc36786b6b57d0841092da3f2e6f4d2552177-json.log15M 

15 Megabytes, not bad not bad. But is anything stopping this guy (or gal, you know my feeling on this) from using 15 Gigabytes of logs? I will not regale you of a story where someone put an entire Kafka cluster running on Kubernetes in DEBUG mode and generated about 1 Terabytes of logs in a day. But the short answer is nothing stops this problem out the box.

Controlling the logs, somewhat…

Luckily for us Docker has a couple of setting we can adjust to keep excessive logging under control. Here is an example of some Docker settings we use in our clusters

$ cat /etc/docker/daemon.json{
"live-restore": true,
"storage-driver": "overlay2",
"exec-opts": ["native.cgroupdriver=systemd"],
"storage-opt":["overlay2.override_kernel_check=true","overlay2.size=10G"],
"log-driver": "json-file",
"log-level": "info",
"log-opts": {
"cache-disabled": "false",
"cache-max-file": "5",
"cache-max-size": "20m",
"cache-compress": "true",
"env": "os,customer",
"max-file": "5",
"max-size": "100m"

},
"fixed-cidr": "192.168.240.1/25",
"bip": "192.168.240.1/24"
}

The section highlighted in bold shows that Docker will make sure a log file does not use more than 100M and if it does, it will rotate it to a maximum of 5 log files. Meaning at no point can a single pod use more that 500M of logs.

Remember our pod we looked at earlier? This guy: vitality-programme-business-service-16-psm84

Did he use more than 100M?

When you find out it was using more than 15M!

Let’s actually change directory (this text wrapping is the worst)

$ cd /var/lib/docker/containers/d28698dcd8830663d2980c632d3fc36786b6b57d0841092da3f2e6f4d2552177$ ls -1sh
total 397M
16M d28698dcd8830663d2980c632d3fc36786b6b57d0841092da3f2e6f4d2552177-json.log
96M d28698dcd8830663d2980c632d3fc36786b6b57d0841092da3f2e6f4d2552177-json.log.1
96M d28698dcd8830663d2980c632d3fc36786b6b57d0841092da3f2e6f4d2552177-json.log.2
96M d28698dcd8830663d2980c632d3fc36786b6b57d0841092da3f2e6f4d2552177-json.log.3
96M d28698dcd8830663d2980c632d3fc36786b6b57d0841092da3f2e6f4d2552177-json.log.4

I knew it! That sneaky pod has been using way more than 15M!!! We can see that it’s logs have been rotated to the max! If we had not had that Docker setting those logs would keep on growing and growing, filling up /var/lib/docker and then hitting on 85% threshold and then we get Disk Pressures on the Node.

So problem solved? Time to get a beer? Time to relax because we have solved our issue? I mean, what’s the big pain about local storage…

Little Did He Know

Now our story really begins (feels like a Quentin Tarantino film) because this is what actually happened. Our support team kept getting these alerts at random times.

Prometheus and Alertmanger FTW

The team would log in and move pods around to balance memory, but this was becoming more and more frequent, so we want to get a better understanding. Enter Grafana

All is fine until it isn’t

To the keen observer we can see that all is well until 23:30, when suddenly vgqengapps03 goes from 81% total memory utilization to 9% total utilization. This is the wonderful concepts of self healing and node evictions coming to bite us in the ass.

After vgqengapps03 suddenly has all this memory available, we can see vgqengapps04 jumps up from 57% to above 90%, which lasts until 8:00am and then pattern repeats itself.

Here is the Catch-22 of Kubernetes and the nature of self healing:

  • Node goes Above 90% of Memory Usage
  • Alert is generated
  • Kubernetes evicts pod
  • Error clears!
  • All is fine again
Kubernetes! The choice for a Gru generation

Because of the large mount of time between alerts not much attention was paid to this problem and because it was actually self healing every 12 hours. After a few days we decide to check out why suddenly we had a these huge memory spikes.

Enter the DiskPressure!

This is where it become a bit weird, because we were only getting alerts on memory and nothing else. So of course we assumed it was a memory issue. Some pod was eating all the memory? But thanks to LimitRanges, which we implemented on all namespaces, we knew there couldn’t be a single pod eating up all the memory.

We also knew Docker had been configured to prevent a single pod from eating up all the storage and we hadn’t received a storage alert, so why on earth would we even look at storage?

Also, when it comes to storage you normally do not alert when a threshold is breached immediately as you get so many false positives. We normally only alert if a threshold has been breached for over an hour.

Nevertheless we brought up our Filesystem Utilization graph and …. what the hell!

See our alert is set to 85%. Yeah…

Ok, we have some correlation here. Why is the disk increasing like crazy! Why would that cause a node eviction…oh yeah…Node-pressure Eviction! Remember that old chest nut from the start of this story.

In Space, No one can hear you Clean

This makes no sense. We can see from Grafana that it is /var/lib/docker, let’s go have a look at what’s happening here

$ ssh vgqengapps01
$ sudo -s
$ df -h /var/lib/docker
Filesystem Size Used Avail Use% Mounted on/dev/mapper/docker-docker--pool 100G 71G 30G 71% /var/lib/docker

Ok, we have 30G available. So what is using the storage?

$ du -hs /var/lib/docker/* | sort -h0 /var/lib/docker/plugins
0 /var/lib/docker/swarm
0 /var/lib/docker/tmp
0 /var/lib/docker/trust
180K /var/lib/docker/network
26M /var/lib/docker/image
203M /var/lib/docker/volumes
5.3G /var/lib/docker/containers
36G /var/lib/docker/overlay2

Huh!?! Ok, I did not get an A+ for mathematics but 36G+5.3G does not equal 71G. Luckily I come from a Linux background and I have seen this gremlin before and I have a clue what it could be.

Hey Linux! What files do you have that are Open? Get it… Open! This guy…

$ lsof /var/lib/docker/ > /tmp/lsof-docker.txt
$ cat /tmp/lsof-docker.txt | grep deleted | head

Check out what popped up

Hello filebeat!

Ok! Now we are getting somewhere. Those files look veeeeery familiar. Remember our log files from earlier. Remember when I took all that time to explain stuff to you! It’s all coming together now, isn’t it.

For an explanation from a smarter person as to why deleted files are taking up space, go here

Let’s use some other fancy commands quickly. I want to see all the processes that are still holding on files.

This command will group, by the first column, all the processes holding onto files

$ cat /tmp/lsof-docker.txt | grep deleted | awk  'NR>1{arr[$1]++}END{for (a in arr) print a, arr[a]}'filebeat 575

That tells us that the process “filebeat” is holding onto 575 files.

And how much storage are all these deleted files taking up? Run this command to add up all the numbers in column 7

$ cat /tmp/lsof-docker.txt | grep deleted | awk '{print $7}' | awk '{ sum += $1 } END { print sum }'57600322165

57600322165 bits. Great, what is that in human speak?

$ echo "57600322165/(1024*1024*1024)" | bc
53

Wow! Those files are taking up 53G. And it’s all by the same process… filebeat

Turn the Beat Around

At LSD, we use the Elastic Cloud on Kubernetes (ECK) to centrally gather all logs because remember we rotate log files from pods. Did anyone of you think to ask, what happens to those old logs once we use 501Mb of logging? Ironically they don’t get deleted (thanks Filebeat), jokes aside they are in fact deleted but they are also collected by the ECK and ingested into a central Elasticsearch instance that we use for log retention and analytics.

ECK is made of up 3 major components:

  • Elasticsearch — Time series data store
  • Kibana — Visualization tool for Elasticsearch
  • Filebeat — Log collection

Aaaah this is making sense now.

Remember these files (lots of remembering in this story)

$ cd /var/lib/docker/containers/d28698dcd8830663d2980c632d3fc36786b6b57d0841092da3f2e6f4d2552177$ ls -1sh
total 397M
16M d28698dcd8830663d2980c632d3fc36786b6b57d0841092da3f2e6f4d2552177-json.log
96M d28698dcd8830663d2980c632d3fc36786b6b57d0841092da3f2e6f4d2552177-json.log.1
96M d28698dcd8830663d2980c632d3fc36786b6b57d0841092da3f2e6f4d2552177-json.log.2
96M d28698dcd8830663d2980c632d3fc36786b6b57d0841092da3f2e6f4d2552177-json.log.3
96M d28698dcd8830663d2980c632d3fc36786b6b57d0841092da3f2e6f4d2552177-json.log.4

When those files are created by Docker, Filebeat picks them up, locks onto them and ingests them into Elasticsearch. But when Docker rotates the file, deleting it, Filebeat holds onto it.

A quick solution to this exact problem was simply to delete Filebeat on that affected node: vgqengapps01

$ kubectl get po --all-namespaces -owide | grep filebeat | grep vgqengapps01lsdobserve                          lsdobserve-filebeat-beat-filebeat-6gjtd                   1/1       Running     0          3d        172.26.23.203   vgqengapps01   <none>$ kubectl delete po lsdobserve-filebeat-beat-filebeat-6gjtd -n lsdobserve

Before I deleted the pod

$ df -h /var/lib/dockerFilesystem                       Size  Used Avail Use% Mounted on
/dev/mapper/docker-docker--pool 100G 71G 30G 71% /var/lib/docker

After I deleted the pod

$ df -h /var/lib/dockerFilesystem                       Size  Used Avail Use% Mounted on
/dev/mapper/docker-docker--pool 100G 17G 84G 17% /var/lib/docker

That is a difference of… you guessed it! 53G or 57600322165 bits

But wait, all is well now?

The point of this article wasn’t to pick on Filebeat. It was to highlight that many times we have open files that are not cleared up and how to find the process that is locking onto it. And that when you run out of storage on key mount points ( / & /var/lib/docker) you get DiskPressure which can have adverse affects on your cluster.

ECK does have a fix for this by the way

But yes, all is not well. It’s OK, but not super well. Remember this graph?

The crazy disk usage.

Turns out someone enabled DEBUG on their applications which was generating 16G of logs an hour. We will not mention names, but if you’re reading this, you know who you are!

How did we find that?

Thank you to Kibana and thanks to me for thinking about using ECK (I take all credit for this), I set the time and date to those spikes we see above and said show me the top 10 pods with the most logs, which revealed…

You dirty rat!

A pod called “vitality-error-management-service” generated 13 million (yes that’s million with a “mill”) logs in 8 hours, using 90% of all logging during that time.

Remember I said earlier we had resources requests set, here are the resources from that pod

Limits:
cpu: 600m
memory: 512Mi
Requests:
cpu: 50m
memory: 512Mi

Imagine if I had given it more CPU.

Armed with that information we can now go to application team and ask them to please switch off Debugging as it is generating 16G of logs an hour, or 384G of logs a day or 11Tb of logs a month :)

Are we living in a simulation?

Whenever something like this happens above, I personally would like to know the details of this situation. For example when my disk fill ups, how much time do I have to recover? How long does it take to evict pods? Does it try to clean up? How many times. So just for you guys that stuck around to the end I wanted to run through a simulation of the above scenario.

Let’s fill up a filesystem

$ df -h /var/lib/dockerFilesystem                       Size  Used Avail Use% Mounted on
/dev/mapper/docker-docker--pool 100G 20G 81G 20% /var/lib/docker

Commence with the filling…

$ dd if=/dev/zero of=/var/lib/docker/70G-tmp bs=1G count=7070+0 records in
70+0 records out
75161927680 bytes (75 GB) copied, 196.061 s, 383 MB/s

While I was doing that, in another sessions I was following the disk usage

$ df -h /var/lib/dockerFilesystem                       Size  Used Avail Use% Mounted on
/dev/mapper/docker-docker--pool 100G 64G 37G 64% /var/lib/docker
$ df -h /var/lib/dockerFilesystem Size Used Avail Use% Mounted on
/dev/mapper/docker-docker--pool 100G 94G 6.6G 94% /var/lib/docker
$ df -h /var/lib/dockerFilesystem Size Used Avail Use% Mounted on
/dev/mapper/docker-docker--pool 100G 88G 13G 88% /var/lib/docker
$ df -h /var/lib/dockerFilesystem Size Used Avail Use% Mounted on
/dev/mapper/docker-docker--pool 100G 84G 16G 85% /var/lib/docker

Check it out! We filled it up to 94% but then it suddenly dropped to 84%. Remember (yeah I am not done with that word) earlier we said that Kubernetes tries to clear up some storage and heal from this situation. Here we can see it doing exactly that. And you can get further details by describing the node

$ kubectl describe no vgqengapps01

It did in fact clear up some storage, but the “dd” command reclaimed it so quickly that it thinks it didn’t.

You can also see that the the status of DiskPressure is now True

In my case, after a several attempts at trying to Free Disk Space and failing we eventually get an “EvictionthresholdMet”

This will evict all pods from that node, which will look as follows

$ kubectl get po  -owide | grep vitality-error-management-service

And you can see exactly why that pod was evicted

$ kubectl describe po vitality-error-management-service-0-6-5mhgh

The 2 sections that are important are the Status

And the Events

And folks there you have it. Never again will you think about pod logs again. Never again will sleep soundly at night, knowing that a developer can ruin your day by simply putting his (her…) application into DEBUG mode and then turning their phone off!

And in closing

I wanted to highlight in this article how pod logs are something people do not care about and how it can have a very negative affect on your cluster. I also took the time an effort to explain the “open file” problem because finding the bad pod was half the problem, the easy half. Understanding why you have less and less filesystem storage available even when logs are being rotated and even when the command “du” is showing you everything is fine, can be tricky to wrap your head around.

I hope this was useful to someone, and maybe next time I will write another article on how local storage in a pod (not logging storage) can also make your life a painful. What is stopping someone from creating a 100G file with the “dd” command inside a container and then eating up all your ephemeral storage? I will leave you with that thought while you go to sleep tonight…

What is LSD?

If you saw the words “LSD” and were curious…well…

LSD is an open source focused company that helps companies along their journey into the cloud, or what we like to call the LSDtrip.

Mech Warrior Overlord @ LSD. I spend my days killing Kubernetes, operating Openshift, hollering at Helm, vanquishing Vaults and conquering Clouds!