ThunderStruck! When Clouds scare the hell out of you.

Neil White
14 min read · Mar 25, 2021

Disclaimer

Before I begin, I want to state that this is not an article to throw anybody under the bus or to blame any company. We are partners with all the major cloud providers and all the major open source vendors, and we value those partnerships. This is more a dark tale of horror and fear that could happen to each and every one of you, so beware and heed our warning…

On with the show

At LSD, where I work, one of the core values is freedom (and openness, that one is pretty cool too). Because LSD is an open source technology company with a core value of freedom, cryptocurrency is, to us, a match made in heaven. I am not going to delve into cryptocurrency here, as there are plenty of amazing articles about it, but I simply want to say that the idea of economic freedom through technology is something we strongly believe in. So when Ethereum Staking launched, it was something we jumped into quickly and happily.

I am fortunate to work with people far smarter than me, and a fun part of any new project is deciding how to build it. We could build it quickly and easily by deploying a virtual machine in the cloud somewhere and installing the Validator. But why make it so simple when you can use Kubernetes, and containers, and ArgoCD, and Terraform, and Let’s Encrypt and… you get the picture.

But why make it way harder than it needs to be? At LSD we preach the gospel of Multi Cloud and Hybrid Cloud. We spread the good word of containers and portability, we shout from the mountain tops about GitOps and ArgoCD, about Terraform and Infrastructure as Code, and about cabbages and kings. We believe that no one should ever be locked into a technology. Freedom means being portable: able to move your tech whenever you feel like it, regardless of the reason.

This freedom does come at a cost, a cost in energy, time and complexity. But as I said earlier, I work with a lot of smart people, so the complex stuff is their problem.

Plans are worthless, but planning is everything

Once we decided to start staking, we quickly agreed that where the validators run shouldn’t matter greatly because we should be able to move them if we wanted. So this was our plan:

  • Terraform to provision the infrastructure
  • Kubernetes to manage our containers (we containerize everything)
  • Helm to package and template our application deployments on Kubernetes
  • ArgoCD to manage our deployments within Kubernetes
  • Let’s Encrypt to make it all ingress secure and safe

With these components we ticked the following boxes:

  • Infrastructure as Code
    No more asking people how stuff was deployed, we could look at code
  • Portability
    Every cloud has a Kubernetes offering, so containers were not a problem
  • GitOps
    All deployment information will be stored in Git as the source of truth
  • Security
    All certificates will be managed by Let’s Encrypt, you beautiful company you

Happy with all those elements, we decided to start with Google Cloud because we have been using them from the very beginning and had a lot of infrastructure running there already.

Fast forward 2 months later and we have our own dedicated project on Google Cloud, running Kubernetes, built by Terraform, enforced by ArgoCD, secured by Let’s Encrypt, running our Ethereum Validators…when one morning…

while I nodded, nearly napping, suddenly there came a tapping,
As of someone gently rapping, rapping at my chamber door.

(it was an email, but there ain’t much poetry out there about email ok‽)

They say a picture is worth a thousand words, so I will show you the email we received as a picture.

You have been warned!

Project names and project IDs have been removed to protect the innocent.

Now I want to make it clear: we were not using Google Cloud to do any crypto mining. I can only assume that would be an extremely expensive endeavor, and it is something we really do not care for. We were not worried, so we clicked the magic appeal button.

I’m innocent, I tell you!

And like I said, I was not too worried. We got the warning, we clicked on appeal, we were not doing anything wrong and were happy to help get to the bottom of it. But then sh*t went sideways…

Suspended!

Here is the timeline of events (all happening in less than 1 hour):

10:46am — Warning
11:02am — Appeal requested
11:39am — project suspended

It is at this point our story becomes a bit murky, where uncertainty creeps in and you start questioning your life decisions.

We have multiple projects on Google Cloud, so I want to make it clear to all two people reading this, that Google only suspended the project that was running our Ethereum Staking Validators. All our other projects were fine. But what does suspended mean?

We were learning about this in real time, but here is the TL;DR. You cannot do anything! You cannot access the Kubernetes Engine, you cannot get hold of your data, you cannot control any traffic. Simply put, anything in that project is dead to you. If you tried to go to the project you were redirected to a page that said the project was suspended.

Come to think of it, we did not use this project for any DNS, so I have no idea how DNS would have been affected, i.e. would our DNS lookups still have worked? But from what we experienced I would assume the suspension would have killed that ability as well, which would have been catastrophic for us. (Mental note: put DNS in its own project, thereby making it safe from any overlord interference.)
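If we ever act on that mental note, a minimal Terraform sketch could look something like the below. To be clear, this is hypothetical and not from our actual setup: the project name, zone, domain and the google_org_id variable are all made up for illustration.

resource "google_project" "dns" {
  name       = "dns-only"            # illustrative project that holds nothing but DNS
  project_id = "dns-only-XXXXXXXX"   # illustrative ID
  org_id     = var.google_org_id     # assumed variable, not in our tfvars above
}

resource "google_dns_managed_zone" "primary" {
  project  = google_project.dns.project_id
  name     = "primary-zone"
  dns_name = "example.com."          # illustrative domain
}

That way a suspension of a workload project cannot take name resolution down with it.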

And here is where the problems began. When you are staking Ethereum you cannot have the same Validator running in two places at once, or you will be slashed. So what could we do? We could wait for Google to unsuspend the project, or we could rebuild somewhere else.

But what if Google only unsuspends the project in a couple of days’ time? Every passing hour we are down we lose staking rewards, which means we lose actual Ether.

Or what if we build a new project, get our Validators up and running again, and then Google unsuspends the original project and we get slashed for running duplicate Validators? Which is even worse! Unless Google only unsuspends it in a couple of weeks, in which case waiting would have been even worse!!

I tried calling Google to get some help on this, but didn’t have much success. Their on-hold music is a circus theme, which is not the best sound to hear when you are stressed out and looking for support. Just saying…

And this was the heart of our problem. We were in a position where we could not make an informed decision because we had never been in this situation before. I also understand that Google didn’t target us; they picked up an anomaly on our project, which triggered an automatic suspension while we sat there helpless.

And this is what we wanted to highlight — we were at the mercy of Google.

You’ve been……THUNDERSTRUCK!

Buy it, use it, break it, fix it, trash it, change it, mail — upgrade it

We put a clock on this issue, and decided that if the project was not unsuspended within 6 hours we would re-platform, which is what we ended up doing.

When the deadline arrived, we pulled the trigger!

$ gcloud projects delete XXXXXXXX-258916
Your project will be deleted.

Do you want to continue (Y/n)?  y

Deleted [https://cloudresourcemanager.googleapis.com/v1/projects/XXXXXXXX-258916].

You can undo this operation for a limited period by running the command below.
    $ gcloud projects undelete XXXXXXXX-258916

See https://cloud.google.com/resource-manager/docs/creating-managing-projects for information on shutting down projects.

And this is where our earlier decisions paid off. We quickly decided to change clouds, moving from Google to Amazon. We figured that if we simply rebuilt on Google we would just be suspended again, so Amazon here we go!

Terraform to the Rescue!

When we built our original Google GKE cluster we only required 4 files for Terraform to work:

  • backend.tf (we store our state in a Google Cloud Storage bucket)
  • variables.tf (defining our input variables)
  • terraform.tfvars (our actual variables)
  • main.tf (the actual file that will build our GKE cluster)

And you want to see how simple they are?

backend.tf

terraform {
  backend "gcs" {
    bucket      = "XXXXXXXX-terraform-state"
    prefix      = "cluster-01"
    credentials = "XXXXXXXX-3d84f0b78bdc.json"
  }
}

variables.tf

variable "google_project_name" {
  type = string
}

variable "google_project_id" {
  type = string
}

variable "google_gke_cluster_name" {
  type = string
}

variable "google_gke_cluster_location" {
  type = string
}

variable "google_bucket" {
  type = string
}

# Sets the number of nodes per zone in the pool. For a regional cluster this is
# multiplied by 3 (one per zone): set it to 1 and you get 3 nodes, set it to 2 and you get 6.
variable "google_node_count" {
  type = number
}

terraform.tfvars

google_project_name         = "XXXXXXX"
google_project_id           = "XXXXXXX"
google_gke_cluster_name     = "cluster-01"
google_gke_cluster_location = "europe-west4"
google_bucket               = "XXXXXXX"
google_node_count           = 1

main.tf (110 lines)

provider "google" {
  credentials = file("XXXXXX-3d84f0b78bdc.json")
  project     = var.google_project_id
}

provider "google-beta" {
  credentials = file("XXXXXXX-3d84f0b78bdc.json")
  project     = var.google_project_id
}

resource "google_container_cluster" "primary" {
  provider = google-beta

  name     = var.google_gke_cluster_name
  location = var.google_gke_cluster_location

  remove_default_node_pool    = true
  initial_node_count          = 1
  enable_shielded_nodes       = true
  enable_intranode_visibility = true

  release_channel {
    channel = "RAPID"
  }

  master_auth {
    username = "kubeadmin"
    password = "XXXXXXXXXXX"

    client_certificate_config {
      issue_client_certificate = false
    }
  }

  network_policy {
    enabled  = true
    provider = "CALICO"
  }

  ip_allocation_policy {
    cluster_ipv4_cidr_block  = "192.168.0.0/18"
    services_ipv4_cidr_block = "192.168.64.0/18"
  }

  addons_config {
    http_load_balancing {
      disabled = false
    }
    horizontal_pod_autoscaling {
      disabled = true
    }
    istio_config {
      disabled = false
      auth     = "AUTH_MUTUAL_TLS"
    }
  }

  timeouts {
    create = "30m"
    update = "40m"
  }
}

resource "google_container_node_pool" "np-01" {
  name       = "node-pool-01"
  location   = var.google_gke_cluster_location
  cluster    = google_container_cluster.primary.name
  node_count = var.google_node_count

  management {
    auto_repair  = true
    auto_upgrade = true
  }

  node_config {
    preemptible  = true
    machine_type = "e2-medium"

    metadata = {
      disable-legacy-endpoints = "true"
    }

    oauth_scopes = [
      "https://www.googleapis.com/auth/logging.write",
      "https://www.googleapis.com/auth/monitoring",
    ]
  }
}

resource "google_container_node_pool" "np-02" {
  name       = "node-pool-02"
  location   = var.google_gke_cluster_location
  cluster    = google_container_cluster.primary.name
  node_count = var.google_node_count

  management {
    auto_repair  = true
    auto_upgrade = true
  }

  node_config {
    preemptible  = true
    machine_type = "e2-medium"

    metadata = {
      disable-legacy-endpoints = "true"
    }

    oauth_scopes = [
      "https://www.googleapis.com/auth/logging.write",
      "https://www.googleapis.com/auth/monitoring",
    ]
  }
}

With those 4 files we can spin up a GKE cluster in minutes, with the Terraform state stored in a globally accessible bucket.

But now we need to modify this for an Amazon EKS deployment.

backend.tf

# List of Regions: https://docs.aws.amazon.com/general/latest/gr/rande.html
terraform {
  backend "s3" {
    bucket = "XXXXXXXXX-terraform"
    key    = "cluster-01/terraform/state"
    region = "eu-west-1"
  }
}

variables.tf

variable "aws_region" {
  type = string
}

variable "aws_vpc_name" {
  type = string
}

variable "aws_eks_cluster_name" {
  type = string
}

variable "aws_eks_k8s_version" {
  type = string
}

variable "aws_eks_k8s_instance" {
  type = string
}

terraform.tfvars

aws_region           = "eu-west-1"
aws_vpc_name         = "XXXXXXXXXX"
aws_eks_cluster_name = "cluster-01"
aws_eks_k8s_version  = "1.19"
# https://aws.amazon.com/ec2/instance-types/
aws_eks_k8s_instance = "t3.medium"

main.tf (70 lines)

provider "aws" {
  region = var.aws_region
}

data "aws_eks_cluster" "cluster" {
  name = module.eks.cluster_id
}

data "aws_eks_cluster_auth" "cluster" {
  name = module.eks.cluster_id
}

provider "kubernetes" {
  host                   = data.aws_eks_cluster.cluster.endpoint
  cluster_ca_certificate = base64decode(data.aws_eks_cluster.cluster.certificate_authority.0.data)
  token                  = data.aws_eks_cluster_auth.cluster.token
}

data "aws_availability_zones" "available" {}

locals {
  cluster_name = var.aws_eks_cluster_name
}

module "vpc" {
  source = "terraform-aws-modules/vpc/aws"

  name = var.aws_vpc_name
  cidr = "172.16.0.0/16"

  azs             = data.aws_availability_zones.available.names
  private_subnets = ["172.16.1.0/24", "172.16.2.0/24", "172.16.3.0/24"]
  public_subnets  = ["172.16.4.0/24", "172.16.5.0/24", "172.16.6.0/24"]

  enable_nat_gateway   = true
  single_nat_gateway   = true
  enable_dns_hostnames = true

  public_subnet_tags = {
    "kubernetes.io/cluster/${local.cluster_name}" = "shared"
    "kubernetes.io/role/elb"                      = "1"
  }

  private_subnet_tags = {
    "kubernetes.io/cluster/${local.cluster_name}" = "shared"
    "kubernetes.io/role/internal-elb"             = "1"
  }
}

module "eks" {
  source = "terraform-aws-modules/eks/aws"

  cluster_name    = local.cluster_name
  cluster_version = var.aws_eks_k8s_version
  subnets         = module.vpc.private_subnets
  vpc_id          = module.vpc.vpc_id

  node_groups = {
    first = {
      desired_capacity = 1
      max_capacity     = 10
      min_capacity     = 1

      instance_type = var.aws_eks_k8s_instance
    }
  }

  write_kubeconfig   = true
  config_output_path = "./"
}

To execute that Terraform, you run the following:

terraform init
terraform plan
terraform apply

For those not familiar with the above: the “terraform apply” created an EKS cluster for us, running in Western Europe in an autoscaling configuration, starting with 1 worker node and able to expand to 10 worker nodes automatically.

70 lines of code got me a highly available Kubernetes cluster running in Europe, in about 20 minutes!
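To actually talk to the new cluster you can use the kubeconfig the EKS module writes out (that is what write_kubeconfig and config_output_path above are for), or simply ask the AWS CLI for one. For example:

# Fetch credentials for the freshly built cluster (names match our tfvars above)
aws eks update-kubeconfig --region eu-west-1 --name cluster-01

# Quick sanity check
kubectl get nodes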

And how long did it take me to put together the new EKS configuration for Terraform? About 1 hour! Unfortunately Discord doesn’t save the exact times, but below is a screenshot from our conversation during the operation. Again, I am only demonstrating how quickly this went.

Sensitive information has been removed to protect the innocent.

I want to point out that I am not a smart guy; the reason this was quick and easy is that Terraform makes it quick and easy.

HELM 108!

With EKS up and running we can now use Helm to deploy all our GitOps application stacks.

First things first, we set up our Ingress. We chose Traefik and ran the following 4 lines (yes, 4 lines):

helm repo add traefik https://helm.traefik.io/traefik
helm repo update
helm show values traefik/traefik > traefik.values.yaml
helm upgrade -i --namespace traefik traefik traefik/traefik --create-namespace --values traefik.values.yaml

Boom, we got Ingress!

Next we install Cert-manager so we can utilize Let’s Encrypt (yes, 4 lines again):

helm repo add jetstack https://charts.jetstack.io
helm repo update
helm show values jetstack/cert-manager > cert-manager.values.yaml
helm upgrade -i cert-manager jetstack/cert-manager --namespace cert-manager --create-namespace --values cert-manager.values.yaml

Boom! We got Cert-manager!

Now we install ArgoCD. Guess what I am going to do… in 4 lines…

helm repo add argo https://argoproj.github.io/argo-helm
helm repo update
helm show values argo/argo-cd > argocd.values.yaml
helm upgrade -i --namespace argocd argocd argo/argo-cd --create-namespace --values argocd.values.yaml

Boom!! ArgoCD!!

And just like that we have all our supporting applications ready to go. That took about 2 minutes.
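If you want to see it all for yourself, a quick sanity check (not strictly necessary, just reassuring):

helm list --all-namespaces
kubectl get pods --namespace traefik
kubectl get pods --namespace cert-manager
kubectl get pods --namespace argocd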

Big Baadaa Boom!

Argo, Argo, it’s off to work we Argo

Next we can deploy our Validator by simply using an ArgoCD Application, thanks to the Application CRD.

Here is an example of an ArgoCD Application that deploys our application from GitHub and keeps it in sync, so that changes made to the source code are applied to the deployed application. Or, in a word, “GitOps”.

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: prysm-validator
  namespace: argocd
spec:
  destination:
    namespace: eth2
    server: https://kubernetes.default.svc
  project: staking-prod
  source:
    helm:
      valueFiles:
        - values.yaml
    path: argocd/charts/test/prysm-validator
    repoURL: git@github.com:FAKENAME-prod/api.cluster-01.FAKENAME-prod.git
    targetRevision: main
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true

And here is the structure of our Github repo

Which means we can apply all our ArgoCD Applications with a single command!

$ kubectl apply -R -f argocd/

And just like that, ladies and gentlemen, we have our application stack restored and making sweet, sweet Ether.
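If you are impatient (we were), you can watch ArgoCD converge with something like:

# Watch the Application resources come into sync
kubectl get applications --namespace argocd --watch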

Didn’t I mention something about Let’s Encrypt?

Did I mention how much I love Let’s Encrypt? Check out how we automatically have all our Ingress traffic secured by Let’s Encrypt through cert-manager’s ACME integration.

Check out these Issuers!

I got issues man

Check out these Requests and Orders and Certificates!!
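For anyone who cannot make out the screenshots: the Issuers are just cert-manager resources pointing at Let’s Encrypt’s ACME endpoint. As an illustrative sketch (not our actual manifest, and the name, email and ingress class are placeholders), a ClusterIssuer looks roughly like this:

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod                 # illustrative name
spec:
  acme:
    email: ops@example.com               # illustrative email
    server: https://acme-v02.api.letsencrypt.org/directory
    privateKeySecretRef:
      name: letsencrypt-prod-account-key # Secret that stores the ACME account key
    solvers:
      - http01:
          ingress:
            class: traefik               # we route everything through Traefik

Once that is in place, “kubectl get clusterissuers” and “kubectl get certificates,certificaterequests,orders --all-namespaces” list roughly what the screenshots show.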

Recap Time!

So how long did it all take to re-platform our entire Ethereum Staking Validator project from Google Cloud to Amazon Web Services? I am going to round these numbers up to be generous:

  • Destroy project in Google Cloud (1 minute)
  • Rework the Terraform code to build Amazon EKS (2 hours)
  • Terraform to build EKS (30 minutes)
  • Deploy Ingress + Cert-manager + ArgoCD (5 minutes)
  • Apply the ArgoCD Application CRs (1 minute)
  • Wait for ArgoCD to finish doing its thing (5 minutes)

Therefore I declare that in under 3 hours, you too can move an entire application stack from one cloud to another by following the above practices, without being scared of being… ThunderStruck!

Closing Statement

If you have reached this part, wow, I am surprised. Someone actually read this? Now it is important to state that we were able to be this agile and mobile because of one very big constraint we did not have: persistent data. The project we are running does not require persistence, which made everything a million times easier. If we had had data in the suspended Google project, this article would never have been written. That said, it has highlighted to me that you want a plan for getting your data out of a cloud regardless of the reason, because it’s the reason you don’t think of that ends up biting you in the butt.

Thinking out loud here, I would probably back up my data to a separate cloud as an exit plan.
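For bucket-based data, that could be as simple as a scheduled one-way sync from one cloud’s object storage to another’s. A rough sketch, assuming gsutil has AWS credentials configured in its .boto file and with made-up bucket names:

# Mirror a GCS bucket into an S3 bucket in a different cloud
gsutil -m rsync -r gs://my-critical-data s3://my-critical-data-backup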

The reason I took the time to write this (I am actually a very busy guy) is to highlight to everyone that cloud lock-in/lock-out is real, but it doesn’t have to bring fear into your life. All the software we used here is 100% free and open source (yes, yes, the clouds are not free), and this can be achieved by anybody these days. This article also wasn’t designed to be a walkthrough with exact steps; it is a high-level overview of what we did and how we did it.

Again, to make it very clear: this is not an article to bad-mouth Google and praise Amazon. I’m personally a huge fan of Google. To be honest, we will probably re-platform to DigitalOcean and Azure next so we can test the waters (or raindrops) of the other clouds, and then probably move back to Amazon to take advantage of their spot instances.

This is the beauty of this design/architecture: it allows us to move around and explore, and it encourages us to use new technology, all while feeling confident and safe.

Oh yeah, Courage is another core value of LSD :)

What is LSD?

If you saw the words “LSD” and were curious….well….

LSD is an open source focused company that helps companies along their journey into the cloud, or what we like to call the LSDtrip.


Neil White

Mech Warrior Overlord @ LSD. I spend my days killing Kubernetes, operating Openshift, hollering at Helm, vanquishing Vaults and conquering Clouds!