Last year, the infrastructure team at Productboard took on the daunting task of migrating our existing Kubernetes infrastructure from a self-managed kOps cluster to Amazon’s EKS. In this article, we’ll describe the reasons for this decision, the hurdles we encountered along the way, and what we learned from them.
Our Kubernetes cluster was originally built some three years ago as the company was starting its first growth spurt, when the newly created Infrastructure team had to move our applications from a PaaS solution to a more scalable one. Our resident K8s wizard at the time had experience with kOps and decided to build the infrastructure on top of that.
Over the next two years, our business grew, and our infrastructure with it – eventually spanning three clusters, a range of 150-250 nodes, and around 2,000 containers. Unfortunately, we fell into the typical trap of neglecting our technological foundations.
There was a lack of shared knowledge about the inner workings of our kOps cluster, which surfaced after the original architect left the company in late 2020. The clusters worked without a hitch, but we were hesitant to make larger changes. We performed general maintenance and smaller upgrades but were aware of unresolved kOps issues, which blocked us from further updates.
In short, we grew complacent and went with the “if it ain’t broke, don’t fix it” strategy. This came back to bite us big time last September when we had our longest outage yet.
In mid-September, one of our infra engineers was doing some S3 optimization work, cleaning up larger buckets and setting up lifecycle rules on objects we didn’t need to store for years or wanted to move to Glacier.
One of the locations we targeted for improvement was our etcd backups bucket, which had backups dating all the way back to 2019, when we first spun up the cluster. We set up a lifecycle rule to clean up all files older than 30 days in the /backups directory, as we didn’t anticipate needing to roll back a cluster to an earlier state.
What we didn’t notice was that the /backups directory also held, among the first objects ever created, a /control directory containing two files: etcd-cluster-created and etcd-cluster-spec. (We were not alone in missing this.) These two files are critical for the etcd cluster: they contain the timestamp of when it was created, the member count, and the version. While not used during standard operation, they are referenced by the etcd manager when members lose their connection to each other and become disassociated from the cluster.
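To illustrate how a prefix-scoped expiration rule can sweep up more than intended, here is a minimal sketch in Python. The rule dictionary follows the general shape of an S3 lifecycle rule, but the matching function is a simplified local model rather than S3’s actual implementation, and the object keys and dates are hypothetical stand-ins for the layout described above.

```python
from datetime import date, timedelta

# Simplified stand-in for an S3 lifecycle rule: expire everything under a
# prefix once it is at least `Days` old. The rule ID is hypothetical.
RULE = {
    "ID": "expire-old-etcd-backups",
    "Filter": {"Prefix": "backups/"},
    "Status": "Enabled",
    "Expiration": {"Days": 30},
}


def expired_keys(objects, rule, today):
    """Return the keys a prefix-plus-age rule would expire.

    `objects` maps object key -> last-modified date.
    """
    prefix = rule["Filter"]["Prefix"]
    cutoff = today - timedelta(days=rule["Expiration"]["Days"])
    return sorted(
        key
        for key, modified in objects.items()
        if key.startswith(prefix) and modified <= cutoff
    )


# Hypothetical bucket contents: two long-lived control files created with the
# cluster, one old backup, and one recent backup.
objects = {
    "backups/control/etcd-cluster-created": date(2019, 3, 1),
    "backups/control/etcd-cluster-spec": date(2019, 3, 1),
    "backups/2019-03-02/etcd.backup": date(2019, 3, 2),
    "backups/2021-09-10/etcd.backup": date(2021, 9, 10),
}

doomed = expired_keys(objects, RULE, today=date(2021, 9, 15))
control_files_at_risk = [key for key in doomed if "/control/" in key]

for key in doomed:
    print("would expire:", key)
if control_files_at_risk:
    print("WARNING: rule also matches control files:", control_files_at_risk)
```

A dry run of this kind against a real object listing, before enabling the rule, is one way to catch files that are old by design rather than stale.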
Around a week after our backup retention rules were put in place, a small network issue caused etcd members to lose contact and fall back to the S3 files for cluster verification. When the members found no such files, they assumed there was no existing cluster and began the process of creating a fresh one.
At this point, the infra team started getting alerts about services going down, and upon connecting to the Kubernetes cluster to investigate, they found the cluster empty apart from the default kube-system services. All hands were on deck.
This is when the first painful consequence of our lack of knowledge appeared. We realized that even though we had some form of etcd backups, we had never really tested their validity or the procedure for recovering from them. In addition, we weren’t sure what the root cause was, and our first investigations pointed to a possible CNI issue, which would have ruled out a recovery anyway. As a result, the call was made to rebuild the cluster from scratch and redeploy all of our services.
We needed to redeploy all of our core infrastructure tools and then follow up with our production applications. This process took much longer than expected, as we encountered blockers stemming from previous manual steps on multiple occasions.
Manually created secrets, custom resources, and other fixes and updates had accumulated over the past couple of years without being reflected in code. This made the process of recovery extremely frustrating. We were down for a grand total of five hours. To rub salt in the wound, after the outage, we tested the etcd recovery and managed to recover the cluster in under 20 minutes.
Fortunately, the outage didn’t catch us completely unprepared. We had been discussing moving away from kOps for a couple of months by that time. We knew we weren’t happy with the complexity of the control plane and had already decided that the managed control plane in EKS was more suitable.
The research that we had already done allowed us to fast-track the migration, drop everything else, and start working on it. There were two main requirements: get it done as fast as possible and have no impact on our customers.
The first problem we ran into came during planning, when we were discussing how we would test the migration. Our staging cluster is not a perfect copy of our production cluster. Staging is missing some functionality that is present in production, such as customer spaces. This makes it impossible to validate the migration process on staging in its entirety.
Therefore, we made the decision to skip our staging cluster and focus the migration on our two other clusters: ops and production. Production is self-explanatory – it contains all of our customer-facing applications and infrastructure. The Ops cluster contains various infrastructure-related tooling, and, critically, our CI runners.
The Ops cluster was our first target, as it has no impact on production. An outage wouldn’t affect the currently deployed services on production but would hinder development for our engineers. We built an EKS Terraform module on top of the existing upstream terraform-aws-modules/eks/aws module to suit our needs. This part of the project went fairly smoothly, with the only annoying part being the process of importing the existing VPC, network, and security-related Terraform code from the existing kOps state files to new, separate states.
The fun began when we set our eyes on the Production cluster. The process was quite simple at first: create the new cluster and deploy all of our applications. The major obstacles we had to overcome were DNS management and the actual traffic switch.
Since we were using external-dns, our records were stored in a Route53 zone. We had to make two copies of the zone — one to be integrated with the cluster, the other to serve as a balancer between the two.
A balancer zone was needed since we decided to leverage Route53’s weighted routing to facilitate the traffic failover. This, while limited by TTL, allowed us to gradually let more traffic into the new cluster and still have the ability to fairly quickly pull the plug on the migration and roll back. The problem with this setup was that our application actively creates new ingress DNS records in the form of CNAMEs, so we had to sync the records in real time to prevent a split-brain scenario.
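As a rough illustration of how weighted routing supports a gradual switch, the sketch below builds weighted CNAME changes for two clusters and computes the share of DNS answers each would receive. With weighted records, each record is returned in proportion to its weight divided by the sum of the weights in the group. The hostnames, set identifiers, TTL, and rollout steps are all hypothetical and are not the values used in the migration; the dictionaries mirror the record-set shape accepted by the Route 53 API, but nothing here calls AWS.

```python
def traffic_share(weights):
    """Expected fraction of DNS answers per record: weight / sum of weights."""
    total = sum(weights.values())
    if total <= 0:
        # The all-zero case is deliberately not modeled in this sketch.
        raise ValueError("at least one weight must be positive")
    return {name: weight / total for name, weight in weights.items()}


def weighted_cname_change(name, target, set_identifier, weight, ttl=60):
    """Build one UPSERT change for a weighted CNAME record."""
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": name,
            "Type": "CNAME",
            "SetIdentifier": set_identifier,  # distinguishes records sharing a name
            "Weight": weight,                 # Route 53 accepts 0-255
            "TTL": ttl,                       # bounds how quickly a change takes effect
            "ResourceRecords": [{"Value": target}],
        },
    }


# Hypothetical rollout plan: (old cluster weight, new cluster weight).
ROLLOUT_STEPS = [(100, 0), (90, 10), (50, 50), (0, 100)]

for old_weight, new_weight in ROLLOUT_STEPS:
    changes = [
        weighted_cname_change(
            "app.example.com.", "old-cluster.example.com.", "old", old_weight
        ),
        weighted_cname_change(
            "app.example.com.", "new-cluster.example.com.", "new", new_weight
        ),
    ]
    # With boto3, `changes` could be submitted via
    # client.change_resource_record_sets(HostedZoneId=..., ChangeBatch={"Changes": changes}).
    share = traffic_share({"old": old_weight, "new": new_weight})
    print(f"old={share['old']:.0%} new={share['new']:.0%}")
```

Rolling back in this model means re-applying an earlier step, for example (100, 0); clients that cached an answer keep using it until the TTL expires, which is the limitation mentioned above.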
In the end, the migration itself went fairly smoothly, and in less than two hours, everything was migrated. Now it was time to reflect on why we ended up in this mess in the first place.
This migration was without a doubt one of the most complicated and intense experiences our team has had to overcome so far.
We had to compress a project we had originally planned to complete in six months into three. We had to work with our engineers to design and plan a migration strategy that would ensure our customers would have no idea it was actually happening.
We learned a lot along the way, and some of our most important lessons are outlined below.
Break knowledge silos by sharing knowledge, writing proper architecture documentation, and rotating responsibilities for various tasks.
It’s easy to fall into the trap of one person managing a specific tool or area of the stack and then solely dealing with issues around it. Make a conscious effort to spread issues among the team even at the expense of giving the issue to someone who will not complete it in the “optimal” time frame.
Run periodic docs reviews. We tend to move fast and focus on building and shipping new features. It’s important to dedicate time in the sprint, quarter, or some other time period to updating the documentation with these new features. It’s time-consuming, but it’s important and can save your bacon in the long run.
Having a safe environment where the infrastructure team can test large-scale changes is important. It’s crucial that this testing environment mirrors the functionality of the production one.
Ideally, an automated suite of tests would be run simulating the possible interactions a customer can have with your system, but manual tests will do as long as they reproduce the interactions done in production.
If this environment is missing or incomplete, you can never really make changes with 100% confidence that they won’t break something or that the customer won’t be affected.
Testing out recovery strategies is crucial. Run drills and test out various scenarios. Write proper guides and step-by-step procedures and give them to other engineers to try out. Make sure all of your on-call engineers can understand and follow these procedures.
Automate as much of the infrastructure as possible. Infrastructure as Code should be the mantra of your team. Over the years, small manual changes, edits, and fixes tend to pile up, and it’s easy to lose track of them.
Test the spin-up of a new cluster/environment from scratch to find out where you hit a wall, whether it’s a missing secret that was created manually or something else entirely, and implement the fixes for these problems. Some manual steps may be necessary, especially if we’re creating a cluster from scratch, so make sure these steps are outlined in the documentation.
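One lightweight way to surface manual changes before a rebuild is to compare what the code declares with what actually exists in the cluster. The sketch below does this with plain sets; the resource names are hypothetical, and in practice the two inputs would come from rendered manifests and from a listing of live cluster resources, neither of which is shown here.

```python
def find_drift(declared, live):
    """Compare resources declared in code with resources found in the cluster.

    Both arguments are sets of (kind, namespace, name) tuples.
    """
    return {
        # Present in the cluster but not in code: likely created by hand.
        "unmanaged": sorted(live - declared),
        # Present in code but not in the cluster: not applied, or deleted by hand.
        "missing": sorted(declared - live),
    }


# Hypothetical example data.
declared = {
    ("Deployment", "web", "frontend"),
    ("Secret", "web", "api-token"),
}
live = {
    ("Deployment", "web", "frontend"),
    ("Secret", "web", "api-token"),
    ("Secret", "web", "hotfix-credentials"),
}

drift = find_drift(declared, live)
for kind, namespace, name in drift["unmanaged"]:
    print(f"not in code: {kind} {namespace}/{name}")
```

Anything reported as unmanaged is a candidate for either codifying or documenting as a required manual step.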
We’d actually suggest setting a target “mean time to recovery” and working towards this goal. This will naturally force you to automate as much as possible and allow you to treat your clusters as cattle instead of pets.
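Tracking that goal can be as simple as averaging recovery durations from incidents and drills and comparing the result with the target. In the sketch below, the two durations are the figures mentioned earlier in this article (the five-hour outage, and the etcd restore test, taken at its 20-minute upper bound); the one-hour target is a hypothetical value chosen only for illustration.

```python
from datetime import timedelta


def mean_time_to_recovery(durations):
    """Average a list of recovery durations (timedelta objects)."""
    if not durations:
        raise ValueError("need at least one recovery duration")
    return sum(durations, timedelta()) / len(durations)


# Durations taken from this article: the full rebuild and the later restore test.
recoveries = [timedelta(hours=5), timedelta(minutes=20)]

# Hypothetical target, not a figure from the article.
target = timedelta(hours=1)

mttr = mean_time_to_recovery(recoveries)
print(f"MTTR: {mttr}, within target: {mttr <= target}")
```

Adding each new drill result to the list shows whether automation work is actually moving the average toward the target.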