So, downtime.
The bad news: Login and registration on Runway experienced downtime. The good news: Your applications kept running.
Let’s break down what happened and what we learned.
Last week, we set out to update our infrastructure, including operating system (OS) and Kubernetes upgrades. OS updates typically run smoothly (thanks, Flatcar), and Kubernetes upgrades, using k0s and k0sctl, have been similarly reliable for us.
This time, things didn’t go as planned.
We began with staging (as one does) and moved on to update a support cluster where critical services like authentication (via Kratos) run. The first version upgrades went fine, but the last one failed with an obscure error about pods in the kube-system namespace not being ready.
Upon investigation, we found CoreDNS stuck in a restart loop. It would start up, fail its health check, and restart, over and over. Given CoreDNS’s central role, it was easy to assume it was the culprit.
After carefully observing its startup behavior, we realized CoreDNS couldn’t connect to the Kubernetes API. Digging deeper, we saw that any pod attempting to talk to the API (in any namespace) encountered the same error:
Error while initializing connection to Kubernetes apiserver.
This most likely means that the cluster is misconfigured
(e.g., it has invalid apiserver certificates or service
accounts configuration).
Reason: Get "https://10.96.0.1:443/version": dial tcp 10.96.0.1:443: i/o timeout
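The `i/o timeout` in that message points at plain TCP connectivity rather than certificates or service accounts. A minimal sketch of how one might tell a timeout apart from a refused connection from inside a pod, assuming Python is available in the container; `probe_tcp` is a hypothetical helper, not something from our tooling:

```python
import socket


def probe_tcp(host, port, timeout=3.0):
    """Try a plain TCP connect and classify the outcome.

    Returns "ok", "timeout", "refused", or "error: <detail>".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except socket.timeout:
        # Packets are silently dropped somewhere on the path.
        return "timeout"
    except ConnectionRefusedError:
        # The host answered, but nothing listens on that port.
        return "refused"
    except OSError as exc:
        return f"error: {exc}"
```

Calling `probe_tcp("10.96.0.1", 443)` targets the service IP from the error above. A result of `"timeout"` suggests packets are being dropped before they reach the API server, which is a different problem from a TLS or authentication failure.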
Examples of pods that failed include ingress services, cert-manager or monitoring components like a Grafana Agent.
Kubernetes abstracts complex container orchestration across multiple servers, making them appear as a unified system. This abstraction is powerful but comes at the cost of complexity.
CoreDNS, for example, relies on the Kubernetes API to generate internal DNS entries like service.namespace.svc.cluster.local. These DNS entries enable service discovery (SD), allowing pods to communicate reliably even when their underlying IPs change.
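To make the naming scheme concrete, here is a small sketch that builds such a name and tries to resolve it. The helper names are hypothetical, and the `kratos` service in an `auth` namespace in the usage note is an illustrative assumption, not our actual layout:

```python
import socket


def service_fqdn(service, namespace, cluster_domain="cluster.local"):
    """Build the in-cluster DNS name for a Service."""
    return f"{service}.{namespace}.svc.{cluster_domain}"


def resolve(name):
    """Return the sorted unique IPs for name, or [] if resolution fails."""
    try:
        infos = socket.getaddrinfo(name, None)
    except socket.gaierror:
        return []
    # Each entry ends with a sockaddr tuple whose first item is the IP.
    return sorted({info[4][0] for info in infos})
```

Inside a healthy cluster, `resolve(service_fqdn("kratos", "auth"))` would return the Service's cluster IP. When CoreDNS is down, the same call returns an empty list, which is roughly what every pod relying on service discovery experiences.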
Without SD, we’d be back to the “old days” of maintaining static IPs in tools like Netbox. This highlights why CoreDNS is a critical component: a failure here disrupts all service discovery.
But if CoreDNS is a single point of failure (SPOF), the Kubernetes API itself is the ultimate SPOF. Without it, critical operations grind to a halt.
I know what some of you might be thinking: “Updates on a Friday? Really? What were you thinking?”
Here’s the thing: we firmly believe in doing what needs to be done when it needs doing. Deployments happen continuously as changes are introduced to the main branch. Holding updates or features in a separate branch only adds complexity, increasing the chances of multiple things going wrong when everything is eventually rolled out together.
This particular update was driven by the need to take advantage of new infrastructure features. And let’s be honest, there’s never truly a “perfect time” for these things. However, we take pride in fostering an engineering culture where writing robust tests and instrumenting every layer of the stack is baked into our definition of done.
Rolling back an infrastructure update is inherently more complex than rolling back an application update, which is why we took a measured approach. We started with staging and testing environments earlier in the week, followed by the support cluster at noon on Friday, rather than diving straight into the full production setup that powers our customer apps.
While it wasn’t without its challenges, this approach ensured we could move forward methodically and minimize the risk to production environments.
After the upgrade failed, we spent the rest of Friday debugging with tcpdump, going through iptables rules, and discovering nuances between iptables and nftables along the way.
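As a general illustration of the iptables/nftables split: recent iptables releases ship two backends, legacy and nf_tables, and rules written through one are not shown by the other. A minimal sketch, assuming iptables 1.8 or newer (which prints its backend in `iptables --version`), of checking which backend a host uses. The function names are hypothetical, and this is not a record of the exact commands we ran:

```python
import re
import subprocess


def parse_backend(version_output):
    """Extract the backend from `iptables --version` output.

    iptables 1.8+ prints e.g. "iptables v1.8.7 (nf_tables)" or
    "iptables v1.8.7 (legacy)". Returns None if no backend is shown.
    """
    match = re.search(r"\((nf_tables|legacy)\)", version_output)
    return match.group(1) if match else None


def detect_backend(binary="iptables"):
    """Run `<binary> --version` on this host and parse the result."""
    try:
        result = subprocess.run(
            [binary, "--version"], capture_output=True, text=True, check=False
        )
    except FileNotFoundError:
        # The binary is not installed on this host.
        return None
    return parse_backend(result.stdout)
```

If the host and the component that programs service rules disagree on the backend, a rule listing taken from one side can look complete while traffic is actually evaluated against the other.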
Here’s what worked and what didn’t:
After hours of debugging on Friday, we gave up and decided to rebuild the cluster on Saturday morning. That was far too much downtime for comfort, and for that we deeply apologize.
In the moment, one desperate solution was to manually override service discovery. Pods automatically receive the following environment variables for API access:
KUBERNETES_SERVICE_HOST
KUBERNETES_SERVICE_PORT
KUBERNETES_PORT
By overriding these with working IP/PORT combinations in pod manifests, we could temporarily restore access to the API. But as always, your mileage may vary with such hacks.
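A rough sketch of what such an override could look like when applied to a Pod manifest held as a Python dict. The helper name `override_api_env` and the address `192.0.2.10:6443` in the usage note are illustrative assumptions. Whether TLS verification accepts the new address depends on the names in the API server certificate:

```python
import copy


def override_api_env(pod_manifest, host, port):
    """Return a copy of a Pod manifest with the API env vars set explicitly."""
    overrides = {
        "KUBERNETES_SERVICE_HOST": host,
        "KUBERNETES_SERVICE_PORT": str(port),
        "KUBERNETES_PORT": f"tcp://{host}:{port}",
    }
    patched = copy.deepcopy(pod_manifest)  # leave the caller's dict untouched
    for container in patched["spec"]["containers"]:
        # Drop any existing entries for these names, then append ours.
        kept = [e for e in container.get("env", []) if e["name"] not in overrides]
        container["env"] = kept + [
            {"name": name, "value": value} for name, value in overrides.items()
        ]
    return patched
```

For example, `override_api_env(pod, "192.0.2.10", 6443)` would point every container in `pod` at an API server endpoint reachable directly on that address. This only patches containers listed under `spec.containers`; init containers would need the same treatment.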
As engineers, it’s hard to admit when the best solution is to start fresh. However, this is often the reality with Kubernetes.
We’ve been running containers for others since 2019 and have extensive experience with Docker, Docker Swarm, and Kubernetes. Yet, moments like these are humbling, especially when customers couldn’t log in or register on Runway, an outcome we always aim to avoid.
The silver lining? Despite the downtime of the integral auth service, all user applications continued to run without interruption.
We’ve also learned to prioritize faster recovery solutions when faced with critical infrastructure issues.
To our customers: thank you for your patience and understanding. We’re sorry for the downtime, and we’re committed to applying the lessons from this experience to avoid such incidents in the future.
Here’s to focusing on more product updates and less infrastructure drama going forward!