
Post-Mortem: November 22nd–23rd, 2024

Posted by: Till Klampaeckel
25 November 2024
🛠️ - post-mortem

So — downtime.

The bad news: Login and registration on Runway experienced downtime. The good news: Your applications kept running.

Let’s break down what happened and what we learned.

What Happened?

Last week, we set out to update our infrastructure, including operating system (OS) and Kubernetes upgrades. OS updates typically run smoothly (thanks, Flatcar), and Kubernetes upgrades, using k0s and k0sctl, have been similarly reliable for us.

This time, things didn’t go as planned.

We began with staging (as one does) and moved on to update a support cluster where critical services like authentication (via Kratos) run. The first version upgrades went fine, but the last one failed with an obscure error about pods in the kube-system namespace not being ready.

Upon investigation, we found CoreDNS stuck in a restart loop. It would start up, fail its health check, and restart — over and over. Given CoreDNS’s central role, it was easy to assume it was the culprit.

After carefully observing its startup behavior, we realized CoreDNS couldn’t connect to the Kubernetes API. Digging deeper, we saw that any pod attempting to talk to the API (in any namespace) encountered the same error:

Error while initializing connection to Kubernetes apiserver.
This most likely means that the cluster is misconfigured
(e.g., it has invalid apiserver certificates or service
accounts configuration).

Reason: Get "https://10.96.0.1:443/version": dial tcp 10.96.0.1:443: i/o timeout

Affected pods included ingress services, cert-manager, and monitoring components like the Grafana Agent.
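
For context, the check below is a minimal sketch (in Go, not our actual tooling) of the same kind of request: it hits the in-cluster API endpoint from the error message above with a short timeout.

// probe.go: sketch of an in-cluster API reachability check, mirroring the
// failing GET /version from the error above. The address 10.96.0.1:443 is
// the in-cluster service IP taken from our logs; adjust as needed.
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"time"
)

func main() {
	client := &http.Client{
		Timeout: 5 * time.Second, // fail fast instead of hanging on an i/o timeout
		Transport: &http.Transport{
			// Certificate verification is skipped for this connectivity check only;
			// real clients trust the service account CA bundle instead.
			TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
		},
	}

	resp, err := client.Get("https://10.96.0.1:443/version")
	if err != nil {
		fmt.Println("API unreachable:", err) // this is what we saw from every namespace
		return
	}
	defer resp.Body.Close()
	fmt.Println("API reachable, status:", resp.Status)
}

From any pod on the affected cluster, a request like this simply timed out, which is what shifted suspicion away from CoreDNS itself.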

Backstory

Kubernetes abstracts complex container orchestration across multiple servers, making them appear as a unified system. This abstraction is powerful but comes at the cost of complexity.

CoreDNS, for example, relies on the Kubernetes API to generate internal DNS entries like service.namespace.svc.cluster.local. These DNS entries enable service discovery (SD), allowing pods to communicate reliably even when their underlying IPs change.

Without SD, we’d be back to the “old days” of maintaining static IPs in tools like Netbox. This highlights why CoreDNS is a critical component — a failure here disrupts all service discovery.
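
As a tiny example of what that buys you, a pod can reach another service by name alone; the service name below is hypothetical, but the pattern is exactly what CoreDNS answers for.

// lookup.go: sketch of cluster-internal service discovery via DNS.
// "kratos.auth.svc.cluster.local" is a made-up example name; inside a pod,
// CoreDNS resolves such names using data it reads from the Kubernetes API.
package main

import (
	"fmt"
	"net"
)

func main() {
	// Resolves to the service's ClusterIP, which stays stable even as the
	// pods behind it come and go.
	addrs, err := net.LookupHost("kratos.auth.svc.cluster.local")
	if err != nil {
		// With CoreDNS down, every lookup like this one starts failing.
		fmt.Println("lookup failed:", err)
		return
	}
	fmt.Println("service resolves to:", addrs)
}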

But if CoreDNS is a single point of failure (SPOF), the Kubernetes API itself is the ultimate SPOF. Without it, critical operations grind to a halt.

Updates on Friday?!

I know what some of you might be thinking: “Updates on a Friday? Really? What were you thinking?”

Here’s the thing — we firmly believe in doing what needs to be done when it needs doing. Deployments happen continuously as changes are introduced to the main branch. Holding updates or features in a separate branch only adds complexity, increasing the chances of multiple things going wrong when everything is eventually rolled out together.

This particular update was driven by the need to take advantage of new infrastructure features. And let’s be honest, there’s never truly a “perfect time” for these things. However, we take pride in fostering an engineering culture where writing robust tests and instrumenting every layer of the stack is baked into our definition of done.

Infrastructure updates are inherently more complex than rolling back an application update, which is why we took a measured approach. We started with staging and testing environments earlier in the week, followed by the support cluster around noon on Friday, rather than diving straight into the full production setup that powers our customer apps.

While it wasn’t without its challenges, this approach ensured we could move forward methodically and minimize the risk to production environments.

Debugging

After the upgrade failed, we spent the rest of Friday debugging, trying various fixes with mixed results.

After hours of getting nowhere, we gave up and decided to rebuild the cluster on Saturday morning. That meant far too much downtime for comfort, and for that we deeply apologize.

Hacks & Workarounds

In the moment, one desperate solution was to manually override service discovery. Pods automatically receive environment variables for API access, most notably KUBERNETES_SERVICE_HOST and KUBERNETES_SERVICE_PORT.

By overriding these with working IP/PORT combinations in pod manifests, we could temporarily restore access to the API. But as always, your mileage may vary with such hacks.
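
To illustrate why that works, here is a minimal sketch of how an in-cluster client typically derives the API endpoint from those variables (client-go's in-cluster config does essentially this); it is an illustration, not our actual workaround.

// apihost.go: sketch of how in-cluster clients build the API server URL
// from the injected environment variables. Overriding the variables in a
// pod manifest therefore redirects API traffic to whatever address you set.
package main

import (
	"fmt"
	"net"
	"os"
)

func main() {
	host := os.Getenv("KUBERNETES_SERVICE_HOST")
	port := os.Getenv("KUBERNETES_SERVICE_PORT")
	if host == "" || port == "" {
		fmt.Println("not running in a cluster (or the variables were stripped)")
		return
	}

	// The URL a typical client would target for calls like GET /version.
	endpoint := "https://" + net.JoinHostPort(host, port)
	fmt.Println("API endpoint:", endpoint)
}

Any value that reaches a live kube-apiserver will do for the moment, but such an override should be removed as soon as cluster networking is healthy again.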

Context

As engineers, it’s hard to admit when the best solution is to start fresh. However, this is often the reality with Kubernetes.

We’ve been running containers for others since 2019 and have extensive experience with Docker, Docker Swarm, and Kubernetes. Yet, moments like these are humbling, especially when customers couldn’t log in or register on Runway — an outcome we always aim to avoid.

The silver lining? Despite this integral auth service being down, all user applications continued to run without interruption.

Lessons Learned

  1. Switching to Calico: When rebuilding the cluster, we migrated to Calico to unify the CNI across all clusters. Running Calico is not exactly fun either, but we have more experience with it than with kube-router/kube-proxy. This switch should in no way be perceived as an argument against kube-router; it is simply about what we are most familiar with.
  2. Testing Documentation: Our internal documentation, though sparse in places, proved effective during the rebuild. In hindsight, spending an hour to rebuild rather than troubleshooting endlessly would have been the better choice. In the heat of the moment, the simplest solution — starting fresh — often feels counterintuitive. When you’re deep in troubleshooting, your focus narrows, and the idea of rebuilding can feel like giving up rather than progressing. This is the human factor in downtime: the tendency to keep digging, to chase complexity over simplicity, and to resist hitting reset because it feels like admitting defeat.
  3. Post-Mortem Testing: We’re keeping the defunct cluster around to run tests and further our understanding of what went wrong.

We’ve also learned to prioritize faster recovery solutions when faced with critical infrastructure issues.

Thanks

To our customers: thank you for your patience and understanding. We’re sorry for the downtime, and we’re committed to applying the lessons from this experience to avoid such incidents in the future.

Here’s to focusing on more product updates and less infrastructure drama going forward!