Executive Summary: On May 6, 2026, an infrastructure failure in our Kubernetes environment caused SSO login failures for a subset of users for approximately 42 minutes. All other platform functionality was unaffected. The issue was caused by a loss of quorum in an internal cluster's control plane, compounded by a node that had been silently degraded. The incident was resolved by replacing the affected nodes, and the system has been operating normally since.
Incident Overview: The issue originated from an internal infrastructure cluster that acts as a connectivity bridge between our environments. Two of the cluster's three control plane nodes became non-functional: one had been in a degraded state, and a second failed during normal operations. With two of three nodes down, the cluster lost quorum and could no longer coordinate its workloads, which disrupted the network path required for SSO logins for some users. Some of our internal tooling was also affected. To resolve the issue, the team replaced the unresponsive nodes, which restored normal operations. Login functionality recovered shortly after.
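For illustration only (a generic sketch, not our internal tooling): Raft-based control plane stores such as etcd require a strict majority of members to coordinate work, which is why losing two of three nodes stalls the cluster even though one node remains healthy.

def quorum(cluster_size: int) -> int:
    """Minimum number of healthy members needed to form a majority."""
    return cluster_size // 2 + 1

def has_quorum(cluster_size: int, healthy_members: int) -> bool:
    """True when the surviving members can still elect a leader and commit writes."""
    return healthy_members >= quorum(cluster_size)

if __name__ == "__main__":
    size = 3
    for healthy in (3, 2, 1):
        print(f"{healthy}/{size} healthy -> need {quorum(size)}, "
              f"can coordinate: {has_quorum(size, healthy)}")
    # 3/3 healthy -> need 2, can coordinate: True
    # 2/3 healthy -> need 2, can coordinate: True
    # 1/3 healthy -> need 2, can coordinate: False  (the state during this incident)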
Impact: Some users logging into the platform via SSO experienced failures from approximately 2026-05-06T14:34Z to 2026-05-06T15:16Z (~42 minutes).
Detection: The incident was detected at 2026-05-06T14:34Z, when our automated monitoring identified connectivity failures in the login journey and alerted the operations team.
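As a rough sketch of how such a login-journey check can work (the probe URL and alerting hook below are illustrative assumptions, not our actual monitoring stack): a synthetic probe exercises the login path on a schedule and pages operations when it fails.

import urllib.request
import urllib.error

SSO_PROBE_URL = "https://sso.example.com/healthz"  # hypothetical endpoint, for illustration
TIMEOUT_SECONDS = 5

def probe_sso_login() -> bool:
    """Return True if the SSO login path responds successfully."""
    try:
        with urllib.request.urlopen(SSO_PROBE_URL, timeout=TIMEOUT_SECONDS) as resp:
            return resp.status == 200
    except (urllib.error.URLError, TimeoutError):
        return False

def alert_operations(message: str) -> None:
    """Placeholder for the paging/alerting integration (assumption)."""
    print(f"ALERT: {message}")

if __name__ == "__main__":
    if not probe_sso_login():
        alert_operations("SSO login journey probe failed")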
Response: Our operations team began investigating immediately. The affected control plane nodes were identified and replaced, with full recovery confirmed via automated tests at 2026-05-06T15:16Z. A status page update was posted during the incident.
Root Cause: Two of three control plane nodes in an internal cluster were unavailable at the same time: one had been silently degraded for some time, and a second failed independently during normal operations. The monitoring pipeline that should have detected the degraded node ahead of time was itself not functioning correctly. Because it only logged on failure and not on success, its silence was indistinguishable from normal operation, and the team had no way to know it had stopped working.
Follow-up Actions: We have updated the monitoring pipeline to emit success signals so that their absence can be alerted on, broadened overall control plane monitoring coverage, and introduced scheduled control plane node replacements as a preventive measure.
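A minimal sketch of the success-signal (heartbeat) pattern described above, assuming an illustrative file path, staleness threshold, and alert hook rather than our actual configuration: the pipeline records a timestamp after every successful run, and an independent check pages when that timestamp goes stale, so silence can no longer masquerade as health.

import time
from pathlib import Path

HEARTBEAT_FILE = Path("/var/run/node-health-pipeline.heartbeat")  # assumed path, for illustration
MAX_STALENESS_SECONDS = 600  # assumed threshold: alert after 10 minutes without a success signal

def record_success() -> None:
    """Called by the monitoring pipeline after each successful run."""
    HEARTBEAT_FILE.write_text(str(time.time()))

def check_heartbeat() -> bool:
    """Run independently of the pipeline; returns False when the success signal is stale or missing."""
    try:
        last_success = float(HEARTBEAT_FILE.read_text())
    except (FileNotFoundError, ValueError):
        return False  # no signal at all is treated as a failure
    return (time.time() - last_success) <= MAX_STALENESS_SECONDS

if __name__ == "__main__":
    if not check_heartbeat():
        print("ALERT: node-health pipeline has not reported success recently")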