Executive Summary: On May 6, 2026, an infrastructure failure in our Kubernetes environment caused SSO login failures for a subset of users for approximately 42 minutes. All other platform functionality was unaffected. The issue was caused by a loss of quorum in an internal cluster's control plane, compounded by a node that had been silently degraded. The incident was resolved by replacing the affected nodes, and the system has been operating normally since.
Incident Overview: The issue originated from an internal infrastructure cluster that acts as a connectivity bridge between our environments. Two of the cluster's three control plane nodes became non-functional: one had been in a degraded state, and a second failed during normal operations. With two of three nodes down, the cluster lost quorum and could no longer coordinate its workloads, which disrupted the network path required for SSO logins for some users. Some of our internal tooling was also affected. To resolve the issue, the team replaced the unresponsive nodes, which restored normal operations. Login functionality recovered shortly after.
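For illustration only (a generic sketch, not our internal tooling): Raft-based control plane stores such as etcd require a strict majority of members to coordinate work, which is why losing two of three nodes stalls the cluster even though one node remains healthy.

def quorum(cluster_size: int) -> int:
    """Minimum number of healthy members needed to form a majority."""
    return cluster_size // 2 + 1

def has_quorum(cluster_size: int, healthy_members: int) -> bool:
    """True when the surviving members can still elect a leader and commit writes."""
    return healthy_members >= quorum(cluster_size)

if __name__ == "__main__":
    size = 3
    for healthy in (3, 2, 1):
        print(f"{healthy}/{size} healthy -> need {quorum(size)}, "
              f"can coordinate: {has_quorum(size, healthy)}")
    # 3/3 healthy -> need 2, can coordinate: True
    # 2/3 healthy -> need 2, can coordinate: True
    # 1/3 healthy -> need 2, can coordinate: False  (the state during this incident)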
Impact: Some users logging into the platform via SSO experienced failures from approximately 2026-05-06T14:34Z to 2026-05-06T15:16Z (~42 minutes).
Detection: The incident was detected at 2026-05-06T14:34Z, when our automated monitoring identified connectivity failures in the login journey and alerted the operations team.
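As a rough sketch of how such a login-journey check can work (the probe URL and alerting hook below are illustrative assumptions, not our actual monitoring stack): a synthetic probe exercises the login path on a schedule and pages operations when it fails.

import urllib.request
import urllib.error

SSO_PROBE_URL = "https://sso.example.com/healthz"  # hypothetical endpoint, for illustration
TIMEOUT_SECONDS = 5

def probe_sso_login() -> bool:
    """Return True if the SSO login path responds successfully."""
    try:
        with urllib.request.urlopen(SSO_PROBE_URL, timeout=TIMEOUT_SECONDS) as resp:
            return resp.status == 200
    except (urllib.error.URLError, TimeoutError):
        return False

def alert_operations(message: str) -> None:
    """Placeholder for the paging/alerting integration (assumption)."""
    print(f"ALERT: {message}")

if __name__ == "__main__":
    if not probe_sso_login():
        alert_operations("SSO login journey probe failed")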
Response: Our operations team began investigating immediately. The affected control plane nodes were identified and replaced, with full recovery confirmed via automated tests at 2026-05-06T15:16Z. A status page update was posted during the incident.
Root Cause: Two of three control plane nodes in an internal cluster were unavailable at the same time: one had been silently degraded for some time, and a second failed independently during normal operations. The monitoring pipeline that should have detected the degraded node ahead of time was itself not functioning correctly. Because it only logged on failure and not on success, its silence was indistinguishable from normal operation, and the team had no way to know it had stopped working.
Follow-up Actions: We have updated the monitoring pipeline to emit success signals so that their absence can be alerted on, broadened overall control plane monitoring coverage, and introduced scheduled control plane node replacements as a preventive measure.
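A minimal sketch of the success-signal (heartbeat) pattern described above, assuming an illustrative file path, staleness threshold, and alert hook rather than our actual configuration: the pipeline records a timestamp after every successful run, and an independent check pages when that timestamp goes stale, so silence can no longer masquerade as health.

import time
from pathlib import Path

HEARTBEAT_FILE = Path("/var/run/node-health-pipeline.heartbeat")  # assumed path, for illustration
MAX_STALENESS_SECONDS = 600  # assumed threshold: alert after 10 minutes without a success signal

def record_success() -> None:
    """Called by the monitoring pipeline after each successful run."""
    HEARTBEAT_FILE.write_text(str(time.time()))

def check_heartbeat() -> bool:
    """Run independently of the pipeline; returns False when the success signal is stale or missing."""
    try:
        last_success = float(HEARTBEAT_FILE.read_text())
    except (FileNotFoundError, ValueError):
        return False  # no signal at all is treated as a failure
    return (time.time() - last_success) <= MAX_STALENESS_SECONDS

if __name__ == "__main__":
    if not check_heartbeat():
        print("ALERT: node-health pipeline has not reported success recently")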