Los Angeles DC1 intermittent service instability

Incident Report for Bandwagon Host Status

Resolved

All scheduled updates have been successfully rolled out. We will continue to closely monitor DC1 to make sure there are no outstanding issues.
Posted Jul 31, 2025 - 23:27 PDT

Update

We have successfully completed the first stage and will be closely monitoring the situation for the next 24 hours. If no new issues show up during that time, we will begin the next stage of updates.
Posted Jul 27, 2025 - 21:11 PDT

Identified

Over the past few days, we have been receiving reports of intermittent issues affecting customers in our Los Angeles DC1 facility (nodes v66XX). Symptoms ranged from elevated latency to occasional VM reboots.

Since the first reports came in, we have been investigating this issue non-stop and have given it absolute top priority, putting aside any other ongoing projects.

The main complication was that we were unable to reproduce this issue outside the production environment: artificially created dummy VMs running on exactly the same hardware and generating similar load did not trigger it. The only reliable data we have been able to establish is that the issue is related to a specific hardware configuration combined with a specific range of Linux kernel versions. Having no reliable way to reproduce it significantly slowed down development of a fix.

A fix is finally ready and we will be applying it in stages to all VMs running in DC1. Due to the scale of this update, we estimate that it will take us up to 7 days to complete.

We do not anticipate any service interruptions during this rollout. That said, should any disruption occur, we will do our best to minimize it.
Posted Jul 24, 2025 - 16:43 PDT