Postmortem: March 1 Infrastructure Outage
Executive Summary
On March 1, 2025, XRPL Labs experienced a significant service disruption affecting the Xaman app backend, XRPL Cluster public nodes, and Xahau public nodes and hubs. The incident began with a networking switch malfunction in our primary data center that created a cascade of failures across multiple infrastructure layers. Services were impacted for varying durations, from 5 minutes to 2 hours. While our infrastructure experienced downtime, the XRP Ledger and Xahau networks themselves remained fully operational during this period.
Root Cause: A network switch began flooding packets, triggering protective measures that unexpectedly cascaded into virtualization cluster failures and loss of network connectivity for critical services.
Resolution: The issue was resolved by decoupling the virtualization cluster nodes and manually restoring network connectivity to virtual machines.
Detailed Timeline and Technical Analysis
Initial Incident (Late Afternoon CET, March 1)
- Our monitoring systems detected increasing latency across switch ports in the primary data center
- A networking switch began producing an abnormally high volume of network packets, effectively flooding the infrastructure
- We made the decision to remove the problematic switch from production
Cascade of Failures
Network Reconfiguration Issues
- When the switch was removed, RSTP (Rapid Spanning Tree Protocol) began recalculating network routes
- RSTP unexpectedly shut down additional switch ports that were actually functioning normally
- This created a brief disconnection of infrastructure management interfaces (Dell iDRAC and Proxmox management VLANs)
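For illustration, here is a minimal sketch of the kind of reachability check that can flag a management-plane loss like this one. The host names, addresses, and alerting behaviour are hypothetical placeholders, not our actual configuration; it only assumes the standard Linux ping binary.

```python
#!/usr/bin/env python3
"""Ping a list of management interfaces (iDRAC, Proxmox management VLAN)
and report any that stop answering. Hosts below are placeholders."""
import subprocess

MANAGEMENT_HOSTS = {
    "idrac-node01": "10.0.10.11",     # hypothetical iDRAC address
    "pve-mgmt-node01": "10.0.20.11",  # hypothetical Proxmox management address
}

def is_reachable(ip: str) -> bool:
    # One ICMP echo with a one-second timeout via the system ping binary.
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "1", ip],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

def main() -> None:
    unreachable = [name for name, ip in MANAGEMENT_HOSTS.items() if not is_reachable(ip)]
    if unreachable:
        # In practice this would page the on-call engineer instead of printing.
        print("Management interfaces unreachable:", ", ".join(unreachable))

if __name__ == "__main__":
    main()
```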
Virtualization Cluster Failure
- When management interfaces came back online, Proxmox hosts failed to reconnect to each other
- Virtual machines lost their public internet connections
- Service impact varied: some VMs recovered within minutes, others remained offline
DNS Resolution Problems
- Proxmox nodes attempted to restore the cluster but encountered DNS resolution failures
- Nodes could not resolve each other's hostnames to IP addresses
- We implemented a temporary fix by adding IP addresses to local hosts files
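As a rough sketch of that temporary fix: check whether each cluster peer's hostname still resolves and, if not, print the /etc/hosts lines that would restore name resolution without DNS. The node names and IP addresses below are hypothetical examples.

```python
#!/usr/bin/env python3
"""Check hostname resolution for cluster peers and emit /etc/hosts entries
for any that fail. Node names and IPs are placeholders for illustration."""
import socket

# Hypothetical Proxmox cluster peers and their known management IPs.
CLUSTER_PEERS = {
    "pve01.internal": "10.0.20.11",
    "pve02.internal": "10.0.20.12",
    "pve03.internal": "10.0.20.13",
}

def resolves(hostname: str) -> bool:
    try:
        socket.gethostbyname(hostname)
        return True
    except socket.gaierror:
        return False

for hostname, ip in CLUSTER_PEERS.items():
    if not resolves(hostname):
        # Line to append to /etc/hosts so peers can find each other without DNS.
        print(f"{ip}\t{hostname}")
```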
Virtualization Network Issues
- Despite DNS fixes, Proxmox nodes still would not come back online
- Proxmox refused to apply network configuration changes because the cluster was not active
- Network connectivity for virtual machines remained down despite physical network availability
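A minimal sketch of checking whether the local node believes the cluster has quorum before attempting network changes. It assumes the output of Proxmox's pvecm status includes a "Quorate:" line, which may be formatted differently across versions; treat it as an illustration rather than our exact tooling.

```python
#!/usr/bin/env python3
"""Report whether the local Proxmox node believes the cluster is quorate.
Assumes `pvecm status` output includes a 'Quorate:' line; adjust for your
Proxmox version if it formats the output differently."""
import subprocess

def cluster_is_quorate() -> bool:
    try:
        output = subprocess.run(
            ["pvecm", "status"],
            capture_output=True, text=True, check=True,
        ).stdout
    except (OSError, subprocess.CalledProcessError):
        # pvecm missing or the cluster stack is down entirely.
        return False
    for line in output.splitlines():
        if line.strip().lower().startswith("quorate:"):
            return "yes" in line.lower()
    return False

if __name__ == "__main__":
    print("quorate" if cluster_is_quorate()
          else "NOT quorate - network configuration changes will be refused")
```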
Resolution Steps
To resolve the outage, we:
- Disabled clustering on all Proxmox nodes
- Manually reconfigured virtual networking on a node-by-node basis
- Restored service to all offline machines
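A simplified sketch of the node-by-node part of this recovery: reapply the network configuration on each hypervisor over SSH, stopping at the first failure. It assumes the nodes use ifupdown2 (where ifreload -a re-applies /etc/network/interfaces); the hostnames are placeholders, and the actual recovery also involved first decoupling the cluster on each node.

```python
#!/usr/bin/env python3
"""Reapply network configuration on each hypervisor, one node at a time.
Assumes ifupdown2 is in use (ifreload -a); node names are placeholders."""
import subprocess

NODES = ["pve01.internal", "pve02.internal", "pve03.internal"]

for node in NODES:
    print(f"--> reloading network configuration on {node}")
    result = subprocess.run(
        ["ssh", f"root@{node}", "ifreload", "-a"],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        # Stop on the first failure so a broken node can be inspected by hand.
        print(f"failed on {node}: {result.stderr.strip()}")
        break
```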
Unrelated Coincidental Issues
During the same timeframe, we also encountered:
- Two power supply failures in servers (redundant supplies, so no additional impact)
- Two history machines ran out of storage due to excessive logging generated during the outage
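For completeness, a tiny sketch of the kind of free-space check that would flag a filling log volume earlier; the mount point and threshold are hypothetical.

```python
#!/usr/bin/env python3
"""Warn when a filesystem holding logs crosses a usage threshold.
Mount point and threshold are illustrative placeholders."""
import shutil

MOUNT_POINT = "/var/log"   # hypothetical log volume
THRESHOLD = 0.90           # warn above 90% used

usage = shutil.disk_usage(MOUNT_POINT)
used_fraction = (usage.total - usage.free) / usage.total
if used_fraction > THRESHOLD:
    print(f"{MOUNT_POINT} is {used_fraction:.0%} full - log rotation or cleanup needed")
```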
Follow-up Actions
We have identified several high-priority investigation and remediation tasks:
Super High Priority:
- Determine why Proxmox disabled public networking on guest VMs while physical networks were operational
High Priority:
- Investigate why functional switch ports were incorrectly disabled by RSTP
Medium Priority:
- Re-enable clustering on Proxmox hosts with improved resilience
Low Priority:
- Investigate DNS failure mechanisms
- Examine the original switch that began flooding packets
Preventative Measures
Based on our initial findings, we are implementing the following measures:
- Pre-configuring local hosts files on all nodes to reduce DNS dependency
- Enhancing network redundancy and isolation between management and production traffic
- Improving automated recovery procedures for virtualization clusters
- Adding monitoring specifically for virtualization network states
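As an example of that last measure, a minimal sketch that watches the operational state of the Linux bridges carrying guest traffic. The bridge names are placeholders for whatever bridges are in use (Proxmox typically names them vmbr0, vmbr1, and so on); it only relies on the standard /sys/class/net interface.

```python
#!/usr/bin/env python3
"""Check the operational state of the Linux bridges that carry VM traffic.
Bridge names are placeholders; Proxmox typically uses vmbr0, vmbr1, ..."""
from pathlib import Path

BRIDGES = ["vmbr0", "vmbr1"]  # hypothetical bridge names

for bridge in BRIDGES:
    state_file = Path(f"/sys/class/net/{bridge}/operstate")
    if not state_file.exists():
        print(f"{bridge}: missing - bridge not configured on this node")
        continue
    state = state_file.read_text().strip()
    if state != "up":
        # A bridge that is down means guests on it have no connectivity,
        # even if the physical uplinks are fine.
        print(f"{bridge}: {state} - guest networking likely impacted")
```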
We appreciate your understanding and patience during this outage. While we pride ourselves on maintaining highly redundant infrastructure, we acknowledge that no system can achieve 100% uptime. We remain committed to learning from this incident and strengthening our systems to minimize future disruptions.
Yours,
Wietse - XRPL Labs CEO