Incident Report for XRPL Labs, XUMM

Postmortem

Postmortem: March 1 Infrastructure Outage

Executive Summary

On March 1, 2025, XRPL Labs experienced a significant service disruption affecting the Xaman app backend, XRPL Cluster public nodes, and Xahau public nodes and hubs. The incident began with a networking switch malfunction in our primary data center that created a cascade of failures across multiple infrastructure layers. Services were impacted for varying durations, from 5 minutes to 2 hours. While our infrastructure experienced downtime, the XRP Ledger and Xahau networks themselves remained fully operational during this period.

Root Cause: A network switch flooding packets triggered protective measures that unexpectedly cascaded into virtualization cluster failures and network connectivity issues for critical services.

Resolution: The issue was resolved by decoupling the virtualization cluster nodes and manually restoring network connectivity to virtual machines.

Detailed Timeline and Technical Analysis

Initial Incident (Late Afternoon CET, March 1)

  • Our monitoring systems detected increasing latency across switch ports in the primary data center
  • A networking switch began producing an abnormally high volume of network packets, effectively flooding the infrastructure
  • We made the decision to remove the problematic switch from production
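
To make the flooding detection above more concrete, here is a minimal sketch (not our production monitoring) of the kind of counter-based check that surfaces a flooding port from a connected Linux host: it samples kernel receive-packet counters and flags abnormal rates. The interface names, sample interval, and threshold are illustrative assumptions.

    import time
    from pathlib import Path

    # Assumed values for illustration only: which interfaces to watch, how often
    # to sample, and what packet rate counts as "flooding".
    INTERFACES = ["eno1", "eno2"]   # hypothetical interface names
    INTERVAL_S = 5                  # seconds between counter samples
    THRESHOLD_PPS = 500_000         # packets/second considered abnormal

    def rx_packets(iface: str) -> int:
        """Read the kernel's cumulative receive-packet counter for an interface."""
        return int(Path(f"/sys/class/net/{iface}/statistics/rx_packets").read_text())

    def watch() -> None:
        previous = {iface: rx_packets(iface) for iface in INTERFACES}
        while True:
            time.sleep(INTERVAL_S)
            for iface in INTERFACES:
                current = rx_packets(iface)
                pps = (current - previous[iface]) / INTERVAL_S
                if pps > THRESHOLD_PPS:
                    print(f"ALERT: {iface} receiving {pps:,.0f} packets/second")
                previous[iface] = current

    if __name__ == "__main__":
        watch()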

Cascade of Failures

  1. Network Reconfiguration Issues

    1. When the switch was removed, RSTP (Rapid Spanning Tree Protocol) began recalculating the spanning-tree topology (the forwarding paths between switches)
    2. RSTP unexpectedly shut down additional switch ports that were actually functioning normally
    3. This created a brief disconnection of infrastructure management interfaces (Dell iDRAC and Proxmox management VLANs)
  2. Virtualization Cluster Failure

    1. When management interfaces came back online, Proxmox hosts failed to reconnect to each other
    2. Virtual machines lost their public internet connections
    3. Service impact varied: some VMs recovered within minutes, others remained offline
  3. DNS Resolution Problems

    1. Proxmox nodes attempted to restore the cluster but encountered DNS resolution failures
    2. Nodes could not resolve each other's hostnames to IP addresses
    3. We implemented a temporary fix by adding IP addresses to local hosts files (see the sketch after this list)
  4. Virtualization Network Issues

    1. Despite DNS fixes, Proxmox nodes still would not come back online
    2. The system refused to apply network configuration changes because the cluster was not active
    3. Network connectivity for virtual machines remained down despite physical network availability
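
As a hedged illustration of the hosts-file workaround from step 3 above, the following sketch checks whether each cluster peer still resolves through DNS and, when it does not, prints an /etc/hosts entry taken from a static map. The hostnames and addresses are hypothetical placeholders, not our real inventory.

    import socket

    # Hypothetical peer inventory; in practice this would mirror the cluster's
    # actual node hostnames and management addresses.
    PEERS = {
        "pve-node1.example.internal": "10.0.0.11",
        "pve-node2.example.internal": "10.0.0.12",
        "pve-node3.example.internal": "10.0.0.13",
    }

    def resolves(hostname: str) -> bool:
        """Return True if the hostname resolves through the normal resolver."""
        try:
            socket.gethostbyname(hostname)
            return True
        except socket.gaierror:
            return False

    if __name__ == "__main__":
        for host, addr in PEERS.items():
            if resolves(host):
                print(f"ok      {host}")
            else:
                # Line suitable for appending to /etc/hosts as a stopgap.
                print(f"append  {addr}\t{host}")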

Resolution Steps

To resolve the outage, we:

  1. Disabled clustering on all Proxmox nodes
  2. Manually reconfigured virtual networking on a node-by-node basis
  3. Restored service to all offline machines
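
For readers who want a concrete picture of the per-node verification this kind of manual recovery involves, here is a minimal sketch that pings a list of guest VM addresses for each node. The inventory below is a hypothetical placeholder; this is an illustration, not Proxmox tooling or our exact procedure.

    import subprocess

    # Hypothetical inventory: node name -> guest VM addresses expected to answer.
    VMS_BY_NODE = {
        "pve-node1": ["10.0.1.21", "10.0.1.22"],
        "pve-node2": ["10.0.1.31"],
    }

    def reachable(addr: str) -> bool:
        """Send a single ICMP echo with a short timeout."""
        result = subprocess.run(
            ["ping", "-c", "1", "-W", "2", addr],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        return result.returncode == 0

    if __name__ == "__main__":
        for node, addrs in VMS_BY_NODE.items():
            for addr in addrs:
                status = "up" if reachable(addr) else "STILL DOWN"
                print(f"{node} {addr}: {status}")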

Unrelated Coincidental Issues

During the same timeframe, we also encountered:

  • Two power supply failures in servers (redundant supplies, so no additional impact)
  • Two history machines whose storage filled up due to excessive logging during the outage

Follow-up Actions

We have identified several high-priority investigation and remediation tasks:

  1. Super High Priority:

    1. Determine why Proxmox disabled public networking on guest VMs while physical networks were operational
  2. High Priority:

    1. Investigate why functional switch ports were incorrectly disabled by RSTP
  3. Medium Priority:

    1. Re-enable clustering on Proxmox hosts with improved resilience
  4. Low Priority:

    1. Investigate DNS failure mechanisms
    2. Examine the original switch that began flooding packets

Preventative Measures

Based on our initial findings, we are implementing the following measures:

  1. Pre-configuring local hosts files on all nodes to reduce DNS dependency
  2. Enhancing network redundancy and isolation between management and production traffic
  3. Improving automated recovery procedures for virtualization clusters
  4. Adding monitoring specifically for virtualization network states
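
As an example of what measure 4 could look like in practice, the sketch below checks that each Linux bridge carrying guest traffic is up and still has ports attached. The bridge names are assumptions for illustration and the check is a simplified stand-in for real monitoring.

    from pathlib import Path

    BRIDGES = ["vmbr0", "vmbr1"]  # hypothetical bridge names

    def bridge_status(bridge: str) -> str:
        """Report whether a Linux bridge is up and still has ports attached."""
        base = Path(f"/sys/class/net/{bridge}")
        if not base.exists():
            return "missing"
        operstate = (base / "operstate").read_text().strip()
        if operstate != "up":
            return f"operstate={operstate}"
        brif = base / "brif"
        ports = [p.name for p in brif.iterdir()] if brif.exists() else []
        if not ports:
            return "up, but no ports attached"
        return f"up, ports: {', '.join(ports)}"

    if __name__ == "__main__":
        for bridge in BRIDGES:
            print(f"{bridge}: {bridge_status(bridge)}")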

We appreciate your understanding and patience during this outage. While we pride ourselves on maintaining highly redundant infrastructure, we acknowledge that no system can achieve 100% uptime. We remain committed to learning from this incident and strengthening our systems to minimize future disruptions.

Yours,
Wietse - XRPL Labs CEO

Posted Mar 01, 2025 - 22:27 CET

Resolved

Dear Xaman app users and XRPL/Xahau network participants,

We experienced a service disruption today affecting Xaman app services and our public nodes. The issue has been fully resolved, and all services are now operational.

What happened?

A networking issue in our primary data center caused a cascading failure. The outage lasted between 5 minutes and 2 hours for different services. During this time, you may have been unable to view recent transactions or submit new ones through the Xaman app.

Important to know:

The XRP Ledger and Xahau blockchains continued to operate normally. No funds were at risk during this outage. All transactions that were properly submitted to the network (even through other services) were processed.

We apologize for any inconvenience this may have caused and are implementing additional safeguards to prevent similar issues in the future.

For those interested in technical details, a full postmortem is available.

The XRPL Labs Team
Posted Mar 01, 2025 - 22:26 CET
This incident affected: XUMM API / SDK (XUMM Developer API/SDK) and XRP Ledger - Public nodes (xrplcluster.com (XRPL Mainnet)).