Sui Network Outage Resolved Swiftly Following Validator Collaboration

Peter Zhang  Nov 22, 2024 06:28  UTC 22:28


The Sui Mainnet suffered a significant outage on November 21, 2024, halting all network operations for about two and a half hours. Between 1:15 and 3:45 am PT, a crash loop affected every validator and prevented any transaction from being processed, according to the Sui Foundation.

Understanding the Incident

The issue stemmed from a bug in the congestion control code, specifically an assert! statement, which triggered a crash when the estimated execution cost was zero. This problem was linked to the TotalGasBudgetWithCap mode, briefly enabled in protocol version 63 and reintroduced in version 68. The bug manifested when the network received a transaction with a mutable shared object input and zero MoveCall commands, causing all validators to crash.
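To make the failure mode concrete, here is a minimal Rust sketch of how an assert! on a cost estimate can turn one unusual transaction into a crash. It is not Sui's actual code: the TxSummary type, the function names, and the cost model (gas budget capped in proportion to the number of MoveCall commands) are all assumptions made for illustration. The checked variant shows one possible defensive pattern and is not necessarily what the real fix does.

```rust
/// Hypothetical, simplified view of a transaction; not Sui's real types.
#[derive(Debug, Clone, Copy)]
pub struct TxSummary {
    pub gas_budget: u64,
    pub move_call_count: u64,
}

/// Assumed cost model for illustration: the gas budget, capped in
/// proportion to the number of MoveCall commands in the transaction.
pub fn estimated_cost(tx: &TxSummary, cap_per_move_call: u64) -> u64 {
    let cap = cap_per_move_call.saturating_mul(tx.move_call_count);
    tx.gas_budget.min(cap)
}

/// Failure pattern described in the article: an assert! that panics when
/// the estimate is zero. If every validator runs this on the same input,
/// every validator panics.
pub fn schedule_with_assert(tx: &TxSummary, cap_per_move_call: u64) -> u64 {
    let cost = estimated_cost(tx, cap_per_move_call);
    assert!(cost > 0, "estimated execution cost must be non-zero");
    cost
}

/// One possible defensive alternative (an assumption, not the actual fix):
/// report the zero estimate as an error the caller can handle, instead of
/// aborting the whole process.
pub fn schedule_checked(tx: &TxSummary, cap_per_move_call: u64) -> Result<u64, String> {
    match estimated_cost(tx, cap_per_move_call) {
        0 => Err(String::from("estimated execution cost is zero")),
        cost => Ok(cost),
    }
}
```

Under this assumed model, a transaction with zero MoveCall commands gets a cap of zero, so its estimate is zero regardless of its gas budget. The asserting version panics on that input, while the checked version returns an error that affects only the one transaction.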

The Role of Congestion Control

Congestion control in the Sui network manages the rate of transactions touching shared objects, ensuring the network does not become overloaded. This system was recently upgraded to improve shared object utilization by estimating transaction complexity more accurately. However, the upgrade inadvertently introduced the bug that caused the outage.
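The general idea of rate-limiting access to shared objects can be sketched in a few lines of Rust. This is an illustrative model only: the SharedObjectBudget type, the per-object limit, and the defer-when-over-budget rule are assumptions, not a description of Sui's implementation.

```rust
use std::collections::HashMap;

/// Hypothetical per-object budget tracker; names and policy are
/// illustrative assumptions, not Sui's implementation.
pub struct SharedObjectBudget {
    limit_per_object: u64,
    used: HashMap<String, u64>,
}

impl SharedObjectBudget {
    pub fn new(limit_per_object: u64) -> Self {
        Self { limit_per_object, used: HashMap::new() }
    }

    /// Returns true if the transaction fits within the budget of every
    /// shared object it touches (and records its cost), or false if it
    /// should be deferred.
    pub fn try_schedule(&mut self, object_ids: &[&str], cost: u64) -> bool {
        // Check every touched object before recording anything, so a
        // deferred transaction leaves the accounting unchanged.
        let fits = object_ids.iter().all(|id| {
            let used = self.used.get(*id).copied().unwrap_or(0);
            used.saturating_add(cost) <= self.limit_per_object
        });
        if !fits {
            return false;
        }
        for id in object_ids {
            let entry = self.used.entry((*id).to_string()).or_insert(0);
            *entry = entry.saturating_add(cost);
        }
        true
    }
}
```

In this model, the accuracy of the cost estimate directly determines how much work is admitted per object: overestimates leave capacity unused, while underestimates let a hot object become overloaded.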

Resolution and Response

Upon identifying the problem, Sui engineers promptly devised a fix. The corrective code, detailed in PR #20365, was deployed to both the Mainnet and Testnet in versions v1.37.4 and v1.38.1, respectively. The rapid deployment was facilitated by an outstanding response from the validator community, enabling the network to resume operations within 15 minutes of releasing the fix.

Lessons and Future Improvements

This incident underscored the effectiveness of Sui's incident detection and response mechanisms. Automated alerts promptly notified engineers, who collaborated with the validator community to address the issue swiftly. Moving forward, Sui plans to enhance its testing systems to prevent similar bugs and streamline its build workflows to reduce incident response times.

For more detailed information, please visit the Sui Foundation.


