Introduction:
On August 20, 2024, between 09:20 and 10:05 MDT and again between 14:45 and 15:15 MDT, customers experienced errors when performing actions that required a database connection.
Issue Summary:
Affected customers were unable to perform actions that use a database connection. Connections were held up by a memory limit on the database servers.
Resolution:
Engineering teams stabilized the environment, determined the cause, and deployed a fix to address an unexpected database state.
Root Cause:
A host entered an unexpected state after handling a series of requests.
Solution and Mitigation:
The system was stabilized by restarting the impacted host. Monitors were put in place to detect this issue, and the appropriate teams are working on improving request handling.
Note: We will also perform maintenance during a brief 30-minute window on Friday, August 23, beginning at 22:00 MDT, to adjust additional memory management parameters.
Conclusion:
We recognize that this incident affected our US customers during normal working hours. We are committed to enhancing our architecture so that smaller services cannot impact overall service availability. Thank you for your patience as we worked through this service disruption.