APAC instances intermittently experiencing "400 Bad Request" errors when accessing Admin Console
Introduction:
On May 2, 2024, between 23:50 UTC and 01:10 UTC the following day, some customers experienced "400 Bad Request" errors when performing any action that required a database connection.
Issue Summary:
Customers affected during the outage were unable to process user logins or other core requests that depended on the database. Connections were held up by code that locked database tables, and these locks prevented affected users from performing any action that needed the database.
Resolution:
A fix was implemented that prevents the unintended database lock. We rolled the change out to production and monitored it until it was clear that the lock-ups were no longer occurring.
Root Cause:
A database migration that was thought to be safe locked a set of database tables, ultimately preventing new connections from being made to the database.
Solution and Mitigation:
A fix was deployed to stop the immediate issue. We are investigating ways to keep core functionality available even when a dependent service is failing. We are also taking a deeper look at our architecture and removing unnecessary dependencies between our internal systems. Lastly, we are reviewing our testing environment and finding ways to better simulate real user behavior, giving us a clearer picture of how our changes will affect our users.
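One common shape for this kind of mitigation, sketched below under stated assumptions (SQLite again, and the `migrate_with_backoff` helper is hypothetical, not our actual tooling), is to run schema changes with a short lock timeout and retry with backoff, so that a busy database makes the migration wait rather than the migration making user requests wait:

```python
import os
import sqlite3
import tempfile
import time

# Hypothetical demo database; names are illustrative only.
db = os.path.join(tempfile.mkdtemp(), "demo.db")
conn = sqlite3.connect(db)
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.commit()
conn.close()

def migrate_with_backoff(path, ddl, attempts=5, delay=0.05):
    """Apply a DDL statement, backing off and retrying if the database is locked."""
    for attempt in range(attempts):
        conn = sqlite3.connect(path, timeout=0.1)  # give up the lock wait quickly
        try:
            conn.execute(ddl)
            conn.commit()
            return True
        except sqlite3.OperationalError:           # "database is locked"
            time.sleep(delay * (attempt + 1))      # back off, then retry
        finally:
            conn.close()
    return False  # surface the failure instead of blocking user traffic

ok = migrate_with_backoff(db, "ALTER TABLE users ADD COLUMN email TEXT")
print(ok)  # → True here, since no competing writer holds the lock
```

The key design choice is that the migration, not the user-facing request path, is the side that yields and retries when there is contention for the lock.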
Conclusion:
We recognize that this incident had a major impact on our customers in Australia during their normal working hours, as well as on customers who run 24/7 businesses. We are committed to adopting rollout strategies that account for all customers' working hours, and to reviewing our architecture so that smaller services cannot lock up core functionality. We thank you all for your patience, and we are eager to improve our systems so we can continue to support everyone on their digital transformation journey.