Introduction:
During the week of August 21, 2023, a subset of our customers in the US region experienced intermittent difficulty accessing their instances. We’ve traced the root cause to the rollout of our new Identity & Access Management (IAM) service and a subsequent data migration between IAM platforms.
Issue Summary:
Impact: A subset of customer instances returned errors at login.
Affected Regions: US
Root Cause:
Our in-depth review revealed that the SCIM services were under a heavier load than had previously been observed. Our DevOps teams opened a case with AWS for help with how our database was handling a higher-than-normal connection rate. AWS advised us to release the locks, which resolved the issue. Services were restored around 8:25 AM MDT on August 22, 2023, with the exception of SCIM, which was brought back online at approximately 9:00 AM MDT. This behavior recurred intermittently several times over the following week.
Solution and Mitigation:
The scenarios encountered during these outages have been added as test cases to our testing scenarios, which we use to validate ongoing changes and to detect regressions over time. The Engineering and DevOps teams have implemented optimizations to mitigate the locking issues that occur when higher-than-expected loads of SCIM activity affect the Service. These optimizations were deployed to production on August 31, 2023.
Conclusion:
This issue has been resolved. We have implemented mitigating measures to prevent recurrence, and additional environmental monitoring has been incorporated into our internal processes. In the event of any recurrence, we will proactively announce it on our Service Status page.