Issue Summary:
Between 11:20am and 1:30pm UTC, some instances in our Sydney region began experiencing intermittent 404 errors.
During a routine scaling of our servers, new servers were not able to complete initialization leading to a subset of hosts becoming overwhelmed. As a result, a degraded experience for customers manifested through intermittent 404 errors from a portion of systems.
Resolution:
Our monitoring notified our team members who immediately began investigating the issue. Traffic was redirected from the affected systems to reduce the impact while the cause was further investigated. A reversion to a known working configuration was deployed while the identified cause was corrected.
Root Cause:
A recent change to an initialization script which is utilized during scaling events caused a significant delay during server startup which did not complete due to system load during the morning hours.
Mitigation:
After an audit of the adjustment, it was identified that the adjustment was safe to revert while the changes undergo further testing. We are performing additional testing on the proposed changes to ensure we will not experience delays during future scaling events.
Conclusion:
We acknowledge the impact this had on customers within the Sydney region. We are committed to improving our infrastructure scaling processes to prevent future recurrence. Thank you for your patience as we resolved this service disruption.