On December 2nd, 2021, between 14:45 UTC and 17:05 UTC, clients experienced degraded performance and availability of the ShotGrid service.
We observed an increased error rate shortly after a new version of ShotGrid was released. Whilst investigating, we decided to roll back to the previous version of ShotGrid out of an abundance of caution.
Our investigation subsequently identified that an infrastructure maintenance job, which was not expected to affect the ShotGrid service, was triggering rapid process respawning on our app components. This increased memory and CPU pressure and impaired the app's ability to process client requests.
Clients sporadically received HTTP 500, 502, 503 and 504 responses when making requests to ShotGrid during the incident window.
We are improving our infrastructure code to prevent unnecessary process respawning on ShotGrid app components.