Service Degradation - ShotGrid
Incident Report for Flow Production Tracking
Postmortem

What happened?

On December 2nd, 2021, between 14:45 UTC and 17:05 UTC, clients experienced degraded performance and availability of the ShotGrid service.

We observed an increased error rate shortly after a new version of ShotGrid was released. While investigating, we decided to roll back to the previous version of ShotGrid out of an abundance of caution.

Our investigation subsequently identified that an infrastructure maintenance job, which was not expected to affect the ShotGrid service, was triggering rapid process respawning on our app components. The respawning increased memory and CPU pressure and impaired the app's ability to process client requests.

Scope of impact

Clients sporadically received HTTP 500, 502, 503 and 504 responses when making requests to ShotGrid during the incident window.

What will we do to prevent this incident from happening again?

We are improving our infrastructure code to prevent unnecessary process respawning on ShotGrid app components.
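
For readers unfamiliar with the failure mode, the following is a purely illustrative sketch of one common way to limit rapid process respawning: exponential backoff between restarts. It is not the actual ShotGrid infrastructure code, and the class name `RespawnThrottle` and all of its parameters are hypothetical.

```python
class RespawnThrottle:
    """Hypothetical helper that spaces out process restarts with exponential backoff."""

    def __init__(self, base_delay=1.0, max_delay=60.0, stable_after=300.0):
        self.base_delay = base_delay      # delay after the first failure, in seconds
        self.max_delay = max_delay        # upper bound on any single delay
        self.stable_after = stable_after  # uptime (seconds) after which the failure count resets
        self.failures = 0

    def next_delay(self, uptime):
        """Return the wait, in seconds, before respawning a process that ran for `uptime` seconds."""
        if uptime >= self.stable_after:
            # The process ran long enough to count as healthy, so start over.
            self.failures = 0
        self.failures += 1
        # Double the delay on each consecutive quick failure, capped at max_delay.
        return min(self.base_delay * 2 ** (self.failures - 1), self.max_delay)
```

A supervisor loop built on this sketch would sleep for `next_delay(uptime)` seconds before restarting a process that exited, so that repeated quick exits lead to progressively longer pauses instead of a tight respawn loop that consumes memory and CPU.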

Posted Dec 06, 2021 - 14:52 UTC

Resolved
This incident has been resolved.
Posted Dec 02, 2021 - 18:55 UTC
Monitoring
The issue has been fixed and we are monitoring the health of the stack.
Posted Dec 02, 2021 - 17:54 UTC
Update
We are currently rolling back to the previous version to fix the memory allocation issues.
Posted Dec 02, 2021 - 17:17 UTC
Investigating
We are observing some failed requests (502, 503, 504 errors) to the ShotGrid service which may impact site availability for some clients. This issue is under investigation.
Posted Dec 02, 2021 - 16:37 UTC
This incident affected: Flow Production Tracking.