Service Degradation - ShotGrid
Incident Report for Flow Production Tracking
Postmortem

What happened?

On November 11th, 2021 at 14:10 UTC, an update to the ShotGrid service introduced intermittent performance degradation for some clients. The issue was identified and fixed on November 18th, 2021 at 13:45 UTC.

On November 16th, an investigation was launched into a cluster of support requests in which clients reported receiving HTTP 502 and 503 responses. Our monitoring tools did not trigger alerts because of the sporadic nature of the errors. We identified that a code change introduced into ShotGrid on November 11th was responsible for an increased volume of requests to our feature flag service, resulting in degraded performance. A fix addressing this performance degradation was deployed on November 18th.
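This report does not describe the code change itself. Purely as an illustration of how this class of problem arises, the sketch below shows how evaluating a feature flag on every incoming request multiplies traffic to a flag service, and how a short-lived in-process cache bounds that traffic. All names (FlagServiceClient, CachedFlagClient, is_enabled) are hypothetical and are not part of ShotGrid.

import time

class FlagServiceClient:
    """Hypothetical client for a feature flag service; each call is one request."""
    def __init__(self):
        self.request_count = 0

    def is_enabled(self, flag_name: str) -> bool:
        self.request_count += 1   # stands in for a network round trip
        return True               # stub value for the illustration

class CachedFlagClient:
    """Caches flag values for a short TTL so repeated checks within the TTL
    do not send additional requests to the flag service."""
    def __init__(self, client: FlagServiceClient, ttl_seconds: float = 30.0):
        self._client = client
        self._ttl = ttl_seconds
        self._cache = {}          # flag_name -> (value, expiry)

    def is_enabled(self, flag_name: str) -> bool:
        now = time.monotonic()
        hit = self._cache.get(flag_name)
        if hit and hit[1] > now:
            return hit[0]
        value = self._client.is_enabled(flag_name)
        self._cache[flag_name] = (value, now + self._ttl)
        return value

backend = FlagServiceClient()
flags = CachedFlagClient(backend)
for _ in range(1000):             # e.g. 1000 incoming requests
    flags.is_enabled("new_feature")
print(backend.request_count)      # 1 request to the flag service instead of 1000

The trade-off of such a cache is that flag changes can take up to the TTL to propagate, which is typically acceptable for this kind of configuration lookup.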

During the investigation, we identified additional conditions that could result in performance degradation due to inconsistencies in memory utilization. A fix containing memory allocation optimizations was deployed on November 18th, 2021 at 18:55 UTC, further reducing the number of errors experienced by clients.

Scope of impact

Clients sporadically received HTTP 502 or 503 responses when making requests to ShotGrid.

What will be done to prevent this incident from happening again?

Increased compute capacity has been allocated to the feature flag service to improve its performance and redundancy.

Improvements to our monitoring tools are being implemented to track and alert us to unusual patterns of 5XX errors.
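This report does not name the monitoring stack. As a minimal sketch of the idea only, assuming a simple sliding-window check over recent responses, an unusual 5XX rate could be detected roughly as follows; the window, threshold, and class names are assumptions, and in practice this logic would live in dedicated alerting tooling rather than application code.

from collections import deque
import time

class FiveXXAlert:
    """Tracks recent responses and signals when the share of 5XX responses
    inside a sliding time window exceeds a threshold."""
    def __init__(self, window_seconds: float = 300.0, threshold: float = 0.01,
                 min_samples: int = 100):
        self._window = window_seconds
        self._threshold = threshold      # e.g. alert above 1% 5XX responses
        self._min_samples = min_samples  # avoid alerting on tiny sample sizes
        self._events = deque()           # (timestamp, is_5xx)

    def record(self, status_code: int, now=None) -> bool:
        now = time.monotonic() if now is None else now
        self._events.append((now, 500 <= status_code <= 599))
        # Drop events that have fallen out of the window.
        while self._events and self._events[0][0] < now - self._window:
            self._events.popleft()
        total = len(self._events)
        errors = sum(1 for _, is_err in self._events if is_err)
        return total >= self._min_samples and errors / total >= self._threshold

alert = FiveXXAlert()
for code in [200] * 990 + [503] * 10:    # 1% errors within the window
    fired = alert.record(code)
print(fired)                             # True once the threshold is crossed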

Posted Nov 24, 2021 - 15:28 UTC

Resolved
So far so good: we have seen a significant drop in error rate since the deployment of the fix. We will continue to monitor actively over the coming days. A postmortem will be published here in a few days.
Posted Nov 18, 2021 - 21:19 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Nov 18, 2021 - 19:29 UTC
Identified
We are observing a low number of failed requests (502, 503 and 504) to the ShotGrid service, which impact some clients. This issue has been ongoing for a few days. We have identified a source related to out-of-memory errors and will deploy a fix soon.
Posted Nov 18, 2021 - 17:29 UTC
This incident affected: Flow Production Tracking and Notification Service.