API Degraded performance
Resolved
Mar 19 at 05:15pm CET
Post-Mortem: EU Region Request Timeout Incident
Incident Overview
Incident Start: 19th March 2025, 15:15 UTC
Incident End: 19th March 2025, 16:15 UTC
Impact: The incident caused timeouts for 4.5% of incoming requests, primarily affecting the EU region. The majority of affected requests were to the /v2/calculations endpoint. No security concerns were identified.
Timeline
- 19th March 2025, 15:15 UTC: The incident began immediately after a deployment, with a series of requests timing out in the EU region.
- 19th March 2025, 15:17 UTC: The issue was acknowledged, and the tech lead was engaged to investigate.
- 19th March 2025, 15:42 UTC: A potential root cause was identified, tied to high CPU usage and cache reloading following the release.
- 19th March 2025, 16:02 UTC: An immediate mitigation was deployed: the number of running containers was reduced and the cache reload worker was disabled.
- 19th March 2025, 16:15 UTC: The issue was fully resolved after containers were shuffled and the request backlog cleared.
Root Cause Analysis
The incident was triggered by a traffic spike and high CPU usage immediately following a release. The root cause was traced to the cache reload process running simultaneously across containers, which overwhelmed the Puma request queue and led to timeouts as the system struggled to handle the reload and incoming requests at the same time. Metrics showed no suspicious activity beyond the CPU and traffic spikes, confirming that the issue stemmed from the cache reload mechanism during the deployment.
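The post-mortem does not include the reload code itself, so the following is a minimal Ruby sketch of the kind of pattern that produces this failure mode: every container enqueues a full cache reload on boot, so a deploy triggers many reloads at once that compete with live traffic. The class names, file paths, and the CalculationCache helper are hypothetical, not the actual implementation.

```ruby
# Illustrative sketch only -- class names, file layout, and the exact boot
# trigger are assumptions; the real reload code is not shown in the post-mortem.

# app/sidekiq/cache_reload_job.rb
class CacheReloadJob
  include Sidekiq::Job

  def perform
    # CPU- and DB-heavy: rebuilds the cached calculation data in full.
    CalculationCache.rebuild_all!   # hypothetical cache rebuild helper
  end
end

# config/initializers/cache_warmup.rb
# Every container runs this on boot, so a deploy makes all containers enqueue
# a full reload at the same time, competing with live traffic for CPU and
# worker capacity -- the failure mode described above.
Rails.application.config.after_initialize do
  CacheReloadJob.perform_async
end
```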
Resolution and Recovery
To resolve the incident:
- The number of running containers was reduced to alleviate database strain.
- The cache reload worker was disabled to prevent further queuing (a kill-switch approach is sketched below).
- All pending requests were allowed to process fully.
- The cache sync job was re-enabled once the spike subsided.
By 16:15 UTC, the system stabilized, and normal operation resumed in the EU region.
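For the step of disabling and later re-enabling the cache reload worker, one common approach is a runtime kill switch. Below is a minimal sketch assuming a Redis-backed flag checked by the Sidekiq job; this is not necessarily how the worker was actually toggled during the incident.

```ruby
# Minimal kill-switch sketch. The Redis-backed flag and key name are
# assumptions; the actual mechanism used during the incident is not described.
class CacheReloadJob
  include Sidekiq::Job

  DISABLE_KEY = "cache_reload:disabled".freeze

  def perform
    # Skip the reload entirely while the flag is set (e.g. during an incident).
    return if Sidekiq.redis { |conn| conn.exists?(DISABLE_KEY) }

    CalculationCache.rebuild_all!   # hypothetical cache rebuild helper
  end
end

# Disable during an incident:
#   Sidekiq.redis { |conn| conn.set("cache_reload:disabled", "1") }
# Re-enable once things have stabilised:
#   Sidekiq.redis { |conn| conn.del("cache_reload:disabled") }
```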
Impact Assessment
Approximately 4.5% of requests timed out during the incident, primarily affecting the /v2/calculations endpoint in the EU region. The US region remained unaffected. While not a security threat, we acknowledge the disruption this caused for affected users. For assistance with impacted requests, such as re-running calculations, please contact the SQUAKE team.
Preventive Measures
To prevent recurrence, we are implementing the following:
- Enhanced Observability: Improving tooling to monitor application performance and resource usage more effectively, especially during deployments.
- Cache Reload Optimization: Revising the cache reload process to lock it during boot, so that Sidekiq does not enqueue parallel reload jobs across containers (sketched at the end of this section).
- Load Testing: Conducting pre-release load tests to ensure consistent performance under varying traffic conditions, even if unrelated to the specific release.
- Release Timing Strategy: Exercising greater caution in scheduling releases to avoid overlapping with peak traffic periods.

We are committed to learning from this incident and enhancing the reliability of our services moving forward.
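For the boot-time lock mentioned in the list above, here is a minimal sketch of one possible implementation, using a Redis SET NX lock shared by all containers so that only the first container to boot enqueues the reload job. The key name, TTL, and the job being enqueued are illustrative assumptions, not the actual SQUAKE implementation.

```ruby
# Illustrative sketch: a short-lived Redis lock so that, on a deploy, only the
# first container to boot enqueues the cache reload job.

# config/initializers/cache_warmup.rb
LOCK_KEY = "cache_reload:boot_lock".freeze
LOCK_TTL = 10 * 60 # seconds -- long enough to cover a rolling deploy

Rails.application.config.after_initialize do
  acquired = Sidekiq.redis do |conn|
    # SET ... NX EX succeeds only for the first container to set the key.
    conn.set(LOCK_KEY, "1", nx: true, ex: LOCK_TTL)
  end

  # Every other container sees the key already present and skips the enqueue,
  # so exactly one reload job runs per deploy instead of one per container.
  CacheReloadJob.perform_async if acquired
end
```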
Affected services
Production
Updated
Aug 23 at 09:45am CEST
Incident Overview
Incident Start: 22nd August 2024, 13:28 UTC
Incident End: 23rd August 2024, 09:45 UTC
Impact: API requests were met with 504 errors, primarily affecting traffic originating in the EU. The incident resulted in a failure to process 2.5% of incoming requests during the affected period.
Timeline
- 22nd August 2024, 13:28 UTC: The first signs of the issue were observed when response times drastically increased. In response, we scaled horizontally by increasing the number of containers, which temporarily mitigated the issue.
- 23rd August 2024, 07:45 UTC: Despite the horizontal scaling, the issue resurfaced: the requests spread across the additional machines eventually overwhelmed the system again. At this point, we identified insufficient compute power in our containers as the root cause.
- 23rd August 2024, 09:45 UTC: To address the issue, we scaled vertically by adding additional compute power to the machines, which successfully resolved the incident.
Root Cause Analysis
The outage was triggered by the launch of a new methodology (the GATE4 methodology), which significantly increased memory consumption due to the large dataset it processes. Simultaneously, we experienced an unexpected traffic spike, which initially complicated identification of the exact root cause. The combined effect of increased memory usage and traffic overwhelmed our system, leading to 504 errors.
Resolution and Recovery
Once the root cause was identified as insufficient compute power in our containers, we scaled the machines vertically by adding compute resources. This successfully mitigated the issue, and the API service was restored by 09:45 UTC on 23rd August 2024.
Impact Assessment
During the incident, approximately 2.5% of requests failed to process. We recognise the inconvenience this may have caused and are committed to ensuring such issues are addressed promptly. If you require any assistance, such as re-running calculations or investigating specific requests, please reach out to us.
Preventive Measures
To prevent similar incidents in the future, we are implementing the following actions:
- Improved Scaling Policies and Alarms: We are refining our scaling policies and alarms to better handle unexpected traffic spikes and resource demands (an illustrative alarm example follows after this list).
- Enhanced Notifications: We are adding new notification systems to ensure we are promptly alerted to potential issues, enabling us to take immediate action.
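The post-mortem does not say which cloud or monitoring stack is in use, so purely as an illustration, this is what an alarm on elevated 5xx responses could look like using the AWS SDK for Ruby against an Application Load Balancer metric. Every name, threshold, region, and ARN below is a placeholder assumption.

```ruby
require "aws-sdk-cloudwatch"

# Illustrative only: assumes an AWS Application Load Balancer in front of the
# API and an SNS topic wired to the on-call notification channel.
cloudwatch = Aws::CloudWatch::Client.new(region: "eu-central-1")

cloudwatch.put_metric_alarm(
  alarm_name: "api-elevated-5xx",
  namespace: "AWS/ApplicationELB",
  metric_name: "HTTPCode_Target_5XX_Count",
  dimensions: [{ name: "LoadBalancer", value: "app/example-api-lb/0123456789abcdef" }],
  statistic: "Sum",
  period: 60,                      # evaluate per minute
  evaluation_periods: 5,           # sustained for 5 minutes
  threshold: 50,                   # more than 50 5xx responses per minute
  comparison_operator: "GreaterThanThreshold",
  treat_missing_data: "notBreaching",
  alarm_actions: ["arn:aws:sns:eu-central-1:123456789012:oncall-alerts"]
)
```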
We take every incident seriously and are committed to learning and improving from each one. Our team is dedicated to preventing future occurrences and ensuring the reliability of our services.
Affected services
Production
Created
Aug 22 at 01:28pm CEST
We're experiencing degraded performance on the API /calculations endpoint.
Affected services
Production