The “Query of Death” – When a Single Query Brings Everything Down

A system outage can occur when a single bad query runs repeatedly, exhausting resources or even crashing the app. The immediate fix is to block or redirect the query before it reaches the system; rate limiting and circuit breakers can also help. This pattern, known as the “Query of Death,” shows why stopping bad queries early is important for keeping the system stable.

The Scenario

Imagine you’re on call late at night, and suddenly your application becomes unresponsive. The monitoring tools show a significant spike in resource usage. After digging through the logs, you find the culprit: a single database query, running over and over again like a bad horror movie on repeat. Whether it’s due to a bug, a poorly written request, or just bad luck, this one query is hogging all the CPU and memory, bringing your entire system to its knees. And the worst part? Every time it runs, the app crashes again, creating an endless cycle of outages.

Mitigating the Impact

The quickest way to remedy this situation is to prevent the query from ever reaching your application. One approach is to configure your ingress controller or API gateway to detect this specific query and return a 404 error or a redirect response instead of passing it through to the backend. This can be achieved by setting up a rule that matches the query’s characteristics and intercepts it before it causes damage. Additionally, you could deploy rate limiting, which caps how often a query may run, or circuit breakers, which stop forwarding requests once a query starts failing repeatedly.
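To make the idea of a matching rule more concrete, here is a minimal sketch in Python. It is not the configuration of any particular gateway or ingress controller; the names `BlockRule`, `should_block`, and `handle`, as well as the example path and parameter, are hypothetical and chosen only for illustration. It assumes the offending query can be recognized by its path and one query parameter.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple
from urllib.parse import parse_qs, urlsplit


@dataclass(frozen=True)
class BlockRule:
    """Hypothetical rule: match requests to `path` where `param` equals `value`."""
    path: str
    param: str
    value: str


def should_block(url: str, rules: Sequence[BlockRule]) -> bool:
    """Return True if the URL matches any block rule."""
    parts = urlsplit(url)
    params = parse_qs(parts.query)  # e.g. {"q": ["bad"]}
    for rule in rules:
        if parts.path == rule.path and rule.value in params.get(rule.param, []):
            return True
    return False


def handle(url: str, rules: Sequence[BlockRule],
           backend: Callable[[str], str]) -> Tuple[int, str]:
    """Answer blocked requests with 404; forward everything else to the backend."""
    if should_block(url, rules):
        return 404, "Not Found"  # the backend is never called
    return 200, backend(url)


# Example: block the (assumed) offending search query, let others through.
rules = [BlockRule(path="/search", param="q", value="bad")]
status, body = handle("/search?q=bad", rules, backend=lambda u: "results")
```

In a real setup this check would live in the gateway or ingress layer rather than in application code, so that the blocked request never consumes backend resources.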

Understanding the Pattern & Preventing Future Incidents

The “Query of Death” is a common pattern where a specific query or API call causes an application to crash, lock up, or consume excessive resources. This pattern often occurs when a query is poorly optimized, contains a logical flaw, or is repeatedly executed in an unintended way.

The primary solution in these scenarios is to block or redirect the offending query before it reaches the application. This can be accomplished through various means, such as configuring the API gateway, ingress controller, or firewall to detect and handle the query differently. By preventing the query from executing, you stabilize the application and can then investigate and resolve the root cause without the pressure of an ongoing outage.
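Where blocking a known query is the manual fix, a circuit breaker is one way to automate a similar protection. The following is a simplified sketch under stated assumptions, not a production implementation: the class name `CircuitBreaker`, its parameters `max_failures` and `reset_after`, and the `CircuitOpenError` exception are all hypothetical. After a number of consecutive failures the breaker rejects calls for a cool-down period, then lets a single trial call through.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without running it."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cool-down in seconds
        self.clock = clock                # injectable for testing
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; call rejected")
            # Cool-down elapsed: allow one trial call. A single failure
            # reopens the circuit immediately.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # success resets the failure count
        return result
```

Wrapping the expensive query in `breaker.call(run_query, ...)` (with `run_query` standing in for your own function) means that once the query has failed repeatedly, further attempts fail fast instead of consuming CPU and memory, which gives you room to investigate the root cause.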

Stay tuned!

Is your application ready to handle sudden spikes in load without crashing? In our next installment, we’ll discuss managing high resource demands to keep your system running smoothly through the night—and how to do better in the future.

More Info & Contact

Prevent the “Query of Death” from bringing your system down – our experts will help you find the ideal solution! Reach out to us at hello@qualityminds.de or call us at +49 911 660732011!

This post is part of a larger series about typical outage patterns and how to mitigate them in the short term while on call. Please visit Foreword: Navigating the Storms of Software Outages | QualityMinds to find all other posts.

written by

Matthias Thubauville