The Scenario: Your Code’s Fine. The Problem Isn’t Yours.
Your app starts misbehaving. Errors spike. Features break. You brace yourself, expecting a fire in your own systems – but everything checks out.
So what’s going on? The issue isn’t in your code. It’s upstream. A critical third-party service is failing or throttling your requests. Maybe your payment gateway is down. Maybe an external API is rejecting calls because you’ve hit a rate limit. Either way, your app is waiting on responses that never come – and your users are left with broken features. Let’s look at what’s really happening, how to deal with it fast, and how to make sure it doesn’t take you down next time.
There are two common scenarios:
- Outage on Their End:
Let’s say you rely on a payment gateway to handle transactions. Suddenly, their service slows to a crawl or goes completely dark. Your code is ready to process the payment, but nothing’s coming back. Checkout fails. Revenue drops. Your hands? Tied.
- You’re Being Throttled:
Now imagine traffic spikes – maybe you’re running a campaign, maybe you just hit the front page. Your app starts bombarding an external API with requests. But there’s a limit. Exceed it, and the API starts throttling or rejecting calls altogether. Your users see broken pages, not rate-limit errors.
In both cases, the outcome is the same: a critical feature stops working, and the root cause lies outside your system. A third-party service you rely on is either down or enforcing constraints like rate limits. As the on-call engineer, you must piece this puzzle together: why are these errors happening, and how can you mitigate the impact right now?
The Quick Fix: Reduce the Blast Radius
When the outage hits, you need to act fast to minimize user impact. Here’s what you can do in the moment:
- Disable Dependent Features: If payments are broken, disable checkout. Consider fallback options like “Cash on Delivery” if available. Don’t let one broken integration take down the whole app.
- Cache or Default Responses: For read-only APIs, serve cached data instead of failing calls. Slightly outdated info is still better than showing an error (see the sketch after this list).
- Tune Your Circuit Breakers: If you’ve implemented circuit breakers, now’s the time to adjust them. Cut off failing services quickly to preserve system stability.
- Apply Client-side Rate Limits: Throttle your own calls to stay below third-party limits until things stabilize.
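To make the caching and throttling ideas concrete, here’s a minimal Python sketch: it spaces out our own outgoing calls and falls back to the last cached response when the provider fails. The names and numbers (`fetch_from_provider`, the TTL, the call interval) are illustrative, not any specific provider’s API.

```python
import time

# Minimal sketch: serve cached data when the upstream call fails, and
# throttle our own outgoing requests so we stay under the provider's limit.
# All names and limits here are illustrative.

CACHE_TTL = 300          # seconds we are willing to serve stale data
MIN_INTERVAL = 0.2       # at most ~5 outgoing calls per second

_cache = {}              # key -> (timestamp, payload)
_last_call = 0.0

def fetch_with_fallback(key, fetch_from_provider):
    """Call the provider, but fall back to the last known good response."""
    global _last_call

    # Client-side rate limit: space out our own calls.
    wait = MIN_INTERVAL - (time.monotonic() - _last_call)
    if wait > 0:
        time.sleep(wait)

    try:
        _last_call = time.monotonic()
        payload = fetch_from_provider(key)         # may raise on outage or 429
        _cache[key] = (time.monotonic(), payload)  # refresh the cache
        return payload
    except Exception:
        cached = _cache.get(key)
        if cached and time.monotonic() - cached[0] < CACHE_TTL:
            return cached[1]                       # slightly stale beats an error
        raise                                      # nothing usable, surface the failure
```

In a real system you would likely back this with a shared cache (e.g. Redis) and a proper rate-limiting library instead of module-level globals, but the shape of the logic stays the same.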
The Pattern: One Dependency, Many Failures
The real problem isn’t the outage itself, but rather the assumption that external systems are always available. Any time you rely on a third-party service for critical functionality, you’re building your app on someone else’s uptime. That’s fine – until it isn’t.
The Long-Term Fix: Build for Failure
To prevent this from happening again (or at least reduce the pain), build with failure in mind:
- Redundancy: For mission-critical services, use multiple providers. If one payment processor fails, route to a backup.
- Infrastructure as Code: Automate failover. If a provider goes down or throttles you, IaC can help you switch providers quickly and safely.
- Circuit Breakers: Don’t keep hammering a failing API. Use circuit breakers to detect repeated failures and temporarily block calls, avoiding cascading issues (a minimal sketch follows this list).
- Cache Strategically: Don’t fetch data repeatedly if it rarely changes. Cache responses, even short-term, to reduce load and dependency.
- Batch and Debounce: Don’t make 1000 tiny calls when one big call will do. Batching requests helps you stay within limits and reduces overhead (see the batching sketch below).
- Webhooks over Polling: Polling creates load. Webhooks push updates only when needed. Use them where possible.
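To illustrate the circuit breaker idea, here is a minimal sketch. The thresholds and the wrapped function are placeholders – in production you would normally reach for a battle-tested library (for Python, something like pybreaker) rather than hand-rolled state handling.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: after too many consecutive failures,
    stop calling the provider for a cool-down period instead of piling
    more requests onto a service that is already struggling."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None   # None means the circuit is closed (calls allowed)

    def call(self, func, *args, **kwargs):
        # Open circuit: fail fast until the cool-down has passed.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: upstream marked unavailable")
            self.opened_at = None   # half-open: allow one trial call

        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()   # trip the breaker
            raise
        else:
            self.failures = 0       # success resets the failure count
            return result
```

Usage would look like `breaker = CircuitBreaker()` followed by `breaker.call(charge_payment, order_id)`, where `charge_payment` stands in for whatever upstream call you are protecting.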
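And a similarly small sketch of the batching idea – `fetch_prices_bulk` is a stand-in for whichever bulk endpoint your provider actually offers:

```python
# Minimal sketch of batching: resolve many items with a few bulk calls
# instead of one request per item. `fetch_prices_bulk` is assumed to
# take a list of IDs and return a dict of id -> price.

def get_prices(product_ids, fetch_prices_bulk, batch_size=100):
    """Resolve many product prices with as few upstream calls as possible."""
    prices = {}
    for start in range(0, len(product_ids), batch_size):
        chunk = product_ids[start:start + batch_size]
        prices.update(fetch_prices_bulk(chunk))   # one call per chunk, not per item
    return prices
```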
Being Ready: What Resilience Looks Like
Third-party issues can be just as disruptive as your own bugs – if not more. Building resilience means preparing across three dimensions:
- Visibility: Monitor traffic patterns and external service health. Know your limits before you hit them.
- Resilience: Architect your app to degrade gracefully – disable features, have fallbacks, retry smartly.
- Proactive Planning: Talk to your providers before the crisis. Understand rate limits. Set up escalation paths. Know how to reach a human when the service stalls.
In Short
External services are powerful, but also potential points of failure. When they go down, you go down. By doing the legwork upfront, you can reduce the severity and duration of these incidents. So treat every integration as a potential failure point. Design your system to expect issues. And when the inevitable outage happens, you’ll be ready – not scrambling.
Stay tuned, Matthias
This blog post is part of our multi-part series, where we describe common software outages and help you resolve them quickly. You can find all other posts under Foreword: Navigating the Storms of Software Outages.
Send us an email – we look forward to hearing from you! hello@qualityminds.de or on LinkedIn