External Calls Fail Sometimes: Exponential Backoff and a Retry Budget
Transient failures such as 429, 503, and timeouts can often be recovered by retrying. But never retry a non-idempotent write without an idempotency key, and keep total retry time well inside Apps Script's 6-minute execution limit.
External and service calls fail intermittently, so build retries in from the start. But don't retry just anything: first classify each operation as idempotent (safe to retry), safe with an idempotency key, or non-idempotent (risky to retry).
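One way to make the three-way classification concrete is to start from the HTTP method. This is a heuristic sketch, not a rule: it assumes the method's idempotency as defined in RFC 9110 reflects the operation's, and `classifyOperation` and `mayRetry` are hypothetical names.

```javascript
// Heuristic: treat the HTTP method's idempotency (RFC 9110) as the operation's.
// A specific API may document stronger or weaker guarantees; check its docs.
const IDEMPOTENT_METHODS = new Set(["GET", "HEAD", "OPTIONS", "TRACE", "PUT", "DELETE"]);

// Returns "safe", "safe-with-key", or "risky" for one outgoing call.
function classifyOperation(method, idemKey) {
  if (IDEMPOTENT_METHODS.has(method.toUpperCase())) return "safe";
  // POST, PATCH, and similar writes are retryable only when a key lets a
  // duplicate be recognized (by the receiving API or by your own bookkeeping).
  return idemKey ? "safe-with-key" : "risky";
}

// Only "risky" operations must not be retried.
function mayRetry(method, idemKey) {
  return classifyOperation(method, idemKey) !== "risky";
}

console.log(classifyOperation("get", null));        // safe
console.log(classifyOperation("POST", "order-42")); // safe-with-key
console.log(mayRetry("POST", null));                // false
```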
Why it matters
Without retries, a momentary 429 fails the whole job. Retry a non-idempotent write blindly and the same payment or the same email goes out twice. Both are expensive.
Transient failures only, with backoff
Retry on 408, 429, 500, 502, 503, 504, and network timeouts. Other errors, such as 400, 401, and 403, will fail the same way on every attempt, so stop immediately.
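// Retries fn() on transient failures with capped exponential backoff + jitter.
function withBackoff(fn, { max = 5 } = {}) {
  let wait = 400; // first delay in ms
  for (let attempt = 1; ; attempt++) {
    try {
      return fn();
    } catch (err) {
      if (attempt >= max || !isTransient(err)) throw err;
      Utilities.sleep(wait + Math.floor(Math.random() * 200)); // jitter
      wait = Math.min(wait * 2, 8000); // exponential, capped at 8 s
    }
  }
}

// Assumption: the status comes from err.status if your code sets it, otherwise
// from the "returned code NNN" text in UrlFetchApp's error message.
function isTransient(err) {
  const msg = String(err && err.message);
  const m = /returned code (\d{3})/.exec(msg);
  const status = (err && err.status) || (m ? Number(m[1]) : 0);
  if (status) return [408, 429, 500, 502, 503, 504].includes(status);
  return /timed? ?out/i.test(msg); // network timeouts
}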
- Add jitter so multiple runs don’t retry at the same instant.
- Keep total elapsed time (calls plus sleeps) under Apps Script's 6-minute execution limit. If a run might exceed it, save a cursor and continue on the next run.
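A retry loop can treat the remaining execution time as a budget. The sketch below adds a deadline check to the same backoff schedule as withBackoff, plus a cursor-based loop for the "continue on the next run" case. Assumptions: the clock, sleep, and randomness are injected so it runs outside Apps Script, and `retryWithBudget`, `processFrom`, and their options are hypothetical names.

```javascript
// Hypothetical helpers: the clock, sleep, and randomness are injected so the
// logic can run anywhere (in Apps Script you might pass Date.now and Utilities.sleep).

// Capped exponential backoff with jitter that refuses to sleep past `deadline`.
function retryWithBudget(fn, {
  maxAttempts = 5,
  baseMs = 400,
  capMs = 8000,
  deadline = Infinity,        // absolute timestamp in ms
  now = () => Date.now(),
  sleep,                      // required: (ms) => void, blocking
  isRetryable = () => true,   // e.g. an isTransient(err) check
  random = Math.random,
} = {}) {
  let wait = baseMs;
  for (let attempt = 1; ; attempt++) {
    try {
      return fn(attempt);
    } catch (err) {
      if (attempt >= maxAttempts || !isRetryable(err)) throw err;
      const delay = wait + Math.floor(random() * 200); // jitter
      if (now() + delay > deadline) throw err;          // budget exhausted: stop now
      sleep(delay);
      wait = Math.min(wait * 2, capMs);
    }
  }
}

// Process items from `cursor` on; stop while `reserveMs` of headroom remains.
// Returns the index to resume from on the next run, or null when finished.
function processFrom(items, cursor, handle, { deadline = Infinity, now = () => Date.now(), reserveMs = 30000 } = {}) {
  for (let i = cursor; i < items.length; i++) {
    if (now() + reserveMs > deadline) return i; // save this as the cursor
    handle(items[i]);
  }
  return null;
}
```

In a time-driven run you might set `deadline` to the start time plus 5 minutes, leaving headroom under the 6-minute limit, and persist the returned cursor in Script Properties; both choices are assumptions to tune for your workload.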
Non-idempotent writes need an idempotency key
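// Process once even if the same request arrives twice.
// props = PropertiesService.getScriptProperties(); idemKey identifies the request.
const doneKey = "done:" + idemKey;
const cached = props.getProperty(doneKey);
if (cached !== null) return JSON.parse(cached); // already done: replay stored result
const result = doWrite();
props.setProperty(doneKey, JSON.stringify(result ?? null)); // store the result, not just "1"
return result;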
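The same guard, written as a function so it can run outside Apps Script. Assumptions: the write's result is JSON-serializable and small enough to store as a property, and `writeOnce` and `makeMemoryStore` are hypothetical names. Two gaps remain. Concurrent executions can both pass the check, so in Apps Script you would wrap the check-and-write in a lock from `LockService.getScriptLock()` (`waitLock`, then `releaseLock` in a `finally`). And if the run dies between the write and `setProperty`, a retry repeats the write; only a key honored by the receiving service closes that gap.

```javascript
// `store` only needs getProperty/setProperty; in Apps Script that could be
// PropertiesService.getScriptProperties(). A Map-backed stand-in is used here.
function makeMemoryStore() {
  const m = new Map();
  return {
    getProperty: (k) => (m.has(k) ? m.get(k) : null),
    setProperty: (k, v) => { m.set(k, v); },
  };
}

// Runs doWrite at most once per idemKey and replays the stored result afterwards.
function writeOnce(store, idemKey, doWrite) {
  const key = "done:" + idemKey;
  const cached = store.getProperty(key);
  if (cached !== null) return JSON.parse(cached); // duplicate request: no second write
  const result = doWrite();                       // if this throws, nothing is recorded
  store.setProperty(key, JSON.stringify(result ?? null));
  return result;
}

// Usage: the same logical request (welcome email for user 7) arrives twice.
const store = makeMemoryStore();
let sent = 0;
const sendWelcome = () => { sent++; return { messageId: "msg-1" }; };
writeOnce(store, "welcome:user-7", sendWelcome);
const again = writeOnce(store, "welcome:user-7", sendWelcome);
console.log(sent, again.messageId); // 1 msg-1
```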
Deeper: normalize partial failure
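When you batch calls with UrlFetchApp.fetchAll, some requests may fail while others succeed. Set muteHttpExceptions: true on each request so error statuses come back as responses instead of exceptions, normalize each result as { ok, status, body, error }, then proceed with the successes and retry only the failures. Log a correlation id rather than the raw payload, so failures stay traceable without writing personal data or secrets to the logs.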
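A sketch of that normalization. Assumptions: each raw result is either an object with `getResponseCode()` and `getContentText()` (the methods on the HTTPResponse objects fetchAll returns, in request order) or an Error your own code captured; `requests` is your own list of request descriptors in the same order, each with a hypothetical `correlationId`; `normalize` and `splitBatch` are illustrative names.

```javascript
// Status codes treated as transient, matching the retry list earlier.
const TRANSIENT = new Set([408, 429, 500, 502, 503, 504]);

// One raw result -> { ok, status, body, error }. status 0 marks an Error
// (assumed to be a network-level failure such as a timeout).
function normalize(raw) {
  if (raw instanceof Error) return { ok: false, status: 0, body: null, error: raw.message };
  const status = raw.getResponseCode();
  const ok = status >= 200 && status < 300;
  return { ok, status, body: raw.getContentText(), error: ok ? null : "HTTP " + status };
}

// Pair results with requests by index, keep successes, and queue only
// transient failures for retry. Logs the correlation id, never the payload.
function splitBatch(requests, results, log = console.log) {
  const succeeded = [];
  const retry = [];
  const failed = [];
  results.forEach((raw, i) => {
    const res = normalize(raw);
    const req = requests[i];
    if (res.ok) {
      succeeded.push({ req, res });
      return;
    }
    log(`request ${req.correlationId} failed: ${res.error}`);
    const transient = res.status === 0 || TRANSIENT.has(res.status);
    (transient ? retry : failed).push(req);
  });
  return { succeeded, retry, failed };
}
```

The `retry` list can go back through the backoff loop while `failed` is reported; successes are never redone.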
One line to keep: retry only transient failures with backoff + jitter, and guard non-idempotent writes with an idempotency key.
Frequently asked questions
- Which HTTP status codes should trigger a retry?
- Retry only on transient failures: 408, 429, 500, 502, 503, 504, and network timeouts. Client errors like 400, 401, and 403 will fail the same way on retry, so stop immediately.
- Why is retrying a non-idempotent write dangerous?
- The same payment or the same email can be sent twice. Use an idempotency key to check whether the operation was already completed before executing the write.
- Why add jitter to exponential backoff?
- Without jitter, multiple failed runs retry at exactly the same moment and hammer the server again. Adding a small random delay spreads the retries out over time.
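To see the spreading effect, the sketch below computes per-client delays for the same schedule as withBackoff and for a common alternative often called "full jitter". `backoffDelay` is a hypothetical name, and `random` is injectable only so the schedule can be checked deterministically.

```javascript
// Delay before retry number `attempt` (1-based). "additive" mirrors withBackoff:
// the capped exponential wait plus 0-199 ms. "full" picks uniformly in [0, wait).
function backoffDelay(attempt, { baseMs = 400, capMs = 8000, mode = "additive", random = Math.random } = {}) {
  const wait = Math.min(baseMs * 2 ** (attempt - 1), capMs); // 400, 800, 1600, ... 8000
  if (mode === "full") return Math.floor(random() * wait);
  return wait + Math.floor(random() * 200);
}

// Five clients that failed together, each computing its third delay:
for (let client = 1; client <= 5; client++) {
  console.log(client, backoffDelay(3), backoffDelay(3, { mode: "full" }));
}
```

With additive jitter every client still waits at least the full exponential delay and only the last 200 ms are spread; full jitter spreads clients across the whole window, at the cost of some very short waits.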