Celery retry cheatsheet

TL;DR

autoretry_for=(SomeError,) for the "obviously transient" cases.
retry_backoff=True, retry_backoff_max=600, retry_jitter=True to avoid thundering herds.
max_retries=N: always set it explicitly. Celery's default is 3, which is rarely the number you would pick on purpose.
Dead-letter handling (for example in Task.on_failure) for the cases retries can't save.

The pattern

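import requests
from celery import shared_task

from .models import Invoice  # assumption: Invoice is a Django model in this app


@shared_task(
    bind=True,
    autoretry_for=(requests.RequestException,),  # any failure raised by requests
    retry_backoff=True,      # exponential delays: up to 1s, 2s, 4s, ...
    retry_backoff_max=600,   # never wait longer than 10 minutes
    retry_jitter=True,       # randomize each delay so workers don't sync up
    max_retries=5,           # give up after five retries
)
def fetch_invoice(self, invoice_id: int):
    invoice = Invoice.objects.get(pk=invoice_id)
    response = requests.get(invoice.url, timeout=10)
    # HTTPError subclasses RequestException, so 4xx/5xx responses are retried too.
    response.raise_for_status()
    invoice.store_payload(response.content)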

Five lines of config kill 90% of the retry bugs I see in code review.
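
To make the backoff settings concrete, here is a minimal sketch of the delay arithmetic, assuming Celery computes it as "factor times 2 to the power of the retry count, capped at the maximum, then jittered down to a random value". The helper name backoff_delay is hypothetical and not part of Celery's API.

```python
import random


def backoff_delay(retries, factor=1, maximum=600, jitter=True):
    """Seconds to wait before retry number `retries` (0-based).

    Hypothetical helper that mirrors the assumed Celery behaviour;
    it is not imported from Celery.
    """
    # Exponential growth, capped at `maximum`.
    delay = min(maximum, factor * (2 ** retries))
    if jitter:
        # Full jitter: pick an integer uniformly from 0..delay inclusive.
        delay = random.randrange(delay + 1)
    return delay


# Upper bounds for max_retries=5, with jitter off so the caps are visible.
caps = [backoff_delay(n, jitter=False) for n in range(5)]
print(caps)       # [1, 2, 4, 8, 16]
print(sum(caps))  # 31
```

Under these assumptions the five delays add up to at most 31 seconds, so this schedule rides out blips rather than long outages. Passing an integer as retry_backoff instead of True uses it as the factor and stretches the whole schedule.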

The mistakes I keep seeing

Retrying on Exception. You'll mask real bugs. Be specific.
No jitter. Twenty workers waking up at the same second on the same backoff schedule is a self-DoS.
No explicit max_retries. Celery defaults to 3, but max_retries=None retries forever, and then tasks that can't succeed will pile up.
No dead-letter handling. When retries are exhausted, something has to notice. Log + alert + park the job somewhere a human can see.
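
The "log + alert + park" step from the last item could be sketched as a mixin whose on_failure method matches the signature of Celery's Task.on_failure hook. The names DeadLetterMixin and DEAD_LETTERS are hypothetical, and the in-memory list stands in for whatever table or queue a human would actually look at.

```python
import logging

logger = logging.getLogger(__name__)

# Hypothetical stand-in for a real dead-letter store (DB table, queue, ...).
DEAD_LETTERS = []


class DeadLetterMixin:
    """Mix into a celery.Task subclass to park permanently failed jobs.

    Assumed wiring (not executed here):
        class DeadLetterTask(DeadLetterMixin, celery.Task): ...
        @shared_task(base=DeadLetterTask, ...)
    """

    def on_failure(self, exc, task_id, args, kwargs, einfo):
        # Celery calls this when the task ends in FAILURE, which includes
        # the case where all retries have been used up.
        record = {
            "task": getattr(self, "name", type(self).__name__),
            "task_id": task_id,
            "args": args,
            "kwargs": kwargs,
            "error": repr(exc),
        }
        DEAD_LETTERS.append(record)  # park it where a human can see it
        logger.error(
            "Task %s[%s] failed permanently: %r",
            record["task"], task_id, exc,
        )
```

Alerting is left out of the sketch; in practice the logger.error call would feed whatever error tracker the project already uses.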

What "transient" really means

A retry is correct when the failure is independent of your input. Network blips, 503s, lock contention: retry. ValidationError, IntegrityError, KeyError: fix the code; retrying just delays the bug report.
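
One way to encode that split is to translate raw failures into two exception types and list only the transient one in autoretry_for. This is a sketch under assumptions: TransientError, PermanentError, TRANSIENT_STATUSES and check_status are hypothetical names, and which HTTP statuses count as transient is a per-service judgment call.

```python
class TransientError(Exception):
    """Worth retrying: the same request may succeed later."""


class PermanentError(Exception):
    """Not worth retrying: the request itself is wrong."""


# Assumption: gateway and availability errors are transient for this service.
TRANSIENT_STATUSES = frozenset({502, 503, 504})


def check_status(status_code):
    """Raise the matching error class for an HTTP status; return None if OK."""
    if status_code in TRANSIENT_STATUSES:
        raise TransientError(f"upstream returned {status_code}")
    if status_code >= 400:
        raise PermanentError(f"request rejected with {status_code}")
```

A task would then use autoretry_for=(TransientError,), so a 503 is retried while a 404 fails immediately and surfaces as a bug report.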
