A wallet provider in one region went into a maintenance window at 2am their time, which was the middle of our peak in another. They did not announce it. We found out because our success rate on that one method dropped to nothing while everything else stayed green. The first version of our system would have shrugged, returned a decline to every customer who picked that wallet, and let them walk. A meaningful share of them would have just closed the tab.
That night is why I stopped thinking of payment orchestration as routing.
The pitch for an orchestration layer is clean. You sit in front of dozens of payment methods (card networks, bank transfers, wallets, local rails, installment products) and you pick the best one for each transaction. Pick by cost. Pick by success probability. Pick by region. Engineers hear “pick” and build a router: a function that takes a request and returns a provider. That function is the easy ten percent. The other ninety percent is what happens after the provider you picked says no, or says nothing at all, and you have a customer staring at a spinner with their card already typed in.
Every method fails in its own dialect
Here is the part nobody tells you. There is no such thing as “the payment failed.” There are a dozen different failures wearing the same red error in your dashboard, and they want opposite responses.
A card soft decline (“insufficient funds,” “do not honor”) is the network telling you to maybe try again later, or to try a different funding source, but never to hammer the same card. Retry that one and you look like a fraudster to the issuer. A timeout is worse, because you genuinely do not know if the charge went through. A regional rail going dark is not a per-transaction problem at all. It is a provider-level outage, and the correct move is to stop sending it traffic entirely until it recovers. A wallet maintenance window looks identical to an outage from the outside, and you cannot wait for an announcement that may never come.
So the orchestration layer’s real job is to classify the failure, then decide whether this transaction deserves a retry, a failover to an alternate provider, or an honest decline. Treat all four the same and you either leave money on the table or you double-charge someone, and the second one will end up in a regulator’s inbox.
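A minimal sketch of what that classification step could look like. The Outcome fields, the decline codes, and the NEXT_STEP table are all hypothetical, chosen for illustration; every real provider reports failures in its own format and needs its own adapter in front of something like this.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Outcome:
    # Hypothetical normalized provider response; each real provider
    # needs its own adapter to produce something like this.
    ok: bool = False
    timed_out: bool = False
    http_status: Optional[int] = None
    decline_code: Optional[str] = None


# Illustrative codes only. Every provider and network has its own list.
SOFT_CODES = {"insufficient_funds", "do_not_honor"}

# What each verdict should trigger next.
NEXT_STEP = {
    "hard_decline": "decline",            # the bank means it: stop
    "soft_decline": "failover",           # another route may clear it
    "timeout": "reconcile_and_failover",  # maybe-charge: check out of band
    "outage": "failover",                 # and count it against the provider
}


def classify(outcome: Outcome) -> str:
    """Map a failed provider response onto one of four verdicts."""
    if outcome.timed_out:
        return "timeout"
    if outcome.http_status is not None and outcome.http_status >= 500:
        return "outage"
    if outcome.decline_code in SOFT_CODES:
        return "soft_decline"
    # Assumption: anything unrecognized is treated as a hard decline,
    # on the theory that declining is cheaper than hammering an issuer.
    return "hard_decline"
```

Treating unknown codes as hard declines is a conservative choice, not the only defensible one. The point of the sketch is only that the verdict, not the raw error, drives what happens next.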
The decision, simplified
This is the core of it, stripped down. The real one has more provider quirks than I can fit, but the shape is honest.
def route_payment(txn, providers):
    # providers come pre-ranked by a scorer: live success rate over the
    # last few minutes, cost, and whether this method even works in the
    # customer's region. The scorer is boring and it is the whole game.
    candidates = [p for p in rank(providers, txn)
                  if circuit[p.id].state != "open"]  # skip the dead ones

    for provider in candidates:
        if fraud_score(txn, provider) > provider.fraud_ceiling:
            continue  # this provider hates this txn shape, try the next

        # idempotency key is per (txn, provider). NOT per txn. if we fail
        # over to provider B, B needs its own key, or a late "success"
        # from A plus a charge on B is how you double-bill a customer.
        key = idem_key(txn.id, provider.id)
        outcome = provider.charge(txn, idempotency_key=key)
        if outcome.ok:
            return outcome

        verdict = classify(outcome)  # decline / timeout / outage
        circuit[provider.id].record(verdict)

        if verdict == "hard_decline":
            return outcome  # the customer's bank means it. stop.

        if verdict == "timeout":
            # we do NOT know if A charged. never silently retry a timeout
            # on the same provider. reconcile it out of band, move on.
            reconcile_later(txn, provider, key)
            continue

        # soft decline or provider outage: fall through to next candidate

    return declined(txn)  # we tried everyone worth trying. tell the truth.
The comment about idempotency keys is the most expensive lesson in that block. When you fail over, you are by definition sending the same logical payment to a second provider because the first one did not give you a clean answer. If the first provider was just slow and the charge actually landed, and your second provider also succeeds, you have charged the customer twice for one cart. The only thing standing between you and that incident is a key that is unique per provider attempt, plus an out-of-band reconciliation job that catches the orphaned “success” from the slow provider and voids one side. We learned to treat any timeout as a maybe-charge, never a no-charge. Optimism on a timeout is how refunds get expensive.
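To make the two halves of that concrete, here is a rough sketch of a per-attempt key and the decision the reconciliation job has to make. The Attempt record, the status strings, and the lookup_status callback are assumptions made up for this example. So is the policy of keeping the capture that was confirmed in-band and voiding the orphan; a real job might prefer the cheaper provider instead.

```python
import hashlib
from dataclasses import dataclass


def idem_key(txn_id: str, provider_id: str) -> str:
    """Deterministic key per (txn, provider) attempt, stable across retries."""
    raw = f"{txn_id}:{provider_id}".encode()
    return hashlib.sha256(raw).hexdigest()[:32]


@dataclass(frozen=True)
class Attempt:
    txn_id: str
    provider_id: str
    key: str
    status: str  # "captured", "declined", or "unknown" (it timed out)


def charges_to_void(attempts, lookup_status):
    """Return the attempts to void so each txn keeps at most one capture.

    lookup_status(attempt) is a hypothetical callback that asks the
    provider what happened to that idempotency key and returns
    "captured", "declined", or "unknown".
    """
    # Captures we already confirmed in-band go first, so they are kept.
    confirmed = [a for a in attempts if a.status == "captured"]
    # Timeouts that turn out to have charged are the orphaned successes.
    # Anything still "unknown" is skipped here and must be checked again.
    orphans = [a for a in attempts
               if a.status == "unknown" and lookup_status(a) == "captured"]

    kept, to_void = set(), []
    for attempt in confirmed + orphans:
        if attempt.txn_id in kept:
            to_void.append(attempt)  # second capture for the same cart
        else:
            kept.add(attempt.txn_id)
    return to_void
```

If provider A timed out and provider B then captured, and A later reports that it captured too, this returns A's attempt as the one to void. If A reports a decline, nothing is voided.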
Failover is a circuit breaker, not a loop
The naive failover is a for-loop over providers: try A, try B, try C. It works until A is not failing per-transaction but is simply down. Then every single payment pays the full timeout to A before it ever reaches B, and your checkout latency falls off a cliff at exactly the moment you have a queue of customers.
So failover at the provider level is a circuit breaker, not a retry loop. Each provider has a breaker that watches its recent verdicts. Enough timeouts or outage signals in a short window and the breaker opens, and that provider is pulled out of the candidate list entirely for everyone, not just the one unlucky customer who hit it. A trickle of probe traffic checks whether it has come back. When it has, the breaker closes and the provider rejoins the pool. That wallet maintenance window from the opening? The breaker had it isolated within a minute or two, and the routing scorer quietly sent that traffic to alternates while the provider was dark. Customers never saw the maintenance window. They just paid.
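A breaker along those lines can be sketched in a few dozen lines. This is an illustration under assumptions, not our production breaker: the thresholds are invented, the state names beyond "open" are mine, and it assumes every clean answer from a provider (including a decline) gets recorded, since a decline still proves the provider is alive.

```python
import time
from collections import deque

FAILURE_VERDICTS = ("timeout", "outage")


class Breaker:
    def __init__(self, threshold=5, window_s=60.0, cooldown_s=30.0,
                 clock=time.monotonic):
        self.threshold = threshold    # failures in the window that trip it
        self.window_s = window_s      # sliding window length, seconds
        self.cooldown_s = cooldown_s  # how long to stay fully open
        self.clock = clock            # injectable so it can be tested
        self.failures = deque()       # timestamps of recent failures
        self.opened_at = None         # None means the breaker is closed

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.cooldown_s:
            # Cooldown over: let probes through. This sketch does not cap
            # probe volume; a real one would admit only a trickle.
            return "half_open"
        return "open"

    def record(self, verdict):
        now = self.clock()
        state = self.state
        if verdict in FAILURE_VERDICTS:
            if state != "closed":
                self.opened_at = now  # probe failed: restart the cooldown
                return
            self.failures.append(now)
            # Drop failures that have aged out of the sliding window.
            while self.failures and now - self.failures[0] > self.window_s:
                self.failures.popleft()
            if len(self.failures) >= self.threshold:
                self.opened_at = now
        elif state == "half_open":
            # Any clean answer during probing means the provider is back.
            self.opened_at = None
            self.failures.clear()
```

Because the router filters on state != "open", a provider in the probing state rejoins the candidate list automatically, and one clean response closes the breaker.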
That is where the recovered drop-offs come from. Not from clever routing on the happy path, where any sane default works fine, but from the failure path: catching the soft declines that a different funding route would clear, isolating the dead providers before they tax everyone, and never giving up on a transaction that a second method would have taken. Across the methods we ran, smart retry and failover clawed back a meaningful share of payments that the first version would have dropped on the floor. The single biggest lever was not adding more payment methods. It was failing better through the ones we had.
Fraud scoring lives inside the path, not beside it
The temptation is to score fraud as a gate before routing: check the transaction, then route the clean ones. That is wrong, because fraud risk is not a property of the transaction alone. It is a property of the transaction and the method. The same checkout might be fine on a 3DS-authenticated card and a bad idea on a particular instant rail with weak chargeback recourse.
So the model scores inline, per candidate provider, in the same hot path as routing, on a tight latency budget because the customer is waiting. A transaction that scores too high for one provider’s risk tolerance does not get declined outright. It gets routed to a provider that can absorb that risk, or to one that forces a stronger authentication step. Fraud scoring becomes another input to failover, not a wall in front of it.
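As a sketch of that idea, the selection step might look something like this. The Provider fields, the two ceilings, and the score callback are hypothetical names for illustration; the only thing taken from the text is that the score is computed per candidate and that a high score can mean "route elsewhere or step up" rather than "decline".

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional, Tuple


@dataclass
class Provider:
    id: str
    fraud_ceiling: float            # highest score it takes as-is
    supports_step_up: bool = False  # can it force stronger authentication?
    step_up_ceiling: float = 0.0    # highest score it takes with step-up


def pick_by_risk(
    txn,
    ranked: Iterable[Provider],
    score: Callable[[object, Provider], float],
) -> Optional[Tuple[Provider, bool]]:
    """Return (provider, needs_step_up) for the first ranked provider
    that can absorb this transaction's risk, or None if nobody can."""
    for provider in ranked:
        s = score(txn, provider)  # risk depends on txn AND method
        if s <= provider.fraud_ceiling:
            return provider, False
        if provider.supports_step_up and s <= provider.step_up_ceiling:
            return provider, True  # acceptable only with stronger auth
    return None
```

A None here is the honest-decline case: every candidate either refused the risk outright or could not add enough authentication to accept it.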
What the diagram leaves out
The arrows make this look orderly. The orderly part is the easy part.
The diagram shows a request entering the router, the router consulting the scorer and the inline fraud check, an attempt on the chosen provider, and failover to alternates when the breaker says a provider is dead. What it cannot show is time. Every one of those arrows has a deadline, the customer is the clock, and the reconciliation job running underneath the whole thing is the only reason a failover never becomes a double charge.
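One way to put the clock into code is a single deadline for the whole checkout, with each attempt given whatever is left. This is a sketch with invented numbers: the per-attempt cap and the half-second floor are assumptions, not measured values.

```python
import time


class Deadline:
    """One time budget for the whole checkout, shared by every attempt."""

    def __init__(self, budget_s, clock=time.monotonic):
        self.clock = clock
        self.expires_at = clock() + budget_s

    def remaining(self):
        return max(0.0, self.expires_at - self.clock())


def attempt_timeout(deadline, per_attempt_cap_s, min_useful_s=0.5):
    """Timeout to hand the next provider call, or None if the time
    left is too short for an attempt to be worth starting."""
    left = deadline.remaining()
    if left < min_useful_s:
        return None
    return min(left, per_attempt_cap_s)
```

A router built this way stops failing over when attempt_timeout returns None, even if untried candidates remain, because a success the customer never waits to see has not recovered the payment.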
Build the router last. Build the failure taxonomy first. The methods will keep multiplying, every one of them will break in a way the others do not, and the system that survives is the one that treated “no” as the beginning of the work instead of the end of it.