Anomaly Detection for Cash Flow: Less Magic, More Plumbing

A finance lead once asked me to build “an AI that flags when the cash flow looks weird.” Reasonable ask. The weird part is that by the time you can answer it, you have done so much work that the AI is almost an afterthought.

I have built this on financial records spanning many institutions and sources: a payments business, a marketplace’s settlement ledger, bank feeds, processor reports, and the internal subledger. The pitch is a model that learns normal and screams when something deviates. What you actually spend your time on is deciding what a transaction even is when four systems describe the same money four different ways. The model is the last ten percent. The plumbing is the other ninety, and it is where the value hides.

You cannot detect an anomaly you cannot define

Here is the trap. “Anomaly detection” sounds like a modeling problem, so people reach for a modeling tool. Isolation forests, autoencoders, a fashionable transformer that eats time series. None of that helps if you cannot first say, with confidence, that these two lines from two systems are the same event.

Normal is not a statistical property you discover. It is a thing you construct by getting records to agree. Before any model runs, three boring problems have to be solved, and each one quietly eats a week.

Format. Every source speaks its own dialect. A processor sends you 2025-09-14T00:00:00+08:00 and an amount in minor units as a string. The bank sends a value date and a number with a thousands separator. The subledger sends a posting timestamp in UTC and a sign convention where a refund is positive. None of these are wrong. They just do not agree, and agreement is the product.
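
To make the format problem concrete, here is a minimal sketch of how an amount parser might collapse those dialects into one signed integer. The name parse_amount_minor, the already_minor flag, and the small exponent table are all assumptions for illustration, and it assumes a comma is a thousands separator rather than a decimal mark.

```python
from decimal import Decimal

# Hypothetical exponent table: decimal places per currency. Only a few
# entries are shown; a real one would cover every currency you settle in.
MINOR_EXPONENT = {"USD": 2, "EUR": 2, "SGD": 2, "JPY": 0, "KWD": 3}

def parse_amount_minor(text, currency, already_minor=False):
    """Return a signed integer count of minor units (cents for USD)."""
    # Assumption: "," is a thousands separator, never a decimal comma.
    value = Decimal(str(text).replace(",", "").strip())
    if not already_minor:
        # Major units (e.g. "1,234.50") get scaled up by the currency exponent.
        value = value * (Decimal(10) ** MINOR_EXPONENT[currency])
    if value != value.to_integral_value():
        # Refuse to round silently: leftover fractions mean the input is wrong.
        raise ValueError(f"{text!r} does not fit whole minor units of {currency}")
    return int(value)

processor = parse_amount_minor("123450", "USD", already_minor=True)
bank = parse_amount_minor("1,234.50", "USD")
```

Both calls land on the same integer, which is the whole job: after this step the processor string and the bank number can be compared with plain equality.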

Currency. The headline number is in the transaction currency, the settled number is post-conversion, and the rate that connects them is on a third report with its own rounding. If you compare the raw amounts you will flag every cross-border payment as an anomaly, which is not detection. It is noise with a confidence score.
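
A rough sketch of the comparison that does work: convert the transaction amount with the reported rate, then check the settled amount against that expectation within a small tolerance. The function names, the half-up rounding mode, and the sample rate are assumptions here, not anything a particular processor guarantees.

```python
from decimal import Decimal, ROUND_HALF_UP

def expected_settled_minor(txn_minor, rate, txn_exp=2, settle_exp=2):
    """Convert transaction-currency minor units into settlement minor units."""
    major = Decimal(txn_minor) / (Decimal(10) ** txn_exp)
    scaled = major * rate * (Decimal(10) ** settle_exp)
    # Assumption: the counterparty rounds half-up to the nearest minor unit.
    return int(scaled.quantize(Decimal("1"), rounding=ROUND_HALF_UP))

def fx_consistent(txn_minor, settled_minor, rate, tol_minor=2):
    """True when the settled amount is explained by the rate, within tolerance."""
    return abs(expected_settled_minor(txn_minor, rate) - settled_minor) <= tol_minor

rate = Decimal("1.0837")                    # hypothetical rate from the FX report
raw_gap = abs(10000 - 10836)                # comparing raw amounts: 836 "cents" apart
explained = fx_consistent(10000, 10836, rate)   # True: one cent off the expected 10837
```

The raw comparison sees a gap of 836 and calls it an anomaly. The rate-aware one sees a single cent of rounding and moves on.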

Timing. The customer pays Sunday night. The processor batches it Monday. The bank values it Tuesday. The subledger posts it whenever the close job runs. Same money, four timestamps, and if you match on exact time you match nothing.
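
Here is that Sunday-to-whenever payment as a small sketch. The four timestamps are made up for illustration, and same_event_window is a hypothetical helper, but the arithmetic shows why exact equality finds nothing while a window finds everything.

```python
from datetime import datetime, timezone
from itertools import combinations

def hours_between(a, b):
    """Absolute gap between two timezone-aware datetimes, in hours."""
    return abs((a - b).total_seconds()) / 3600

def same_event_window(stamps, window_hours=72):
    """True when every pair of timestamps sits within the window."""
    return all(hours_between(a, b) <= window_hours
               for a, b in combinations(stamps, 2))

# Hypothetical timestamps for one payment, all already in UTC.
paid = datetime(2025, 9, 14, 23, 30, tzinfo=timezone.utc)    # customer, Sunday night
batched = datetime(2025, 9, 15, 6, 0, tzinfo=timezone.utc)   # processor, Monday
valued = datetime(2025, 9, 16, 0, 0, tzinfo=timezone.utc)    # bank, Tuesday
posted = datetime(2025, 9, 16, 22, 15, tzinfo=timezone.utc)  # subledger close job
stamps = [paid, batched, valued, posted]

exact_pairs = sum(1 for a, b in combinations(stamps, 2) if a == b)  # 0
widest_gap = hours_between(paid, posted)                            # 46.75
```

No two timestamps are equal, yet the widest gap is under two days, so a 72 hour window treats all four as the same event.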

So the first real artifact is not a model. It is a normalizer.

def normalize(raw, source):
    # one shape to rule them all. every source feeds through here before
    # anything downstream is allowed to look at it. no exceptions, because
    # the one time you let a "quick" raw comparison through is the day
    # you ship a false positive storm at 6pm on a Friday.
    # to_minor_units and to_utc are small helpers assumed to live elsewhere.
    currency = raw.get("currency", "USD")   # resolve once so both uses below agree
    cents = to_minor_units(raw["amount"], currency)
    if source == "subledger" and raw.get("type") == "refund":
        cents = -abs(cents)                 # subledger signs refunds positive. we don't.
    return {
        "event_id": raw.get("ref") or raw.get("id"),
        "amount_cents": cents,              # always signed minor units, always settlement ccy
        "ccy": raw.get("settle_ccy", currency),
        "occurred_at": to_utc(raw["timestamp"], raw.get("tz", "UTC")),
        "source": source,
    }

Nothing here is clever. That is the point. The discipline is that every record, with no exceptions, goes through this funnel before anything else is allowed to touch it.
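
The normalizer leans on a to_utc helper that is not shown. One plausible shape for it, assuming timestamps arrive as ISO 8601 strings and that the tz argument is an IANA zone name used only when the string carries no offset of its own:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

def to_utc(timestamp, tz="UTC"):
    """Parse an ISO 8601 string and return a timezone-aware UTC datetime."""
    parsed = datetime.fromisoformat(timestamp)
    if parsed.tzinfo is None:
        # Assumption: a naive timestamp is local to the source's declared zone.
        zone = timezone.utc if tz == "UTC" else ZoneInfo(tz)
        parsed = parsed.replace(tzinfo=zone)
    return parsed.astimezone(timezone.utc)

# The processor timestamp from earlier: midnight at +08:00 is 16:00 UTC the day before.
processor_time = to_utc("2025-09-14T00:00:00+08:00")
```

An explicit offset in the string wins over the tz argument, so a source that is inconsistent about including offsets still lands on one timeline.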

Matching comes before the math

Once records are normalized, you match. This is the step the modeling literature skips entirely, and it is the step that decides whether your anomaly detector finds real breaks or hallucinates them.

A break in reconciliation is not “this number is unusual.” It is “this event exists on one side and not the other,” or “the same event has two different amounts depending on who you ask.” To find that, you group records that should be the same event and look at the disagreement inside the group.

The naive match is on a shared reference ID, and you should always try that first because when it works it is unambiguous. It does not always work. IDs get truncated, a manual journal entry has no upstream reference, a processor reuses an ID across a settlement boundary. So you need a fuzzy fallback, and the fuzzy fallback is where people get seduced into machine learning when a tolerance window does the job.

def match(records, amount_tol=2, time_tol_hours=72):
    # bucket by exact ref first. that's the clean path and most volume lands here.
    by_ref, leftovers = {}, []
    for r in records:
        if r["event_id"]:
            by_ref.setdefault(r["event_id"], []).append(r)
        else:
            leftovers.append(r)

    # a ref seen only once found no partner (truncated id, typo), so it gets
    # a second chance in the fuzzy pass instead of being flagged outright.
    groups = []
    for g in by_ref.values():
        if len(g) > 1:
            groups.append(g)
        else:
            leftovers.extend(g)

    # leftovers get the fuzzy treatment: same ccy, amount within a couple of
    # cents (fx rounding), occurred within a window that covers settlement lag.
    # this is a greedy nearest-match, NOT a model. a human has to defend every
    # break to an auditor, and "the timestamps were 41 hours apart" defends
    # itself. "the embedding said so" does not.
    # hours_between is a small helper assumed to live elsewhere.
    leftovers.sort(key=lambda r: (r["ccy"], r["amount_cents"], r["occurred_at"]))
    used = set()
    for i, a in enumerate(leftovers):
        if i in used:
            continue
        group = [a]
        for j in range(i + 1, len(leftovers)):
            b = leftovers[j]
            if j in used:
                continue
            # sorted by (ccy, amount), so nothing past this point can match a.
            if b["ccy"] != a["ccy"] or b["amount_cents"] - a["amount_cents"] > amount_tol:
                break
            if hours_between(a["occurred_at"], b["occurred_at"]) <= time_tol_hours:
                group.append(b)
                used.add(j)
        groups.append(group)
    return groups

That amount_tol of two cents is not laziness. It is the accumulated rounding of a currency conversion and a fee calculation, and if you set it to zero you turn arithmetic into an incident every single day. The tolerances are the model. They encode real knowledge about how this money moves, and they are legible to the human who has to act on the output.
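
To see where a stray cent comes from, here is a worked sketch with made-up numbers: the same payment netted two ways, once converting before the fee and once after. The rate, the fee percentage, and the assumption that each side rounds half-up at every step are all hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")

def round_cents(x):
    """Round to two decimal places, half-up (an assumed rounding convention)."""
    return x.quantize(CENT, rounding=ROUND_HALF_UP)

def net_convert_then_fee(amount, rate, fee_pct):
    converted = round_cents(amount * rate)     # convert first, round
    fee = round_cents(converted * fee_pct)     # fee on the converted figure, round
    return converted - fee

def net_fee_then_convert(amount, rate, fee_pct):
    fee = round_cents(amount * fee_pct)        # fee in the original currency, round
    return round_cents((amount - fee) * rate)  # convert the net, round

amount = Decimal("10.15")    # hypothetical payment
rate = Decimal("1.0837")     # hypothetical FX rate
fee_pct = Decimal("0.029")   # hypothetical 2.9% fee

a = net_convert_then_fee(amount, rate, fee_pct)   # 10.68
b = net_fee_then_convert(amount, rate, fee_pct)   # 10.69
drift_cents = int(abs(a - b) / CENT)              # 1
```

Neither system is wrong. They rounded at different points, and one cent of drift is the result. A tolerance of zero would report this as a break every time it happens.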

The scoring is deliberately dumb

Now, and only now, the part everyone wanted to start with. You have groups of records that should agree. The anomaly score is how badly they disagree, plus a few rules for the disagreements that are categorically bad.

def score(group):
    # a one-sided break (event on one system, missing on the other) is the
    # most expensive kind and the easiest to explain. lead with it.
    # count distinct sources, not records: two rows from the same feed are
    # still one-sided.
    sources = sorted({r["source"] for r in group})
    if len(sources) == 1:
        return {"score": 1.0, "reason": f"one-sided: only in {sources[0]}"}

    amounts = [r["amount_cents"] for r in group]
    spread = max(amounts) - min(amounts)
    if spread > 0:
        # scale by the largest amount so the score doesn't depend on record order.
        base = max(1, max(abs(a) for a in amounts))
        return {"score": min(1.0, spread / base),
                "reason": f"amount mismatch of {spread} cents across {', '.join(sources)}"}

    return {"score": 0.0, "reason": "matched clean"}

No isolation forest. No neural net. A one-sided break scores high because it is the expensive kind. An amount mismatch scores in proportion to how big the gap is relative to the transaction. Everything that ties out scores zero and is never seen by a human, which is the actual goal.

I have tried the exotic methods on this exact problem. An autoencoder trained on “normal” reconciliation will happily learn the seasonal shape of settlement and flag the legitimate month-end spike while waving through a quiet, deliberate skim that sits inside the learned distribution. Worse, when it does fire, it cannot tell you why. It hands a finance analyst a number between zero and one and a shrug.

That shrug is the whole problem, because the cost of a false positive here is not a wasted GPU cycle. It is a person, a senior one, opening four systems and chasing a discrepancy that was never real, losing an afternoon to a ghost the model invented. Do that a few times and they stop trusting the alerts, and an anomaly detector nobody trusts is worse than none, because now you are paying for false confidence.

What I would tell someone starting today

Spend your first month on ingestion and normalization and resist every urge to open a notebook and fit something. The day your sources reliably produce the same event in the same shape, your “model” can be three rules and a tolerance, and it will outperform anything fancier because it can defend itself.

Make every alert explainable in one sentence a non-engineer can read and act on. “Only in the processor feed, not the bank” is an alert someone can work. A score with no sentence attached is a thing they learn to ignore.
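
One way to enforce that rule in code is to refuse to emit an alert that has no sentence attached. This is a sketch only: to_alerts and its default threshold are hypothetical, and it assumes scored results shaped like the dicts the scoring step returns.

```python
def to_alerts(scored, threshold=0.5):
    """Turn scored groups into one-line alerts, highest score first.

    scored: list of dicts shaped like {"score": float, "reason": str}.
    threshold: hypothetical cutoff; anything below it never reaches a human.
    """
    kept = [s for s in scored if s["score"] >= threshold]
    kept.sort(key=lambda s: s["score"], reverse=True)
    alerts = []
    for s in kept:
        reason = (s.get("reason") or "").strip()
        if not reason:
            # A bare number is the shrug. Fail loudly rather than ship it.
            raise ValueError("refusing to emit an alert with no explanation")
        alerts.append(f"[{s['score']:.2f}] {reason}")
    return alerts

scored = [
    {"score": 0.0, "reason": "matched clean"},
    {"score": 0.75, "reason": "amount mismatch of 300 cents across bank, processor"},
    {"score": 1.0, "reason": "one-sided: only in processor"},
]
alerts = to_alerts(scored)
```

The clean match never shows up, the one-sided break leads, and every line that does reach the analyst reads as a sentence they can act on.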

And keep the threshold conservative early. You want the analyst’s first ten alerts to all be real, because trust is the only currency that matters and you spend it down fast. The fancy methods can come later, if they ever earn their seat. Mostly they do not, and the plumbing you were too impatient to build is the thing still catching the breaks a year later.