Your Recommendation Engine Doesn't Need Deep Learning (Yet)

A junior engineer once asked me, in a planning meeting, why our recommender wasn’t a neural network. We were serving hundreds of millions of recommendations a day across a marketplace at the time. The honest answer was that the thing doing most of that work was a SQL query and a counter. He looked like I had told him the moon landing was filmed in a parking lot.

That reaction is the whole problem. Somewhere around 2018 the field decided that “recommendations” and “deep learning” were the same word, and a generation of engineers learned the second one first. So they build the cathedral before the village. They stand up a two-tower model with embeddings and a feature store and a GPU serving fleet, to solve a problem that a co-occurrence table would have closed in a week, with less to operate and a result they could actually debug.

I want to walk the actual ladder we climbed, because the rungs matter. You earn the next one. You don’t start at the top.

Rung one: people who bought this also bought that

The first version of almost any recommender worth running is co-occurrence. Two items are related if the same people interact with both. That’s it. No model, no training loop, no embedding dimension to argue about in a design review. You count.

Here is the part nobody tells you. This boring counter, computed nightly over your event log, is good enough to ship and frequently good enough to beat the fancy thing for months. On a marketplace it gave us “frequently bought together” and “people who viewed this also viewed” that drove real revenue while the ML team was still drawing boxes on a whiteboard.

# Item-to-item co-occurrence from a session log.
# A "session" is one user's basket or browse run within a window.
# We are not modeling anything. We are counting things that happen near
# each other and trusting that signal more than we trust our own taste.

from collections import Counter, defaultdict
from itertools import combinations
import math

pair_counts = Counter()
item_counts = Counter()

for session in sessions:                     # sessions = list of item-id lists
    items = set(session)                     # dedupe within a session
    for it in items:
        item_counts[it] += 1
    for a, b in combinations(sorted(items), 2):
        pair_counts[(a, b)] += 1             # (a,b) ordered so we count once

def related(item, top_n=10, min_support=20):
    scored = []
    for other in item_counts:
        if other == item:
            continue
        key = (item, other) if item < other else (other, item)
        co = pair_counts.get(key, 0)
        if co < min_support:                 # kill long-tail noise early
            continue
        # lift over independence: P(a,b) / (P(a) * P(b)), with probabilities
        # estimated per session. raw co-counts just surface popular junk;
        # everything co-occurs with the best-seller. this normalizes it.
        lift = co * len(sessions) / (item_counts[item] * item_counts[other])
        scored.append((other, lift * math.log1p(co)))
    scored.sort(key=lambda x: x[1], reverse=True)
    return [it for it, _ in scored[:top_n]]

The min_support line is doing more than it looks. Lift on its own overrewards rare pairs: two obscure items that happened to share a single session get an enormous score, and the support floor throws those out. The lift term fixes the opposite disease. Ranked by raw co-counts, your top recommendations for everything are just your top-selling items, because the best-seller co-occurs with the entire catalog. Lift asks whether two items appear together more than chance would predict, not just whether they’re both popular. Get those two things right and a counter will embarrass a lot of models.
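
A toy log makes the difference between raw counts, lift, and a support floor concrete. This is a minimal sketch, not our production code: the sessions, the item names, and the helper names (co_count, lift, best) are all invented for illustration.

```python
from collections import Counter
from itertools import combinations

# Invented toy log. "cable" is the best-seller and shows up almost everywhere.
# "case" is the real companion of "phone". "sticker" appears exactly once.
sessions = (
    [["phone", "case", "cable"]] * 3
    + [["phone", "cable"]]
    + [["phone", "sticker"]]
    + [["case", "laptop"]]
    + [["laptop", "cable"]] * 4
    + [["mouse", "cable"]] * 4
)

item_counts = Counter()
pair_counts = Counter()
for session in sessions:
    items = sorted(set(session))                 # dedupe, fix pair order
    item_counts.update(items)
    pair_counts.update(combinations(items, 2))

def co_count(a, b):
    return pair_counts[tuple(sorted((a, b)))]

def lift(a, b):
    # P(a and b) / (P(a) * P(b)), probabilities estimated per session.
    return co_count(a, b) * len(sessions) / (item_counts[a] * item_counts[b])

def best(score, min_support=0):
    # Best single recommendation for "phone" under a given scoring function.
    candidates = [it for it in item_counts
                  if it != "phone" and co_count("phone", it) >= min_support]
    return max(candidates, key=score)

by_raw = best(lambda it: co_count("phone", it))               # "cable"
by_lift = best(lambda it: lift("phone", it))                  # "sticker"
by_lift_floored = best(lambda it: lift("phone", it), 2)       # "case"
```

Raw counts pick the best-seller. Lift alone picks the one-off pair. Lift with a support floor of 2 picks the item a person would have picked.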

(I have watched a team spend a quarter on a sequence model that lost an A/B test to this query. The query took an afternoon. Nobody put that in the postmortem.)

Rung two: collaborative filtering, still no neural net

Co-occurrence is item-to-item. The next rung is personalized: what should this user see, given everyone who looks like them. That’s collaborative filtering, and the version that carried us furthest was implicit-feedback matrix factorization. Off-the-shelf, on a CPU, retrained nightly from a sparse user-item matrix of clicks and purchases.

This is where most teams already have what they need and don’t know it. Implicit ALS on a few hundred million interactions runs fine on a beefy machine and produces user and item vectors you can serve with an approximate nearest-neighbor lookup. No GPUs. The matrix is sparse, the math is decades old, and the libraries are mature. We ran this for personalized home-feed ranking and it held the line against everything more elaborate we tried for a long stretch.
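
The serving side is smaller than people expect. Here is a minimal sketch, assuming the nightly job has already produced one vector per user and per item. The vectors, ids, and the recommend helper are made up, and the exact brute-force scan stands in for the approximate nearest-neighbor index you would use at real catalog sizes.

```python
import heapq

# Hypothetical factor vectors; real ones come out of the factorization job.
user_vecs = {"u1": [0.9, 0.1], "u2": [0.1, 0.8]}
item_vecs = {
    "tent": [1.0, 0.0],
    "stove": [0.8, 0.2],
    "novel": [0.0, 1.0],
    "lamp": [0.3, 0.6],
}
# Items each user already interacted with; we do not recommend them again.
seen = {"u1": {"tent"}, "u2": set()}

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def recommend(user_id, top_n=2):
    u = user_vecs[user_id]
    scored = (
        (dot(u, v), item)
        for item, v in item_vecs.items()
        if item not in seen[user_id]
    )
    # nlargest returns the best-scoring (score, item) pairs, highest first.
    return [item for _, item in heapq.nlargest(top_n, scored)]

print(recommend("u1"))   # ['stove', 'lamp']
print(recommend("u2"))   # ['novel', 'lamp']
```

The score is just a dot product between the user vector and each item vector, so swapping the scan for an index changes the lookup and nothing else.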

The scar: the cold-start hole is real and it is not subtle. A new item has no interactions, so factorization has nothing to factorize, and it never gets shown, so it never earns interactions. The flywheel spins for items it already knows and starves the new ones. We patched it the dumb way at first, by blending in the co-occurrence table and a recency boost for fresh inventory, and the dumb way worked well enough that I stopped apologizing for it.
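
The dumb patch looks roughly like this. It is a sketch under assumptions, not the blend we shipped: the weights, the seven-day half-life, and the function names are hypothetical, and both input scores are assumed to be normalized to the range 0 to 1.

```python
def recency_boost(age_days, half_life_days=7.0):
    # Exponential decay: 1.0 for a brand-new listing, 0.5 after one half-life.
    return 0.5 ** (age_days / half_life_days)

def blended_score(cf_score, co_score, age_days,
                  w_cf=0.6, w_co=0.3, w_new=0.1):
    # cf_score: collaborative-filtering score (0.0 for an item with no history)
    # co_score: co-occurrence score from the item-to-item table
    # Weights are illustrative and would need tuning against real traffic.
    return (w_cf * cf_score
            + w_co * co_score
            + w_new * recency_boost(age_days))

# An established listing with middling scores, a year old.
old_item = blended_score(cf_score=0.10, co_score=0.10, age_days=365)
# A listing created today: no CF signal yet, a little co-occurrence.
new_item = blended_score(cf_score=0.0, co_score=0.20, age_days=0)
```

Under pure factorization the new listing scores zero and never gets shown. With the blend it can outrank a stale listing for its first days, which is enough to let it start earning interactions.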

The exact rung where deep learning pays

So when did we actually build the neural ranker? Not when it got fashionable. When three pressures showed up at once and the simple stack genuinely ran out of room.

First, features. Co-occurrence and CF see one signal: who interacted with what. Then the business needed to rank on a pile of heterogeneous features at once (price sensitivity, seller quality, delivery time, time of day, category affinity, query text), and a linear blend of hand-weighted scores became a nightmare to maintain. Every new feature meant re-tuning weights by hand and breaking two others. A learned ranker eats those features for breakfast. That is the thing it is genuinely better at.

Second, cold start at scale. When the catalog churns fast and a large share of impressions are items with thin history, a content-aware model that can rank an item from its attributes before it has behavior is worth the cost. CF structurally cannot do this. A model with item features can.

Third, the objective got complicated. Early on we optimized clicks, and a counter optimizes clicks fine. Then the business wanted to rank on expected downstream value: click probability times conversion likelihood times margin, minus a return-risk penalty, with diversity constraints so the feed didn’t collapse into ten near-identical listings. That is a real ranking objective. Hand-tuned heuristics do not get you there. Learning-to-rank does.
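
To make that objective concrete, here is a minimal sketch of the scoring and a greedy diversity cap. Everything in it is an assumption for illustration: the Candidate fields, the flat return_cost, the per-category cap as the diversity rule, and the numbers. In a real system the probabilities come from learned models, which is exactly the part the simple stack cannot supply.

```python
from collections import namedtuple

Candidate = namedtuple(
    "Candidate",
    "item_id category p_click p_convert margin p_return",
)

def expected_value(c, return_cost=5.0):
    # click * conversion * margin, minus a penalty for expected returns.
    p_sale = c.p_click * c.p_convert
    return p_sale * c.margin - p_sale * c.p_return * return_cost

def rank(candidates, top_n=3, max_per_category=2):
    # Greedy pass in value order, skipping a category once it hits the cap.
    ranked, per_category = [], {}
    for c in sorted(candidates, key=expected_value, reverse=True):
        if per_category.get(c.category, 0) >= max_per_category:
            continue
        ranked.append(c)
        per_category[c.category] = per_category.get(c.category, 0) + 1
        if len(ranked) == top_n:
            break
    return ranked

candidates = [
    Candidate("a", "shoes", 0.10, 0.20, 30.0, 0.1),
    Candidate("b", "shoes", 0.10, 0.18, 30.0, 0.1),
    Candidate("c", "shoes", 0.10, 0.16, 30.0, 0.1),
    Candidate("d", "bags", 0.05, 0.20, 40.0, 0.0),
]

print([c.item_id for c in rank(candidates)])   # ['a', 'b', 'd']
```

Without the cap the top three are all shoes. With it, the third slot goes to the bag even though its expected value is lower, which is the trade the diversity constraint exists to make.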

Notice what is not on that list. “Our competitor uses deep learning.” “We hired someone who wants to.” “It’s 2023.” None of those are reasons. The reasons are features, cold start, and a ranking objective the simple stack physically cannot express. Until at least two of those bite, you are buying complexity you will pay to operate and cannot yet explain to the person whose feed got worse.

What I actually tell people

Build the counter first. Ship it. Measure it. It is your baseline forever, and a baseline you can compute in SQL is a baseline you can trust at 3am.

Add collaborative filtering when you need personalization, and run it on CPUs until the numbers, not the narrative, push you off them. Keep the co-occurrence table alongside it as the cold-start patch, because you will always need one.

Reach for the deep ranker when you have rich features that a linear blend can no longer hold, cold start you cannot solve any other way, and an objective that is more than one number. When that day comes, the model will earn every bit of its complexity, and you will know exactly why, because you will have spent months watching the simpler thing fall short in specific, nameable ways.

That knowledge is the real prize. The teams that skip the ladder build the neural net, watch it underperform a counter they never wrote, and have no idea why. The team that climbed knows precisely what the model is for. So when the new engineer asks why it isn’t a neural network, you have an answer better than “it should be.”