The first dashboard we ported went from a coffee break to a blink. A query that used to take the better part of a minute on the warehouse came back before I had finished reading the question. I remember thinking we had broken something. We had not. That was just ClickHouse doing the thing it does, and it is genuinely that fast when you feed it right.
Then I tried to update a row, and the honeymoon ended.
This is about an analytics workload I owned: high-cardinality event data, billions of rows arriving daily, dashboards and ad-hoc queries on top, and a warehouse bill that had stopped being funny. We moved the heaviest slices of it to ClickHouse. The wins were real. The query latency dropped through the floor and the storage and compute cost came down enough that finance asked me to explain it twice. But the thing nobody puts in the launch post is that ClickHouse pays you back only if you design for its grain. Treat it like Postgres with a faster engine and it will hurt you in ways that are hard to see until you are in production.
Mutations are not updates
Here is the part nobody tells you. In ClickHouse, an UPDATE is not an update. It is a mutation, an asynchronous, background rewrite of every data part that matches your WHERE clause. You issue ALTER TABLE ... UPDATE, you get a shrug, and somewhere a thread starts rewriting columns on disk. There is no transaction. There is no “this row, right now.” There is “this will be true eventually, after I have rewritten a lot of files you did not think you were touching.”
The first time a teammate ran a routine-looking correction on a wide table, the cluster spent the afternoon grinding through a mutation that touched far more data than the handful of rows he meant to fix. (He learned. We all learned.) If your workload has a steady stream of single-row updates, you are holding the tool by the blade. ClickHouse wants appends. Design so that corrections are new rows with a later timestamp, and let a ReplacingMergeTree or an aggregation at read time sort out which version wins. You stop fighting the engine the moment you stop asking it to mutate in place.
The ordering key decides everything
This is the lever. Get it right and queries skip almost all of your data before they read a byte. Get it wrong and ClickHouse scans far more than it should and you blame the hardware.
A MergeTree table sorts data on disk by its ORDER BY key and keeps a sparse index over it. That sparse index is the whole magic. When your query filters on a prefix of the ordering key, ClickHouse reads only the granules that can possibly match and ignores the rest. The ordering key is not a nicety you add later. It is the physical layout of the table.
CREATE TABLE events
(
event_date Date,
tenant_id UInt32,
event_type LowCardinality(String),
user_id UInt64,
properties String, -- raw JSON, parsed at query time, not at write
ingested_at DateTime
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(event_date)
-- ORDER BY is the index. Put the columns you filter on FIRST,
-- coarsest to finest. tenant_id leads because every query is
-- scoped to one tenant. event_type is low-cardinality so it
-- buys cheap skipping. user_id last because it is the needle.
ORDER BY (tenant_id, event_type, event_date, user_id);
Notice what is not here: no surrogate primary key, no per-row uniqueness, no foreign keys. Partitioning is by month, which keeps DROP PARTITION as a near-instant way to expire old data (dropping a partition is the cheapest delete you will ever run, and you should lean on it instead of issuing real deletes). The ordering key leads with tenant_id because that is the filter on the front of every single query we run. If I had led with event_date because that felt natural coming from a time-series mindset, every tenant-scoped query would scan across all tenants for that day and throw most of it away.
The query that exploits this looks almost boring, and that is the point.
-- This is fast because the WHERE prefix matches the ORDER BY prefix.
-- ClickHouse skips straight to this tenant's granules and never
-- looks at the other tenants' data at all.
SELECT
event_type,
count() AS n,
uniqExact(user_id) AS users
FROM events
WHERE tenant_id = 4218
AND event_date >= today() - 30
GROUP BY event_type
ORDER BY n DESC;
Run that on a well-ordered table and it touches a sliver of the data. Run the same logical query on a table ordered by event_date first and you will watch it read ten times as much to answer the same question. Same SQL. Same engine. The difference is entirely in the line nobody reviews carefully.
JOINs are a different mental model
Coming from a row store, you reach for JOINs by reflex. ClickHouse will do them, and it has gotten better at them, but the engine is built for one wide, denormalized table that you scan brutally fast, not for stitching five normal-form tables together at query time. The default JOIN loads the right-hand side into memory, so a join against anything large is a memory event, not a query.
What worked was denormalizing on the way in. Fold the dimensions you filter and group by into the event table at ingest, so that read time is a single scan with no stitching. When we did need to enrich against a smallish reference table, we kept it as a dictionary, which ClickHouse holds in memory and looks up with dictGet for the cost of a hash probe instead of a join. The mental shift is the hard part. You are trading the storage you save by normalizing for the speed you gain by not joining, and at this scale that trade is not close.
Inserts are eventually consistent, and that surprises people
You insert a batch. You immediately query for it. It is not there yet, or only some of it is. Nobody did anything wrong. ClickHouse writes inserts as parts and merges them in the background, and replication across nodes is asynchronous by default, so “I just wrote it” and “I can read it everywhere” are two different moments in time. For analytics this is almost always fine, because you are looking at aggregates over billions of rows and a few seconds of lag in the tail does not move the chart. For anything where a human writes a value and expects to see it back on the next screen, it is a trap. We learned to insert in fat batches on a steady cadence through a buffer, never row by row from application code, and to stop expecting read-your-write the instant the insert returned.
The ops learning curve is real
You also inherit a system that fails in its own dialect. Merges fall behind under heavy insert load and parts pile up until the cluster complains it has too many of them. Mutations queue up quietly and you only notice when one has been chewing for hours. Memory limits bite mid-query because a JOIN or a GROUP BY on high cardinality tried to materialize more than the box had. None of this is exotic once you have lived with it, but on day one it is a new vocabulary of failure, and your Postgres instincts will point you at the wrong cause every time. Budget for a person to actually learn the engine. The savings are real, but they are not free, and pretending the ops cost is zero is how teams turn a win into an incident.
So was it worth it
Yes, and it was not close. For the analytics shape, scan a lot, aggregate hard, filter on a known prefix, ClickHouse gave us speed the warehouse could not touch at a cost that made the migration pay for itself fast. But every bit of that win came from designing to its grain: append instead of update, denormalize instead of join, order the table by how you query it, and accept that consistency is eventual. The teams that get burned are the ones that lift their Postgres schema across unchanged and wait for the magic. The magic is conditional. It is all in the ORDER BY.