High-Cardinality Data: The Observability Superpower You're Probably Avoiding

Teams strip user_id, request_id, and tenant_id out of their telemetry to save money, then wonder why they can't debug production. Cardinality is not the enemy — it's the point.

"We can't add user_id to that metric. It would explode our cardinality."

I have heard that sentence in probably a hundred conversations. Every time, I want to say: the explosion is the product. The cardinality is the thing that would let you actually debug production. Stripping it out to save money is like buying a car and then removing the steering wheel because gas is expensive.

High cardinality is not a pathology of observability. It's the entire point of observability. The enemy isn't cardinality — it's a pricing model that punishes cardinality.

A short definition

Cardinality is the number of unique values a dimension can take. http_method has a cardinality of about nine (GET, POST, PUT, DELETE, and a handful of others). status_code has a few dozen values in practice. user_id has cardinality equal to your user count — millions, potentially.

"High cardinality" usually means any dimension whose possible values number in the thousands or more. Examples that scare people: user_id, request_id, tenant_id, session_id, feature_flag_combination, product_sku.

These are, not coincidentally, exactly the dimensions you need to slice on when debugging. "Is this slow for everyone or just the enterprise tenants?" requires tenant-level cardinality. "Does this error affect one user repeatedly or every user once?" requires user-level cardinality. "Which request got stuck?" requires request-level cardinality. Without these, you're reduced to aggregate guesses.

Why metrics hate cardinality

Classical metrics systems — Prometheus, StatsD, most vendor pre-aggregation products — are built on an implicit assumption: the number of unique time series is bounded and small. A time series is defined by the metric name plus the set of label/value combinations, and each time series costs memory and storage.

If you add a user_id label to a metric, a system with a million users now has a million time series per metric. Prometheus falls over. Your bill goes supernova. The vendor sends you a stern email. The natural response is to strip the high-cardinality labels, and the natural consequence is that you can no longer answer user-level questions.

The structural problem is that metrics pre-aggregate. They compute counts and summaries at ingestion time, which requires deciding in advance what dimensions you care about. "The next question you want to ask" is exactly the dimension you didn't include. You cannot add a dimension after the fact, because the raw data is gone.
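
To make the constraint concrete, here is a minimal sketch using Python's prometheus_client. The metric and label names are illustrative; the point is that the label set is frozen at instrumentation time.

from prometheus_client import Counter

# The dimensions are decided here, once, at instrumentation time. Any
# dimension not listed (user_id, tenant_id, ...) is aggregated away at
# ingestion and can never be recovered from the stored time series.
CHECKOUT_REQUESTS = Counter(
    "checkout_requests_total",
    "Checkout requests by endpoint, method, and status",
    ["endpoint", "method", "status"],
)

# Each distinct label combination is its own time series. Adding a
# user_id label would multiply the series count by your user count.
CHECKOUT_REQUESTS.labels(endpoint="/checkout", method="POST", status="200").inc()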

The alternative: events, not metrics

Wide events — a single structured record per meaningful unit of work, with every relevant dimension attached — solve the cardinality problem by not pre-aggregating at all.

A wide event for a checkout request might look like:

{
  "timestamp": "2026-01-14T18:03:22.471Z",
  "service": "checkout-api",
  "endpoint": "/checkout",
  "method": "POST",
  "duration_ms": 842,
  "status": 200,
  "user_id": "u_9a2c8f...",
  "tenant_id": "t_acme",
  "user_tier": "enterprise",
  "feature_flags": ["new-coupon-flow", "fraud-v3"],
  "cart_item_count": 7,
  "payment_method": "stripe_card",
  "region": "us-east-1",
  "db_queries": 14,
  "cache_hit_ratio": 0.71,
  "trace_id": "4bf9...",
  "span_id": "00f0..."
}

Notice what's there: every dimension you might ever want to filter, group by, or correlate on. Any of them can be arbitrarily high-cardinality; the event doesn't care. It's just a row in a column-oriented store.

Metrics become trivial aggregations over this data: the p99 of duration_ms grouped by endpoint and user_tier is just a query. You can ask it now. You can ask it later. You can ask a new version of it that didn't exist yesterday.
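
As a sketch of what that looks like in practice, here is the same aggregation as a query, assuming a hypothetical ClickHouse table named events that holds the wide events above (the clickhouse-connect client and the host are assumptions too):

import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost")

# "p99 of duration_ms by endpoint and user_tier" is a query, not a
# pre-declared metric. Change the GROUP BY tomorrow and ask a
# different question of the same raw events.
rows = client.query(
    """
    SELECT endpoint, user_tier,
           quantile(0.99)(duration_ms) AS p99_ms
    FROM events
    WHERE timestamp > now() - INTERVAL 1 HOUR
    GROUP BY endpoint, user_tier
    ORDER BY p99_ms DESC
    """
).result_rows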

The storage shape that makes this cheap

The reason wide events feel expensive in a metrics database and cheap in a columnar one is the storage model. Columnar stores — ClickHouse, BigQuery, Snowflake, the engines under Honeycomb and others — store each column separately, compress it aggressively (high-repetition columns compress to nothing), and read only the columns your query touches.

A wide event with 50 columns doesn't cost 50x what a narrow one with 5 columns costs. It costs proportionally more for the columns you query, and almost nothing extra for the rest. High-cardinality columns with useful entropy (like user_id) are the expensive ones. Low-cardinality columns (like status_code) compress to nearly nothing.
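
You can feel this with a toy experiment, using zlib as a stand-in for a real columnar codec: a repetitive status-like column all but vanishes under compression, while a column of random IDs shrinks far less.

import random
import zlib

n = 100_000
# Low cardinality, like status_code: a handful of repeating values.
status_col = b"".join(random.choice([b"200 ", b"404 ", b"500 "]) for _ in range(n))
# High cardinality, like user_id: unique random identifiers.
user_col = b"".join(b"u_%016x " % random.getrandbits(64) for _ in range(n))

print("status_code ratio:", len(zlib.compress(status_col)) / len(status_col))
print("user_id ratio:    ", len(zlib.compress(user_col)) / len(user_col))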

This is why the industry has been quietly moving toward "events into a columnar store" as the observability substrate — and why the old "three pillars" architecture feels increasingly anachronistic.

Sampling done right

The natural follow-up: if events are cheap but not free, how do you keep the volume sane? The answer is sampling — but not head-based random sampling, which destroys your ability to investigate the tail.

Tail-based sampling makes the keep/discard decision after the trace completes, so you can bias toward the interesting traces: all errors, all slow traces, a sample of normal traces. You keep the rare valuable stuff and sample down the routine stuff.

Dynamic sampling adjusts sample rates based on frequency: a high-volume endpoint might be sampled at 1%, while a low-volume admin endpoint is kept at 100%. The result is a dataset where the population of events looks balanced, not dominated by whatever endpoint happens to get the most traffic.
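
A minimal sketch of the idea, with illustrative names rather than any vendor's API: count traffic per key within a window, and set each key's rate so that roughly a fixed number of events per key survive.

from collections import Counter

class DynamicSampler:
    """Keep roughly `target` events per key per window, whatever the traffic.

    Hot keys (high-volume endpoints) end up heavily sampled; rare keys
    are kept at or near 100%.
    """

    def __init__(self, target: int = 100):
        self.target = target
        self.counts = Counter()

    def sample_rate(self, key: str) -> float:
        self.counts[key] += 1
        return min(1.0, self.target / self.counts[key])

    def new_window(self) -> None:
        # Call on a timer so rates adapt as traffic shifts.
        self.counts.clear()

Record the rate on every kept event so queries can weight each one by 1/rate and still report true counts.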

Both techniques require that you sample by trace, not by event. If you sample events independently, you'll have spans in your trace but not their parents, and your trace view becomes useless. Sample the decision at the root; apply it to every child.
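
Here is a hedged sketch of that rule: decide once per finished trace, always keep errors and slow traces, and hash the trace_id for the rest so every span of a trace, on every host, reaches the same verdict. The thresholds are placeholders.

import hashlib

def keep_trace(trace_id: str, has_error: bool, duration_ms: float,
               base_rate: float = 0.01) -> bool:
    # The interesting tail is kept unconditionally.
    if has_error or duration_ms > 1_000:
        return True
    # Deterministic hash of the trace_id: the same trace always lands
    # in the same bucket, so the keep/discard decision is consistent
    # across every span and every process.
    bucket = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16) % 10_000
    return bucket < int(base_rate * 10_000)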

"But metrics aren't going away"

They aren't. There are legitimate use cases for pre-aggregated metrics: dashboards at a glance, high-frequency counters that don't need per-event detail, capacity planning signals. The argument is not "never use metrics." The argument is:

  • Use wide events as the primary store. That's where you go to debug, investigate, ask questions you didn't foresee.
  • Derive metrics from events when you need pre-aggregation for dashboards or alerts. The metric is a view, not a primary data source (see the sketch after this list).
  • Preserve cardinality where it matters. On your wide events, keep every useful dimension. Strip only when the dimension has no investigative value.
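
For the second point, a sketch of a derived metric, reusing the hypothetical events table from earlier: a small job recomputes a dashboard gauge from the event store on a timer, so the metric is literally a view over the events.

import clickhouse_connect
from prometheus_client import Gauge

# Derived, not primary: if this gauge ever looks wrong, the raw events
# are still there to investigate.
ERROR_RATE = Gauge("checkout_error_rate_5m", "5xx ratio over the last 5 minutes")

def refresh() -> None:
    client = clickhouse_connect.get_client(host="localhost")
    rate = client.query(
        "SELECT countIf(status >= 500) / count() FROM events"
        " WHERE timestamp > now() - INTERVAL 5 MINUTE"
    ).result_rows[0][0]
    ERROR_RATE.set(rate)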

The cost conversation

The hard reality: this costs money. Keeping every user ID on every event for 30 days is not free. But neither is a four-hour incident that you couldn't debug because the user ID wasn't in the metric.

The argument I've found effective with finance: the cost of high-cardinality observability is knowable and invoiced. The cost of not having it is borne in engineer-hours, customer churn, and missed SLOs — costs that don't show up on a line item but dwarf the observability bill. Frame the conversation as insurance, not infrastructure.

What to do this quarter

  1. Audit one service's telemetry for stripped cardinality. Look at the last three incidents and ask: "What dimension would I have wanted to filter on, and why isn't it in our data?" That's your missing cardinality. Add it back.
  2. Instrument one flow with wide events. Pick your highest-value user flow. Emit one event per request with every relevant dimension (see the sketch after this list). Route it to a columnar store — even a small ClickHouse instance is enough to start. Feel the difference in debuggability.
  3. Move one alert from metric-based to event-based. Take your noisiest or least actionable alert and rewrite it as a query against the event store. The richer data almost always produces a sharper alert.
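
For step 2, the emitter can start as simple as the sketch below: one JSON line per request, carrying whatever fields the handler already knows, with your collector shipping the lines to the columnar store. Field names mirror the example event above; the stdout sink is an assumption.

import json
import sys
from datetime import datetime, timezone

def emit_wide_event(sink=sys.stdout, **fields) -> None:
    # One structured record per unit of work; every dimension the
    # handler knows about goes in, regardless of cardinality.
    event = {"timestamp": datetime.now(timezone.utc).isoformat(), **fields}
    sink.write(json.dumps(event) + "\n")

emit_wide_event(
    service="checkout-api", endpoint="/checkout", method="POST",
    duration_ms=842, status=200, user_id="u_9a2c8f", tenant_id="t_acme",
)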

High cardinality is a feature. Stop apologizing for it. The systems you operate are complex and the people who operate them deserve tools that let them ask real questions — not tools that force them to pick five dimensions in advance and pray.

#observability #cardinality #events #cost
