Hands-On: Adding ClickHouse as an Observability Backend for High-Cardinality Metrics
Hands-on tutorial: model high-cardinality telemetry in ClickHouse, implement rollups & TTL, and connect to Grafana for scalable observability.
Why your current metrics pipeline breaks at scale, and how ClickHouse fixes it
If you’re battling exploding label cardinality, runaway storage bills, and slow Grafana dashboards, you’re not alone. Modern distributed systems generate millions of unique metric series every hour. Traditional TSDBs tuned for low-cardinality Prometheus use-cases either struggle with scale or force lossy aggregation. In 2026, many engineering teams are moving telemetry into ClickHouse to get cost-efficient OLAP performance for high-cardinality observability pipelines. This hands-on guide shows how to model metrics for scale, implement retention and downsampling, and connect ClickHouse to Grafana — with code, configs, and practical tuning tips you can apply today.
The 2026 context: why ClickHouse for observability now?
In late 2025 and early 2026, ClickHouse continued its rapid adoption in analytics and observability workloads. Large funding rounds and ecosystem growth increased engineering resources and integrations (Grafana plugins, Prometheus adapters, Vector sinks, and more). The result: production-ready patterns for high-cardinality telemetry are mature enough to recommend in critical pipelines.
Key trend: Teams are centralizing high-cardinality metrics in ClickHouse to cut storage costs, accelerate complex queries, and support advanced rollups without losing the ability to rehydrate raw samples when needed.
Overview: The architecture we’ll build
- Ingest raw metric points (labels + value + ts) into a raw events table.
- Materialize minute and hourly rollups for fast dashboards using materialized views + AggregatingMergeTree.
- Use TTL rules and storage policies to move old data to cold volumes or delete it.
- Expose data to Grafana using the ClickHouse datasource and efficient SQL patterns.
1) Data model: designing for high-cardinality
High-cardinality comes from label combinations. A core principle: store a compact, query-friendly identifier for a unique metric series, and keep labels normalized or compressed to reduce storage and speed lookups.
Schema decisions
- metric_id (UInt64): deterministic hash of metric name + sorted labels. Use sipHash64 or cityHash64 to produce a stable id.
- metric (LowCardinality(String)): the metric name — low cardinality by nature.
- labels (Nested or Map): store labels in a Nested or Map column; apply LowCardinality to label values that repeat frequently.
- ts (DateTime64): timestamp of the sample (millisecond precision).
- value (Float64): sample value. For counters, store raw increments or both value + counter metadata.
Example table: raw events
CREATE TABLE metrics_raw (
metric_id UInt64,
metric LowCardinality(String),
labels Nested(key String, value String),
ts DateTime64(3, 'UTC'),
value Float64
)
ENGINE = MergeTree()
PARTITION BY toYYYYMMDD(ts)
ORDER BY (metric_id, ts)
SETTINGS index_granularity = 8192;
Notes: ORDER BY (metric_id, ts) groups series physically and makes per-series time-range reads efficient. index_granularity controls index density; 8192 is a reasonable starting point for busy clusters.
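To make the layout concrete, here is a minimal sample insert, assuming the collector has already computed metric_id from the metric name and sorted labels; the metric name and label values are purely illustrative. With the default flatten_nested setting, the Nested column is written as two parallel arrays.
-- Illustrative single-row insert; real pipelines batch thousands of rows per request.
-- Expressions in VALUES rely on input_format_values_interpret_expressions (on by default).
INSERT INTO metrics_raw (metric_id, metric, `labels.key`, `labels.value`, ts, value)
VALUES (sipHash64('http_requests_total', 'env=prod::region=eu-west-1'),
        'http_requests_total',
        ['env', 'region'],
        ['prod', 'eu-west-1'],
        now64(3),
        1.0);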
2) Ingest pipeline patterns
Two common ingestion paths:
- Prometheus remote_write adapter → ClickHouse (adapter projects matured in 2025)
- Log/metrics router (Vector/Fluent Bit) → ClickHouse HTTP or native protocol
Example: Vector sink config
[sinks.clickhouse]
type = "clickhouse"
inputs = ["prom_receiver"]
endpoint = "https://clickhouse.example.internal:8443/"
database = "observability"
table = "metrics_raw"
compression = "lz4"
batch.timeout_secs = 1
batch.max_bytes = 16_000_000
Batching, compression and TLS are essential. Modern sinks support transforming label sets into arrays or map columns before writing.
Prometheus remote_write adapter
If you rely on Prometheus scrapers, use a compatible remote_write adapter that converts samples to ClickHouse inserts or to a queue that Vector consumes. This keeps alerting and PromQL workflows intact while centralizing long-term storage.
3) Downsampling and rollups: raw → 1m → 1h
Keep raw data for short-term troubleshooting (e.g., 7–14 days), and maintain preaggregated rollups for longer retention. Materialized views are the standard pattern:
Why AggregatingMergeTree?
AggregatingMergeTree stores intermediate aggregation states compactly (sumState, minState, maxState, countState). Reading those states back with the matching -Merge combinators (sumMerge, countMerge, ...) or finalizeAggregation() produces final aggregated values without reprocessing raw rows.
Create a 1-minute rollup
CREATE TABLE metrics_1m (
metric_id UInt64,
metric LowCardinality(String),
labels Nested(key String, value String),
minute DateTime64(3, 'UTC'),
sum_value AggregateFunction(sum, Float64),
cnt AggregateFunction(count)
) ENGINE = AggregatingMergeTree()
PARTITION BY toYYYYMM(minute)
ORDER BY (metric_id, minute);
CREATE MATERIALIZED VIEW mv_metrics_1m
TO metrics_1m
AS
SELECT
sipHash64(metric, arrayStringConcat(arrayMap((k, v) -> concat(k, '=', v), labels.key, labels.value), '::')) AS metric_id,
metric,
labels.key,
labels.value,
toStartOfMinute(ts) AS minute,
sumState(value) AS sum_value,
countState() AS cnt
FROM metrics_raw
GROUP BY metric_id, metric, labels.key, labels.value, minute;
Repeat the pattern for hourly rollups from the 1m table to reduce CPU during aggregation.
Hourly rollup from 1m -> 1h
CREATE TABLE metrics_1h (
metric_id UInt64,
metric LowCardinality(String),
labels Nested(key String, value String),
hour DateTime64(3, 'UTC'),
sum_value AggregateFunction(sum, Float64),
cnt AggregateFunction(count)
) ENGINE = AggregatingMergeTree()
PARTITION BY toYYYYMM(hour)
ORDER BY (metric_id, hour);
CREATE MATERIALIZED VIEW mv_metrics_1h
TO metrics_1h
AS
SELECT
metric_id,
metric,
labels.key,
labels.value,
toStartOfHour(minute) AS hour,
sumMergeState(sum_value) AS sum_value,
countMergeState(cnt) AS cnt
FROM metrics_1m
GROUP BY metric_id, metric, labels.key, labels.value, hour;
Query the rollups with the matching -Merge combinators, sumMerge(sum_value) and countMerge(cnt), to get numeric values, or apply finalizeAggregation() to a state column.
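For example, a minimal ad-hoc read of the 1-minute rollup (the metric name is illustrative) looks like this:
-- Average value per minute for one metric over the last 6 hours, finalized from states.
SELECT
minute,
sumMerge(sum_value) / countMerge(cnt) AS avg_value
FROM metrics_1m
WHERE metric = 'http_request_duration_seconds'
AND minute >= now() - INTERVAL 6 HOUR
GROUP BY minute
ORDER BY minute;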
4) Retention and tiered storage
Modern ClickHouse supports storage policies across local disks and cloud object stores. Use TTL rules to move older partitions to colder volumes or delete them.
Example: TTL with move-to-volume and delete
ALTER TABLE metrics_raw
MODIFY TTL
ts + INTERVAL 7 DAY TO VOLUME 'cold',
ts + INTERVAL 90 DAY DELETE;
Explanation:
- New inserts land on the hot volume (the first volume in the table's storage policy); keep raw data there for 7 days for fast debugging.
- After 7 days, parts move to the cold volume (S3 or other object storage).
- Delete after 90 days to enforce cost controls.
Use per-table TTLs on rollup tables with longer horizons (e.g., metrics_1m 90 days, metrics_1h 2 years).
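A sketch of those rollup TTLs, assuming the horizons above; the cast to DateTime keeps the TTL expression simple.
-- Longer retention on the pre-aggregated tables; adjust horizons per environment.
ALTER TABLE metrics_1m MODIFY TTL toDateTime(minute) + INTERVAL 90 DAY DELETE;
ALTER TABLE metrics_1h MODIFY TTL toDateTime(hour) + INTERVAL 2 YEAR DELETE;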
5) Query patterns for Grafana and dashboards
Design queries to scan the smallest rollup that satisfies resolution. Grafana panels usually want a (time, value, metric) triplet per series. Use rollups to return fewer rows and avoid scanning raw data for long ranges.
Grafana ClickHouse query (panel expecting time-series)
SELECT
toUnixTimestamp(toStartOfMinute(minute)) AS time_sec,
sumMerge(sum_value) / countMerge(cnt) AS value,
metric
FROM metrics_1m
WHERE metric = 'http_request_duration_seconds'
AND minute BETWEEN toDateTime(intDiv($__from, 1000)) AND toDateTime(intDiv($__to, 1000))
GROUP BY time_sec, metric
ORDER BY time_sec;
Use Grafana macros where possible ($__from/$__to are epoch milliseconds; the ClickHouse datasource also provides time-filter macros). For long-range queries (>30d), switch to metrics_1h for efficiency.
6) Cardinality control techniques (practical recipes)
High-cardinality label explosion is the main problem. Apply these recipes; a cardinality-audit query follows the list:
- Normalize static labels: move 'service', 'env', 'region' to separate low-cardinality columns or dictionary tables.
- Hash high-cardinality composites: compute metric_id = sipHash64(metric + sorted_labels) to avoid storing combinatorial label strings repeatedly.
- Bucket rare label values: map rare values to "other" or use topN dictionaries for UI selection.
- Use LowCardinality() for label values that repeat often (language, os, region).
- External dictionaries: use ClickHouse dictionaries to map id→metadata and keep the main table narrow and fast.
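A minimal sketch of that audit, assuming the metrics_raw schema above: find the metrics contributing the most distinct series over the last day.
-- Top cardinality offenders by distinct series count.
SELECT
metric,
uniqExact(metric_id) AS series_count
FROM metrics_raw
WHERE ts > now() - INTERVAL 1 DAY
GROUP BY metric
ORDER BY series_count DESC
LIMIT 20;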
7) Query tuning and operational tips
Small changes have big effects on cost and latency.
- Filter on metric_id (the leading ORDER BY column) whenever possible so reads stay within a few granules per series; see the sketch after this list.
- Avoid SELECT *. Project only the columns you need.
- Use LIMIT (and SAMPLE, if you define a sampling key) for exploratory queries to avoid full table scans.
- Index granularity: a larger index_granularity shrinks the primary index (less memory) but scans more rows per matching granule; a smaller value makes narrow per-series reads more selective.
- Use distributed tables with shard-local reads and prefer local reads when possible when scaling out to a cluster.
- Monitor merges: long-running merges can cause temporary CPU and I/O spikes. Tune max_bytes_to_merge_at_max_space_in_pool and related merge settings as part of your operational playbook.
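A sketch of a per-series range read that follows these rules; the literal metric name and label string are hypothetical and must match however your collector builds metric_id.
-- Point lookup on the leading ORDER BY column, narrow projection, bounded time range.
SELECT ts, value
FROM metrics_raw
WHERE metric_id = sipHash64('http_requests_total', 'env=prod::region=eu-west-1')
AND ts >= now() - INTERVAL 1 HOUR
ORDER BY ts
LIMIT 10000;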
8) Handling counters and monotonic metrics
Counters need special aggregation: compute per-series increases instead of naive sums. Two approaches:
- Compute deltas at ingest time in your collector (preferred): emit increases as separate samples.
- Store monotonic raw samples and derive increases during rollup using argMax or windowed difference functions.
Example: delta in rollup
-- Compute per-minute counter increases from raw samples (counter resets counted as 0).
SELECT
metric_id,
toStartOfMinute(ts) AS minute,
sum(if(value >= prev_value, value - prev_value, 0)) AS increases
FROM
(
SELECT
metric_id,
ts,
value,
lagInFrame(value) OVER (PARTITION BY metric_id ORDER BY ts ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS prev_value,
row_number() OVER (PARTITION BY metric_id ORDER BY ts) AS rn
FROM metrics_raw
)
WHERE rn > 1
GROUP BY metric_id, minute;
Window functions can be expensive. If you can compute increases upstream, the rollup is far cheaper.
9) Observability best practices and operational playbook
- Track ingestion lag, write latency, partition sizes, and merge activity with dedicated internal dashboards (a starter query against system.parts follows this list).
- Alert on increasing cardinality trends and sudden influxes of new label keys.
- Automate retention policy changes by environment (e.g., dev = 7d raw, prod = 30d raw).
- Test rollup correctness in staging with replayed traffic before applying to prod.
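A minimal sketch for the partition-size part of that dashboard, using the standard system.parts table and the database/table names from this guide:
-- Active part counts and on-disk size per partition for the raw table.
SELECT
partition,
count() AS active_parts,
sum(rows) AS rows,
formatReadableSize(sum(bytes_on_disk)) AS size_on_disk
FROM system.parts
WHERE database = 'observability' AND table = 'metrics_raw' AND active
GROUP BY partition
ORDER BY partition DESC;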
10) Example end-to-end checklist (practical)
- Design metric_id scheme and label normalization rules.
- Create raw table with ORDER BY (metric_id, ts) and appropriate partitioning.
- Deploy ingestion (Vector or remote_write adapter) with batching and compression.
- Create 1m and 1h rollups via materialized views using AggregatingMergeTree.
- Apply TTLs to move raw → cold → delete and longer TTLs on rollups.
- Hook Grafana to the ClickHouse datasource and point dashboards to the rollups.
- Monitor cardinality and tune index_granularity and merge settings.
Advanced: projection and pre-aggregation strategies (2026)
In 2025–2026, ClickHouse projections and native pre-aggregation capabilities have become more robust. Consider projections for commonly used GROUP BY time/metric queries to substantially reduce query CPU. Use projections alongside materialized views when workload patterns are stable.
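As a sketch (the projection name is illustrative), a projection on the raw table that pre-aggregates per-metric minute sums can look like the following; materializing it rewrites existing parts, so run it off-peak.
-- Aggregate projection serving GROUP BY metric, minute queries directly from the raw table.
ALTER TABLE metrics_raw
ADD PROJECTION p_metric_minute
(
SELECT
metric,
toStartOfMinute(ts),
sum(value),
count()
GROUP BY metric, toStartOfMinute(ts)
);
ALTER TABLE metrics_raw MATERIALIZE PROJECTION p_metric_minute;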
Real-world example: cost savings and results
Teams adopting this architecture in 2025–2026 reported:
- 40–70% reduction in storage cost vs storing raw samples only in a cloud TSDB (after tiering + rollups).
- Queries over 30d ranges running 3–10x faster when routed to preaggregated tables.
- Ability to run ad-hoc cardinality analysis using ClickHouse's fast aggregate primitives.
Common pitfalls and how to avoid them
- Never assume labels won’t grow: build label governance and automated mapping rules.
- Don’t use naive JSON blob columns for labels — they’re inefficient for query filters.
- Avoid doing expensive windowed computations on raw tables at query time; push aggregation into rollups or upstream collectors.
Wrapping up: when to choose ClickHouse vs dedicated TSDB
Choose ClickHouse when you need:
- Fast ad-hoc analytics across millions of series
- Cost-effective long-term storage with tiered volumes
- Flexible, SQL-driven rollups and joins with other observability data
Consider a dedicated, purpose-built TSDB (or a hybrid approach) if you need native PromQL with extreme write throughput and minimal operational complexity — but in 2026, ClickHouse is a compelling, production-proven choice for high-cardinality observability when paired with the patterns above.
Actionable takeaways (start here)
- Implement metric_id hashing and schema normalization in your collector this week.
- Spin up a small ClickHouse cluster or cloud instance and create the metrics_raw and rollup tables above.
- Pipe a subset of traffic through Vector or a Prometheus adapter to validate ingestion and rollups.
- Connect Grafana with the ClickHouse plugin and point a dashboard at your 1m rollup for fast iteration.
Further reading and references (2025–2026)
- ClickHouse docs: MergeTree engines, TTL and storage policies (check official docs for latest 2026 features)
- Grafana ClickHouse datasource plugin (improvements in 2025–2026)
- Vector sink docs for ClickHouse and Prometheus remote_write adapters
Call to action
If you want a ready-made reference repo and Terraform module that implements the exact pipeline in this guide (ClickHouse infra, Vector configs, materialized views, Grafana dashboards), download the sample project or book a 30-minute design review with our engineers. We’ll review your labels, suggest an optimal metric_id scheme, and size the ClickHouse cluster for your cardinality and compliance targets.
Start the migration — get the repo, run the reference architecture in a staging account, and measure the wins (cost, latency, cardinality controls) within a week.