Integrating ClickHouse With Monitoring Stacks: A Step-by-Step Tutorial
Hands-on guide to ingest metrics into ClickHouse, design time-series schema, optimize queries, and build Grafana dashboards for observability teams.
Why observability teams are moving metrics into OLAP in 2026
If you manage monitoring for multi-cloud systems, you already feel the pain: exploding metrics cardinality, slow dashboards, and storage bills that surprise Finance every quarter. Over the last 18 months (late 2024–early 2026), production teams have shifted from purpose-built TSDBs to OLAP engines like ClickHouse to regain control, investing in careful schema design and pre-aggregation in exchange for orders-of-magnitude improvements in query speed and cost efficiency. ClickHouse's continued growth (including a major funding round reported in early 2026) reflects that momentum and tooling maturity.
What you'll get from this tutorial
This is a hands-on, step-by-step integration guide for observability teams. You will learn how to:
- Design high-performance time-series table schemas for metrics in ClickHouse
- Ingest metrics from Prometheus, Vector/Telegraf, and Kafka into ClickHouse
- Create roll-ups and use materialized views to downsample and accelerate dashboards
- Build Grafana visualizations optimized for ClickHouse queries
- Apply query and storage optimizations, retention and TTL and tiered storage strategies, and cardinality controls
Overview: Reference architecture
A common production architecture in 2026 uses a lightweight collector (Prometheus/Vector/Telegraf) -> message bus (Kafka optional) -> ClickHouse. Grafana queries ClickHouse directly via the ClickHouse datasource plugin. Materialized views and aggregate tables do the heavy lifting for dashboards.
Diagram (logical): Collector -> Kafka -> ClickHouse (Kafka engine & Materialized Views -> MergeTree storage) -> Grafana
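If you want a local sandbox for the rest of the tutorial, here is a minimal Docker Compose sketch (image tags, ports, and the Grafana plugin ID are assumptions to adjust for your environment; Kafka is omitted for brevity):
services:
  clickhouse:
    image: clickhouse/clickhouse-server:latest
    ports:
      - "8123:8123"   # HTTP interface used for inserts and the Grafana datasource
      - "9000:9000"   # native protocol for clickhouse-client
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_INSTALL_PLUGINS=grafana-clickhouse-datasource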
Step 1 — Schema patterns: raw, aggregated, and dimension tables
Good schema design is the single most important factor for speed and cost. Use a tiered schema strategy:
- Raw events table (high ingest rate, short retention) — store original metrics with skinny columns and LowCardinality tags.
- Downsampled aggregates (per-minute/per-5m) — used by dashboards and alerts; long retention.
- Dimension/lookup dictionaries — normalize tag keys and values to lower cardinality and speed joins.
Example raw metrics table
Key design choices: PARTITION by month (or by week for higher write volume), ORDER BY a composite primary key to allow efficient range queries and prewhere filtering, and use LowCardinality(String) for repeated tag values.
CREATE TABLE metrics_raw (
    timestamp DateTime64(3, 'UTC'),
    metric_name String,
    value Float64,
    host LowCardinality(String),
    service LowCardinality(String),
    tags Nested(key LowCardinality(String), value String)
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (metric_name, host, timestamp)
TTL timestamp + INTERVAL 30 DAY
SETTINGS index_granularity = 8192;
Why Nested for tags?
Nested is useful when you want to preserve label sets without exploding columns. For high-cardinality label keys you should still normalize with dictionaries or limit indexed labels to the most important ones.
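For reference, filtering on a Nested tag uses the parallel tags.key / tags.value arrays; a quick sketch (the 'region' label is illustrative):
-- Match rows whose tag set contains region=us-east-1
SELECT count()
FROM metrics_raw
WHERE metric_name = 'http_requests_total'
  AND tags.value[indexOf(tags.key, 'region')] = 'us-east-1';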
Aggregated table (per-minute)
Create the target table first (the materialized view's TO clause requires it to exist), and use SimpleAggregateFunction columns so AggregatingMergeTree keeps merging partial roll-ups correctly. Store a sum/count pair instead of a pre-computed average: averaging averages goes wrong once rows are merged, so compute the average at query time as sum_value / samples.
CREATE TABLE metrics_minute (
    ts DateTime,
    metric_name String,
    host LowCardinality(String),
    sum_value SimpleAggregateFunction(sum, Float64),
    max_value SimpleAggregateFunction(max, Float64),
    min_value SimpleAggregateFunction(min, Float64),
    samples SimpleAggregateFunction(sum, UInt64)
) ENGINE = AggregatingMergeTree()
PARTITION BY toYYYYMM(ts)
ORDER BY (metric_name, host, ts);
CREATE MATERIALIZED VIEW mv_metrics_minute
TO metrics_minute
AS
SELECT
    toStartOfMinute(timestamp) AS ts,
    metric_name,
    host,
    sum(value) AS sum_value,
    max(value) AS max_value,
    min(value) AS min_value,
    count() AS samples
FROM metrics_raw
GROUP BY ts, metric_name, host;
Step 2 — Ingesting metrics into ClickHouse
Choose the ingress path that fits your environment. Below are three common pipelines with concrete examples.
Option A: Prometheus -> ClickHouse (remote_write adapter)
Use an adapter that accepts Prometheus remote_write and writes to ClickHouse (community adapters matured in 2025–2026). The adapter typically buffers and converts TSDB samples to JSONEachRow or inserts into Kafka.
# Example: prometheus.yml (remote_write)
remote_write:
  - url: "http://prom-clickhouse-adapter:9201/api/v1/prom/write"
    remote_timeout: 30s
Adapter configuration (example) — the adapter batches samples and writes them to ClickHouse's HTTP interface in JSONEachRow format:
adapter:
  clickhouse:
    url: "http://clickhouse:8123/"
    database: metrics_db
    table: metrics_raw
    format: JSONEachRow
    batch_size: 10000
Option B: Vector / Telegraf -> HTTP Insert
Vector and Telegraf both have stable ClickHouse sinks (Vector also ships a native clickhouse sink). The example below uses Vector's generic HTTP sink to keep the insert format explicit: Vector scrapes Prometheus endpoints, remaps the samples into the metrics_raw shape, and delivers JSONEachRow over HTTP with backpressure.
# Vector config example (transform + sink)
[sources.prom]
type = "prometheus_scrape"
endpoints = ["http://localhost:9100/metrics"]

[transforms.to_clickhouse]
type = "remap"
inputs = ["prom"]
source = '''
# Illustrative remap: reshape scraped samples into the metrics_raw columns
.timestamp = to_timestamp(.metrics[0].time)
.metric_name = .metrics[0].name
.value = .metrics[0].value
...
'''

[sinks.clickhouse]
type = "http"
inputs = ["to_clickhouse"]
uri = "http://clickhouse:8123/?query=INSERT%20INTO%20metrics_raw%20FORMAT%20JSONEachRow"
encoding.codec = "json"
Option C: Kafka -> ClickHouse (high throughput)
For massive scale, publish metrics to Kafka and use ClickHouse's Kafka engine with a materialized view that writes into a MergeTree table. This pattern decouples producers and lets ClickHouse read at its own pace.
-- Declare the Kafka table columns to match the JSON message fields;
-- kafka_format = 'JSONEachRow' parses them directly.
CREATE TABLE kafka_metrics
(
    timestamp String,
    metric_name String,
    value Float64,
    host String,
    service String
) ENGINE = Kafka SETTINGS
    kafka_broker_list = 'kafka1:9092',
    kafka_topic_list = 'metrics',
    kafka_group_name = 'ch_metrics_group',
    kafka_format = 'JSONEachRow';
CREATE MATERIALIZED VIEW mv_kafka_to_raw TO metrics_raw AS
SELECT
    parseDateTimeBestEffort(timestamp) AS timestamp,
    metric_name,
    value,
    host,
    service
FROM kafka_metrics;
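For a quick end-to-end check, you can publish a sample message with the Kafka console producer (assuming the Kafka CLI tools are on your path; the script name and flags vary slightly across distributions, and the JSON fields must match the kafka_metrics columns):
echo '{"timestamp":"2026-01-17 12:00:00","metric_name":"cpu_usage","value":12.3,"host":"host-1","service":"api"}' | \
  kafka-console-producer.sh --bootstrap-server kafka1:9092 --topic metrics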
Step 3 — Grafana: ClickHouse datasource and example dashboards
Grafana's ClickHouse datasource plugin (2024–2026 improvements) supports time-series and table panels. Use $__timeFilter and $__timeGroup macros to keep queries performant in dashboards.
Basic time-series panel query
SELECT
    toStartOfMinute(ts) AS time,
    sum(sum_value) / sum(samples) AS value,
    metric_name
FROM metrics_minute
WHERE metric_name = 'http_requests_total'
    AND $__timeFilter(ts)
GROUP BY time, metric_name
ORDER BY time
Use $__timeGroup(ts, '1m') if you prefer Grafana's macro to handle grouping dynamically. We also recommend pointing panels that cover long time ranges at the pre-aggregated tables (metrics_minute and coarser roll-ups) rather than metrics_raw.
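For table panels (for example a top-N hosts view), the same pre-aggregated table works well; a sketch using the macros above:
SELECT
    host,
    max(max_value) AS peak_value,
    sum(sum_value) / sum(samples) AS avg_value
FROM metrics_minute
WHERE metric_name = 'cpu_usage'
    AND $__timeFilter(ts)
GROUP BY host
ORDER BY peak_value DESC
LIMIT 10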
Percentiles and latency heatmaps
For latency histograms, store histogram buckets or use ClickHouse's approximate quantile aggregates (quantiles, quantilesTDigest) on raw values. Example:
SELECT
    toStartOfMinute(timestamp) AS time,
    quantiles(0.5, 0.9, 0.99)(value) AS p
FROM metrics_raw
WHERE metric_name = 'request_latency_ms' AND $__timeFilter(timestamp)
GROUP BY time
ORDER BY time
Step 4 — Performance and query optimization
ClickHouse is fast when schema and queries are aligned. These pragmatic optimizations are essential for observability workloads.
- ORDER BY matters: put your most-filtered columns first (metric_name, host, timestamp). See our lightweight stack audit advice when aligning your query patterns to storage.
- Use PREWHERE: Prewhere filters reduce IO by discarding blocks before decompressing columns (good for metric_name filtering).
- LowCardinality: Use LowCardinality(String) for tag-like columns to shrink indexes and speed comparisons.
- Skip indexes: Add set or bloom_filter data-skipping indexes for searches on high-cardinality tags (see the ALTER examples after this list).
- Compression codecs: Apply ZSTD(level) or LZ4 to high-cardinality text; use DoubleDelta for timestamps and Gorilla for float values. For secure, long-term tiering and policies see the zero-trust storage playbook.
- Materialized views & rollups: AggregatingMergeTree stores aggregate states efficiently and massively reduces runtime computation for dashboards.
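A minimal sketch of both levers applied to metrics_raw (the index name, bloom-filter parameters, and codec choices are illustrative starting points, not tuned recommendations):
-- Data-skipping index for tag-style lookups (shown on host; apply to whichever label column you search most)
ALTER TABLE metrics_raw ADD INDEX idx_host host TYPE bloom_filter(0.01) GRANULARITY 4;
-- Build the index for parts that already exist (new parts pick it up automatically)
ALTER TABLE metrics_raw MATERIALIZE INDEX idx_host;
-- Codec tuning: DoubleDelta compresses near-monotonic timestamps well, Gorilla suits float samples
ALTER TABLE metrics_raw MODIFY COLUMN timestamp DateTime64(3, 'UTC') CODEC(DoubleDelta, ZSTD(1));
ALTER TABLE metrics_raw MODIFY COLUMN value Float64 CODEC(Gorilla, ZSTD(1));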
Example: PREWHERE + time grouping
SELECT toStartOfMinute(timestamp) AS time, avg(value) AS value
FROM metrics_raw
PREWHERE metric_name = 'cpu_usage'
WHERE timestamp BETWEEN toDateTime(1633046400) AND toDateTime(1633132799)
GROUP BY time
ORDER BY time
Step 5 — Controlling cardinality and cost (FinOps for metrics)
In 2026, observability teams must balance fidelity and cost. Implement these tactics to keep storage manageable:
- Client-side sampling: Drop or sample infrequent labels at ingest.
- Label whitelisting: Only persist a curated list of label keys; push the rest into a separate events table for ad-hoc forensics.
- Downsampling rules: Keep raw data for N days, minute aggregates for M months, and hourly aggregates for long-term retention (see the hourly roll-up sketch below).
- Use TTL and tiered storage: ClickHouse supports moving old parts to object storage (S3) and dropping after TTL to save hot storage cost. See the storage playbook for policy ideas.
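The hourly tier follows the same pattern as the minute roll-up; a sketch (the metrics_hour name is illustrative):
CREATE TABLE metrics_hour (
    ts DateTime,
    metric_name String,
    host LowCardinality(String),
    sum_value SimpleAggregateFunction(sum, Float64),
    max_value SimpleAggregateFunction(max, Float64),
    min_value SimpleAggregateFunction(min, Float64),
    samples SimpleAggregateFunction(sum, UInt64)
) ENGINE = AggregatingMergeTree()
PARTITION BY toYYYYMM(ts)
ORDER BY (metric_name, host, ts);
-- Fed from metrics_raw at insert time, like the minute view
CREATE MATERIALIZED VIEW mv_metrics_hour
TO metrics_hour
AS
SELECT
    toStartOfHour(timestamp) AS ts,
    metric_name,
    host,
    sum(value) AS sum_value,
    max(value) AS max_value,
    min(value) AS min_value,
    count() AS samples
FROM metrics_raw
GROUP BY ts, metric_name, host;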
Example TTL + tiered storage
ALTER TABLE metrics_minute
MODIFY TTL
ts + INTERVAL 90 DAY TO VOLUME 'cold',
ts + INTERVAL 365 DAY DELETE;
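TO VOLUME 'cold' only works if the server has a storage policy that defines a 'cold' volume (for example an S3-backed disk). You can check what is configured from the system tables before relying on the move:
-- List configured disks and the volumes each storage policy exposes
SELECT name, path, free_space, total_space FROM system.disks;
SELECT policy_name, volume_name, disks FROM system.storage_policies;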
Step 6 — Advanced strategies for observability
These techniques are used by mature teams to handle high-cardinality telemetry and build reliable dashboards and alerts.
- Dictionary tables - use ClickHouse dictionary sources for mapping service IDs to metadata without joining large tables (see the sketch after this list).
- AggregatingMergeTree + aggregate function states - compute and store aggregate states to merge later, reducing recompute cost.
- Approximate data structures - use uniq or uniqCombined (HyperLogLog-based) for distinct counts and quantile or quantileTDigest for percentiles.
- Backfill pipelines - use Kafka + ClickHouse to replay missed windows without blocking writes; local-first replay appliances and tools can simplify field backfills.
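A minimal dictionary sketch, assuming a hypothetical service_dim lookup table (service to team/tier metadata) on the same server:
-- Hypothetical lookup table with service metadata
CREATE TABLE service_dim (
    service String,
    team String,
    tier String
) ENGINE = MergeTree()
ORDER BY service;
-- In-memory dictionary refreshed every 5-10 minutes
CREATE DICTIONARY service_metadata (
    service String,
    team String,
    tier String
)
PRIMARY KEY service
SOURCE(CLICKHOUSE(TABLE 'service_dim'))
LAYOUT(COMPLEX_KEY_HASHED())
LIFETIME(MIN 300 MAX 600);
-- Enrich metrics without joining a large table
SELECT
    dictGet('service_metadata', 'team', tuple(service)) AS team,
    avg(value) AS avg_cpu
FROM metrics_raw
PREWHERE metric_name = 'cpu_usage'
GROUP BY team;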
Security, access control, and operational notes
Production observability data is sensitive. Use TLS for HTTP inserts, enable RBAC, and isolate ClickHouse clusters for metrics workloads. Monitor ClickHouse itself (query queue, memory pressure, merges) with the system tables. For storage and access governance guidance, consult the Zero-Trust Storage Playbook.
-- Monitor long running queries
SELECT * FROM system.processes WHERE elapsed > 60;
-- Check merges backlog
SELECT * FROM system.merges ORDER BY create_time DESC LIMIT 10;
Operational checklist before production rollout
- Load-test ingest using realistic cardinality and query patterns (use metricbench or custom generator).
- Define retention and downsampling SLAs with FinOps and SRE teams.
- Provision cluster sizing with headroom: CPU-heavy for query patterns and IO/SSD optimized for merges.
- Set up alerting on ClickHouse health: merges backlog, replica lag, disk pressure (example queries after this checklist).
- Document runbooks for replays, schema migrations, and TTL changes.
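A few starting-point health queries against ClickHouse's system tables (the thresholds are illustrative; tune them to your cluster):
-- Replication lag per table (replicated setups only)
SELECT database, table, absolute_delay
FROM system.replicas
WHERE absolute_delay > 60;
-- Disk headroom on every configured disk
SELECT name, formatReadableSize(free_space) AS free, formatReadableSize(total_space) AS total
FROM system.disks;
-- Merge backlog: many long-running merges indicate write or IO pressure
SELECT count() AS active_merges, max(elapsed) AS longest_merge_seconds
FROM system.merges;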
Real-world notes & 2026 trends
In late 2025 and early 2026, enterprises increasingly use OLAP for observability to consolidate metrics, logs, and traces into a single analytical store. ClickHouse’s rapid adoption and investment (reported in January 2026) reflect maturity in ecosystem plugins and adapters — making integrations with Prometheus, Grafana, Kafka, and vectorized collectors more reliable. Expect more managed ClickHouse offerings and tighter Grafana plugin features throughout 2026. See our market note on why 2026 expectations are driving adoption.
Troubleshooting common issues
- Slow dashboard queries: confirm queries use an aggregated table or materialized view; add PREWHERE filters and check ORDER BY alignment.
- High disk usage: evaluate retention rules and compression codecs; use TTL to move old parts to cold storage.
- Write latency spikes: monitor merges backlog and tune index_granularity; consider increasing hardware or adding shards.
- Cardinality explosion: implement label whitelisting and client-side sampling — tactics covered in our observability & cost control playbook; the audit query below helps spot offenders.
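A quick cardinality audit over the raw table (the 1-day window and LIMIT are arbitrary):
-- Which metrics carry the most distinct host/service combinations?
SELECT
    metric_name,
    uniqCombined(host) AS hosts,
    uniqCombined(service) AS services,
    count() AS samples
FROM metrics_raw
WHERE timestamp > now() - INTERVAL 1 DAY
GROUP BY metric_name
ORDER BY hosts DESC
LIMIT 20;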
Example: End-to-end snippet (Prometheus -> Adapter -> ClickHouse -> Grafana)
Minimal curl insertion for testing. JSONEachRow expects one JSON object per line (newline-delimited, not a JSON array), with timestamp and tag fields.
curl -sS -u default:password -H "Content-Type: application/json" \
  -X POST 'http://clickhouse:8123/?query=INSERT%20INTO%20metrics_raw%20FORMAT%20JSONEachRow' \
  -d '{"timestamp":"2026-01-17 12:00:00","metric_name":"cpu_usage","value":12.3,"host":"host-1","service":"api"}'
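To verify the test insert landed, query the raw table (for example with clickhouse-client or another HTTP call):
SELECT timestamp, metric_name, value, host, service
FROM metrics_raw
ORDER BY timestamp DESC
LIMIT 5;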
Actionable takeaways
- Design for queries first: choose ORDER BY and partitions to match dashboard filters.
- Use a layered schema: raw, minute aggregates, hourly aggregates to control cost and speed.
- Pick the right ingestion path: Prometheus adapter for simplicity, Vector for flexibility, Kafka for scale.
- Optimize queries: PREWHERE, LowCardinality, skip indexes, and materialized views are your primary levers.
- Plan retention and FinOps: TTL, tiered storage, and sampling keep costs predictable. For quick cost-control wins, see our one-page stack audit.
Further reading & references
- ClickHouse documentation (official) for MergeTree engines, TTL, and dictionaries.
- Industry coverage: ClickHouse's market momentum and funding were widely reported in early 2026, reflecting broader enterprise adoption for analytics and observability (source: Dina Bass / Bloomberg, Jan 2026).
Call to action
Ready to prototype ClickHouse as your observability engine? Start with a 2-week pilot: deploy a ClickHouse test cluster (single node), point a Prometheus scrape job or Vector sink at it, create the raw and minute aggregate tables above, and build 3 core Grafana dashboards (SRE, service owner, FinOps). If you want, download our checklist and tooling repo to accelerate the pilot — or contact the team for a hands-on workshop to migrate one dashboard from your current TSDB to ClickHouse.
Related Reading
- Observability & Cost Control for Content Platforms: A 2026 Playbook
- The Zero-Trust Storage Playbook for 2026
- Strip the Fat: One-Page Stack Audit to Kill Underused Tools
- Make Your Self-Hosted Messaging Future‑Proof
- Why 2026 Could Outperform Expectations