Grafana Loki Deep Intuition
An experienced engineer's guide to Grafana Loki
1. One-Sentence Essence
Loki is a log database that throws away the inverted index and replaces it with two ideas: a tiny metadata index over log streams, and brute-force grep across compressed chunks in object storage.
That sentence is the whole game. Everything else (labels, cardinality, chunks, ingesters, the brute-force scan) is a consequence of that one decision.
2. The Problem It Solved
Before Loki, if you wanted to centralize logs at any meaningful scale, the realistic answer was Elasticsearch (the “E” in ELK). Elasticsearch is a search engine that happened to be used for logs. When a log line came in, it was tokenized, every word was indexed into an inverted index, and the line and its index were stored across a cluster of stateful data nodes running the JVM. This was fast for arbitrary text search (“show me every log containing the word connection_refused across the last 30 days”), and it was expensive in every direction that matters: storage (the index could be larger than the raw logs), memory (Elasticsearch lives in the JVM heap), CPU (segment merges), and operational toil (shard management, rebalancing, hot/warm/cold tiers, careful upgrades). At any company past a certain size, the logging bill rivalled the database bill, and the team running Elasticsearch was somebody’s full-time job.
The people who would go on to build Loki were running Prometheus at Grafana Labs and noticed something. With Prometheus, you don’t search metrics by their values — you search by their labels. A query like http_requests_total{method="POST", status="500"} is a filter on metadata, not a search over the data. It is fast because the only thing you have to index is a small set of label key-value pairs per time series. And, critically, you almost never want to search the actual numeric values of metrics — you want to find the series, then aggregate or visualize them.
The insight: most log queries are the same shape. When you’re debugging in production, you almost never start with “find every log that contains the word X across everything.” You start with “I know it’s the payments service in the prod cluster — show me its logs.” The label-style filter does 90% of the narrowing for free. Once you’ve narrowed to the right pile of logs, a grep through them is fast enough.
So the design decision: index the labels, not the content. Compress the content. Throw it in cheap object storage (S3, GCS, Azure Blob). Use the tiny label index to find the right chunks, then brute-force scan them. The storage is dirt cheap, the index stays small, the operational complexity collapses, and the query model fits the real use case.
Two more forces shaped Loki: it was built for Kubernetes, where logs are ephemeral by default and structured labels (namespace, pod, container, cluster) are already there for the taking; and it was built by a metrics company, so it inherited the Prometheus mental model deliberately: labels, a PromQL-shaped query language (LogQL), the same agents and patterns. Loki is what happens when you ask “what does Prometheus for logs look like?”
3. The Concepts You Need
Loki has its own vocabulary. You can’t reason about it without these.
The data layer
Log stream. A log stream is a unique combination of label key-value pairs. {namespace="prod", app="payments", container="api"} is one stream. Add a fourth label that takes two values (say level="info" and level="error") and that one stream becomes two. Streams are the unit of everything in Loki: they are what gets routed, what gets stored, what gets queried first. The number of streams is the single most important number in a Loki deployment.
Label. A key-value pair attached to log lines at ingestion time. Labels form the index. The right labels are static, low-cardinality descriptors of the source of logs — cluster, namespace, app, region, environment. The wrong labels are dynamic, high-cardinality, or per-event values — user_id, request_id, trace_id, ip_address. Confusing these two is the most common Loki failure mode.
Cardinality. The total number of unique value combinations across all your labels. If you have cluster (10 values), namespace (50 values), app (200 values), and level (4 values), your worst-case cardinality is 10×50×200×4 = 400,000 streams. Add a user_id label and you’ve just added millions. High cardinality kills Loki — large index, tiny chunks, OOM’d ingesters, slow queries, the works.
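The arithmetic above is easy to reproduce. The following Python sketch multiplies the per-label value counts from the example. The user_id count of one million is a made-up figure for illustration, and the product is an upper bound: real deployments rarely produce every combination.

```python
from math import prod

# Distinct values per label, taken from the example in the text.
label_values = {"cluster": 10, "namespace": 50, "app": 200, "level": 4}


def worst_case_streams(values_per_label):
    """Upper bound on stream count: the product of per-label value counts."""
    return prod(values_per_label.values())


baseline = worst_case_streams(label_values)
print(baseline)  # 400000

# Hypothetical: a user_id label with one million distinct values.
with_user_id = worst_case_streams({**label_values, "user_id": 1_000_000})
print(with_user_id)  # 400000000000
```

The point of the sketch is the shape of the growth: every new label multiplies the bound rather than adding to it.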
Chunk. The unit of storage. A chunk holds compressed log lines for a single stream over a bounded time range (typically up to ~30 minutes or until ~1.5 MB compressed). When a chunk is “full” (by size or age) it gets flushed to object storage as an immutable blob. Chunks are per-stream, which is why high cardinality is so painful: more streams means more, smaller chunks; each chunk is a separate object in S3; query latency goes up; storage costs go up; everything degrades.
Index. A small data structure mapping {labels} → list of chunk references for a given time range. In modern Loki (TSDB arrived in 2.8; releases from 2.0 onward used boltdb-shipper) the index is a TSDB file (borrowed from Prometheus) that lives in object storage alongside the chunks. There is no separate index database. The TSDB index is tiny compared to the chunks, typically <1% of the raw data size.
Structured metadata. Per-log-line metadata (introduced experimentally in Loki 2.9 and enabled by default in 3.0) that is not indexed but is attached to each line and queryable. This is the right home for high-cardinality data that you genuinely need to query on: trace IDs, request IDs, span IDs. It lives inside the chunk, not in the index. It is the deliberate escape hatch for “I have something cardinality-explosive that I still need to query.”
The components
Distributor. The HTTP write endpoint. Receives push requests, validates them (label format, line size, timestamp window, rate limits), hashes each stream’s labels, and forwards the lines to the right ingesters. Stateless. Scales horizontally.
Ingester. The stateful heart of the write path. Holds active streams in memory, builds chunks, and flushes them to object storage. Backed by a Write-Ahead Log (WAL) on local disk for crash recovery. The ingester is where data lives between being received and being persisted — unflushed data only exists on ingesters, which is why ingesters are usually replicated (replication factor 2 or 3).
Querier. Executes LogQL queries. Pulls the relevant chunks from object storage (or its cache), decompresses them, and brute-force scans the log lines. Also queries ingesters directly for in-memory data that hasn’t been flushed yet. Stateless. Scales horizontally.
Query frontend. Sits in front of the queriers. Splits incoming queries into many smaller time-range subqueries (e.g. a 24-hour query becomes 96 parallel 15-minute subqueries), caches results, and stitches the subquery responses back together. The single biggest lever for query performance in a non-trivial Loki cluster.
Query scheduler. A separate queue between the query frontend and the queriers, used in larger deployments. Implements per-tenant fairness — one tenant’s expensive query can’t starve everyone else.
Index gateway. Caches the TSDB index files from object storage and serves index lookups to queriers over gRPC. Without it, every querier downloads its own copy of the index, multiplying S3 GET requests and disk usage by the number of queriers.
Compactor. Merges the per-ingester index files in object storage into one tidy index per day per tenant. Also enforces log retention (deletes expired chunks) and processes deletion requests. Usually a singleton.
Ruler. Evaluates recording rules and alerting rules expressed in LogQL. Effectively a cron-driven query runner that emits Prometheus metrics or fires alerts.
Bloom planner / builder / gateway (experimental). Builds bloom filters over structured metadata to accelerate “needle in a haystack” queries. Only worth it at very high ingest volumes (Grafana cites 75 TB/month as the threshold).
The deployment shape
Targets. Loki is built as a single binary that behaves differently depending on -target=. Possible targets: all (run everything), distributor, ingester, querier, read, write, backend, etc. This is how the same binary becomes a monolith or a microservice cluster.
Deployment modes. Three official modes: monolithic (-target=all, one process does everything, fine up to ~20 GB/day), simple scalable deployment / SSD (three roles — read/write/backend — being deprecated before Loki 4.0), and distributed microservices (each component as its own deployment, recommended for >1 TB/day). As of Loki 3.x, the official recommendation is monolithic for small setups and microservices for everything else.
Hash ring. The membership protocol that lets distributors find ingesters. Uses Memberlist (gossip) by default; can also use Consul or etcd. Each ingester registers tokens (random 32-bit numbers) into the ring; the distributor hashes a stream’s labels and walks clockwise to find the responsible ingester(s).
Replication factor. How many ingesters each stream is written to in parallel. Typical values are 2 or 3. The distributor returns success when a quorum (floor(RF/2) + 1) of ingesters acknowledges. RF=3 with quorum=2 is the standard “survive one ingester failure” config.
Write-Ahead Log (WAL). Local disk log on each ingester that captures every push before it’s acked. If the ingester crashes, the replacement replays the WAL on startup to recover the in-memory state. With WAL + replication factor, Loki can tolerate ingester crashes without losing acked writes.
Tenant. The multi-tenancy unit. Loki is multi-tenant by default in microservices mode — every request must carry an X-Scope-OrgID header, and all data (in memory and at rest) is partitioned by tenant. In auth_enabled: false mode (common for small deployments), everything goes to a tenant named fake.
The query layer
LogQL. The query language. Inspired by PromQL. Two flavors: log queries return log lines; metric queries wrap log queries in aggregations (rate(), count_over_time(), quantile_over_time()) and return numeric time series suitable for Grafana panels and alerts.
Stream selector. The mandatory {label=value, ...} part of every LogQL query. This is what hits the index.
Log pipeline. The optional chain of operations applied after the stream selector. Includes line filters (|=, !=, |~, !~ — the actual grep step), parsers (| json, | logfmt, | pattern, | regexp), label filters (filter on extracted labels), and formatters (| line_format, | label_format).
Stream selector = “which haystack.” Line filter = “what to grep for in the haystack.” Parser + label filter = “extract structured fields from each line and filter on them.”
Read these three sentences again. The whole of LogQL is structured around them.
4. The Distilled Introduction
Here’s everything a tutorial would teach you, compressed.
Setting up
For learning or small-scale use, run Loki in monolithic mode with Docker:
docker run -d --name=loki -p 3100:3100 grafana/loki:latest
That’s a working Loki on port 3100 with sensible defaults and local filesystem storage. For anything past learning, you want it on Kubernetes via the official Helm chart:
helm repo add grafana https://grafana.github.io/helm-charts
helm install loki grafana/loki -n loki --create-namespace \
-f values.yaml
The single most important config decision is what storage backend chunks and index land on. In production this is essentially always object storage — S3, GCS, or Azure Blob. The relevant values.yaml section:
loki:
  schemaConfig:
    configs:
      - from: "2024-01-01"
        store: tsdb          # Always use tsdb. boltdb-shipper is legacy.
        object_store: s3
        schema: v13          # v13 is required for structured metadata.
        index:
          prefix: index_
          period: 24h
  storage:
    type: s3
    bucketNames:             # The Helm chart takes bucket names here, not under s3.
      chunks: my-loki-chunks
    s3:
      endpoint: s3.us-east-1.amazonaws.com
      region: us-east-1
store: tsdb and schema: v13 are the right answers for any new deployment in 2025+. Older guides and Stack Overflow answers will mention boltdb-shipper and schema: v11 — that’s deprecated. Pick TSDB.
Sending logs in
Loki doesn’t tail files. You need an agent. The current recommendation is Grafana Alloy (the official OpenTelemetry-flavored collector that replaced Promtail). Promtail still works and is everywhere in existing deployments, but it’s effectively in maintenance.
A minimal Alloy config to tail a local file and add a couple of static labels:
local.file_match "logs" {
  path_targets = [{"__path__" = "/var/log/myapp.log"}]
}

loki.source.file "files" {
  targets    = local.file_match.logs.targets
  forward_to = [loki.write.default.receiver]
}

loki.write "default" {
  endpoint {
    url = "http://loki:3100/loki/api/v1/push"
  }
  external_labels = {
    cluster = "prod",
    app     = "myapp",
  }
}
Two labels. That’s deliberate. We’ll come back to why.
On Kubernetes, you generally won’t write this by hand — you’ll deploy the k8s-monitoring Helm chart, which sets up Alloy as a DaemonSet to ship every container’s stdout/stderr to Loki with sensible labels (namespace, pod, container, app, etc.) auto-applied from Kubernetes metadata.
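Whatever agent you run, what it ultimately sends is a push request to the /loki/api/v1/push endpoint shown in the config above. As a rough illustration, that endpoint also accepts a JSON body, and the Python sketch below builds one without sending it. The helper name build_push_payload, the sample lines, and the fixed start timestamp are all hypothetical.

```python
import json
import time


def build_push_payload(labels, lines, start_ns=None):
    """Build a JSON push body: one stream, one [timestamp, line] pair per log line."""
    # Timestamps are Unix epoch nanoseconds, encoded as strings.
    base = time.time_ns() if start_ns is None else start_ns
    values = [[str(base + i), line] for i, line in enumerate(lines)]
    return {"streams": [{"stream": dict(labels), "values": values}]}


payload = build_push_payload(
    {"cluster": "prod", "app": "myapp"},
    ["GET /healthz 200", "GET /pay 500 upstream timeout"],
    start_ns=1_700_000_000_000_000_000,
)
print(json.dumps(payload, indent=2))
```

Note that the labels appear once per stream, not once per line. That is the wire-level reason a per-event value such as a request ID makes a poor label: every distinct value would need its own stream entry.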
Writing your first query
Open Grafana, add Loki as a data source pointing at http://loki:3100, then go to Explore. A LogQL query has two parts: the stream selector (mandatory) and the log pipeline (optional).
The simplest possible query:
{app="myapp"}
Returns all logs from streams where the label app equals myapp. Already this is doing real work: the index lookup finds every chunk with app="myapp" in the time range, the queriers pull those chunks from object storage (or cache), decompress, and stream the lines back.
Now narrow with a line filter:
{app="myapp"} |= "error"
|= is “contains this substring.” This is the most common Loki query pattern in practice: a tight stream selector to narrow the haystack, then a substring filter for the actual needle. The grep step is parallelized across queriers and is the fastest operation Loki does.
Filter operators chain naturally:
{app="myapp"} |= "error" != "EOF" |~ "timeout|deadline"
That’s “contains ‘error’, does not contain ‘EOF’, matches the regex ‘timeout|deadline’.” Filters apply left-to-right. Put the most selective filter first — every line that fails a filter saves work for the next one.
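To make the left-to-right behavior concrete, here is a small Python emulation of that filter chain over a handful of invented log lines. It is only a sketch of the semantics: Loki evaluates regexes with RE2 syntax and works on compressed chunks, neither of which is modeled here.

```python
import re

TIMEOUT_OR_DEADLINE = re.compile(r"timeout|deadline")


def apply_line_filters(lines):
    """Emulate: |= "error" != "EOF" |~ "timeout|deadline" (left to right)."""
    for line in lines:
        if "error" not in line:                     # |= "error"
            continue
        if "EOF" in line:                           # != "EOF"
            continue
        if not TIMEOUT_OR_DEADLINE.search(line):    # |~ "timeout|deadline"
            continue
        yield line


# Invented sample lines for illustration.
sample = [
    'level=info msg=started',
    'level=error msg="read: EOF"',
    'level=error msg="context deadline exceeded"',
    'level=error msg="upstream timeout"',
    'level=error msg="disk full"',
]
matches = list(apply_line_filters(sample))
print(matches)
```

Each continue is a line the later, more expensive checks never see, which is why the cheap and selective substring test belongs first.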
Parsing log lines
If your logs are structured (JSON, logfmt, or a known pattern), Loki can extract fields at query time and let you filter on them:
{app="myapp"} | json | status_code >= 500 | latency > 1s
What happens:
- Stream selector grabs the right chunks.
- The | json parser parses each line as JSON and extracts its fields as temporary labels (nested keys are flattened with underscores).
- status_code >= 500 and latency > 1s filter on the extracted labels.
The extracted labels exist only for the duration of the query — they aren’t stored. This is the right place for high-cardinality data: don’t index it, parse it on read. This is a core Loki idea and worth internalizing.
Other parsers:
- | logfmt for key=value style logs (Go services love this).
- | pattern "<ip> - - <_> \"<method> <path>\"" for fixed-format logs (much faster than regex).
- | regexp "..." for the messiest cases (slowest; use as a last resort).
Metric queries
Wrap a log query in an aggregation and you get a Prometheus-style time series:
sum by (status_code) (
  rate({app="myapp"} | json [5m])
)
This counts log lines per second, grouped by status_code, over rolling 5-minute windows. The output looks exactly like a Prometheus metric and can be graphed in Grafana, alerted on by the Loki ruler, or fed into recording rules.
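For intuition, the computation behind that query at a single evaluation instant can be sketched in a few lines of Python: count the lines inside the 5-minute window, group by the extracted label, and divide by the window length in seconds. The entries below are invented, and the function name rate_by is hypothetical; Loki repeats this at every step of the graph and does it across many queriers.

```python
from collections import Counter


def rate_by(entries, label, window_end, window_seconds=300):
    """Per-second line rate over (window_end - window_seconds, window_end], grouped by one label."""
    window_start = window_end - window_seconds
    counts = Counter(
        labels[label]
        for ts, labels in entries
        if window_start < ts <= window_end and label in labels
    )
    return {value: n / window_seconds for value, n in counts.items()}


# Hypothetical parsed entries: (unix_seconds, extracted_labels).
entries = [(1000 + i, {"status_code": "200"}) for i in range(30)]
entries += [(1000 + 10 * i, {"status_code": "500"}) for i in range(3)]
entries.append((100, {"status_code": "500"}))  # too old, outside the window

result = rate_by(entries, "status_code", window_end=1100)
print(result)  # {'200': 0.1, '500': 0.01}
```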
The available range aggregations: rate, count_over_time, bytes_rate, bytes_over_time, sum_over_time, avg_over_time, quantile_over_time, min_over_time, max_over_time. Combined with unwrap (extract a numeric value from a parsed label), you can compute things like “p99 of the duration field over a 5-minute window.”
Alerting
The ruler component reads a YAML file (or grabs one from object storage) full of rule groups, evaluates them on a schedule, and either emits Prometheus metrics (recording rules) or fires alerts (alerting rules) to Alertmanager. Example:
groups:
  - name: app-errors
    rules:
      - alert: HighErrorRate
        expr: |
          sum by (app) (
            rate({namespace="prod"} |= "ERROR" [5m])
          ) > 10
        for: 5m
        labels:
          severity: critical
This is identical in shape to a Prometheus alert. The only difference is the query language; the wiring, evaluation cadence, and dispatch via Alertmanager are the same.
LogCLI
LogQL from the command line is logcli. The binary ships with Loki and is the right tool for scripting, bulk export, or anything you don’t want to do through Grafana:
export LOKI_ADDR=http://loki:3100
logcli query '{app="myapp"} |= "error"' --since=1h --limit=1000
This matters more than it sounds: logcli is how you do ad-hoc investigation without context-switching to a browser, and how you pipe Loki output into awk, jq, or sort | uniq -c.
What you’ve now learned
You can install Loki, ship logs to it with Alloy, write LogQL queries (selector + pipeline + filters + parsers + metric aggregations), and set up alerting rules. That’s a working Loki user. Now we earn the rest of the document by digging into why it works the way it does — the Mental Model section is the payoff.
5. The Mental Model
Four core ideas. Internalize these and most of Loki’s behavior becomes predictable instead of mysterious.
Core idea 1: The label set IS the index. The log content is opaque to the index.
This is the foundational Loki idea and Section 1 in one sentence again. When Loki receives {app="payments"} my log content, only {app="payments"} makes it into the index. The text my log content is gzipped (or snappy’d) into a chunk and forgotten by the index forever. The chunk knows what stream it belongs to; that’s the only link.
What this predicts:
- Querying by anything other than labels means scanning chunks. There is no inverted index, so there is no fast path for “give me every log containing this word.” Loki will grep, in parallel, but it has to actually read every chunk in the selector. This is why {namespace="prod"} |= "error" is fast and {cluster=~".+"} |= "error" is slow: the selector decides how many chunks get scanned.
- Adding a new label retroactively is impossible. Labels are baked into chunks at ingest time. You can’t ALTER TABLE ADD COLUMN. If you want a new label, you set it from now forward; old data keeps its old labels forever.
- Parsing-at-query-time (| json | foo="bar") is the only way to filter on something that wasn’t a label at ingest. Loki re-parses the entire chunk every time. This is fine (it’s parallelized, it’s fast at human scale, and it keeps your index small), but it means the same parsing work happens on every query, not once at ingest.
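A toy model makes these predictions easy to see. The Python sketch below keeps a dictionary from label sets to chunk references (the "index") and a second dictionary standing in for opaque chunk blobs in object storage. It supports equality matchers only, and every name in it is hypothetical; it is an illustration of the idea, not of Loki's TSDB format.

```python
from collections import defaultdict

# Label set -> chunk references. This is all the "index" knows about.
index = defaultdict(list)
# Chunk reference -> lines. Stand-in for compressed blobs in object storage.
chunk_store = {}


def ingest(labels, chunk_ref, lines):
    """Record which stream a chunk belongs to; the content never touches the index."""
    index[frozenset(labels.items())].append(chunk_ref)
    chunk_store[chunk_ref] = lines


def query(selector, needle):
    """Use the index to pick chunks, then brute-force scan only those chunks."""
    wanted = set(selector.items())
    hits, scanned = [], 0
    for stream, refs in index.items():
        if not wanted <= stream:        # stream selector: label match only
            continue
        for ref in refs:
            scanned += 1                # every selected chunk must be read
            hits += [line for line in chunk_store[ref] if needle in line]
    return hits, scanned


ingest({"namespace": "prod", "app": "payments"}, "c1", ["ok", "error: card declined"])
ingest({"namespace": "prod", "app": "search"}, "c2", ["error: index stale"])
ingest({"namespace": "dev", "app": "payments"}, "c3", ["error: test failure"])

narrow_hits, narrow_scanned = query({"namespace": "prod", "app": "payments"}, "error")
wide_hits, wide_scanned = query({}, "error")
print(narrow_scanned, wide_scanned)  # 1 3
```

The narrow selector reads one chunk and the empty selector reads all three, even though both are looking for the same word. The cost follows the selector, not the needle.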
Core idea 2: Stream count is everything. Cardinality is the silent killer.
Every unique combination of label values is a separate stream. Every stream is held in memory on the ingesters. Every stream has its own active chunk being built. Every stream produces its own series of small immutable chunks in object storage. The number of streams determines:
- Ingester memory. Each active stream holds a chunk in RAM. 100k streams ≈ several GB of ingester RAM.
- Object storage object count. More streams means more chunks per unit time means more S3 PUT requests and more S3 GET requests at query time. Lots of small chunks is the worst possible storage layout.
- Index size. The TSDB index grows roughly linearly with stream count. A large index slows down every query’s planning phase.
- Query parallelism ceiling. Loki shards queries by stream; with too many tiny streams you get scheduling overhead; with too few large streams you can’t parallelize.
What this predicts:
- A label like user_id (millions of values) is a cardinality bomb. It will OOM your ingesters and shred your query latency. Even something that seems innocent (pod_ip in an environment with frequent pod restarts, or version in a service that redeploys hourly) can balloon over time. The hashes accumulate; old streams hang around until idle timeout.
- A high-cardinality label that seems low-cardinality at design time is the classic trap. customer_id looks fine in dev with 5 customers; in production with 50,000, it’s a disaster.
- The “stream count” metric (loki_ingester_memory_streams) is the canary. Watch it. If it grows unbounded, you have a cardinality problem and you need to find it now, not in a week.
- This is also why Grafana keeps lowering the default max_label_names_per_series (now 15 in v3.0, was 30 in v2.x). They want to make the wrong thing harder.
Core idea 3: The write path lives in memory; the read path lives in object storage.
The ingester is the only place where data exists between push-ack and flush-to-S3. While a chunk is being built, it’s in RAM. When the chunk is “full enough” (it reaches the ~1.5 MB compressed size target, its stream has been idle for chunk_idle_period, or it is older than max_chunk_age, whichever comes first), it’s flushed to S3, and the in-memory copy can be retained briefly for in-flight reads. The defaults are a 30-minute idle period and a 2-hour max age; production setups often tighten these to a few minutes and 30 minutes. The WAL on local disk is the recovery path if the ingester crashes before flush.
Meanwhile, the read path is the opposite: queriers fetch chunks from object storage, decompress them in memory, and scan. The hot path is the chunk cache (typically memcached) — at a 95%+ hit rate the read path basically never touches S3.
What this predicts:
- Ingester crashes are dangerous. Without WAL or replication, anything not yet flushed is lost. With both (WAL + RF=2+), Loki tolerates one ingester loss without data loss.
- Object storage is the disaster recovery boundary. Lose your ingesters and their disks: you lose at most ~30 minutes of data. Lose your object storage bucket: you lose everything. Back up your bucket. (Most cloud object stores have versioning and cross-region replication — turn them on.)
- Ingester sizing is memory-bound. People reflexively give ingesters tons of CPU; they’re actually memory-bound workloads. Production ingesters typically run at 16+ GB of RAM and very modest CPU.
- Read latency is dominated by S3 + decompression, not Loki itself. Caching is the entire performance story for queries. Skimp on memcached and your P99 query latency goes to garbage. The DEV community post quotes a 97.8% chunk cache hit rate; that’s the kind of number a production Loki needs.
Core idea 4: Loki queries are distributed grep, parallelized by time and by stream.
The query frontend splits a time-range query into many smaller time-range queries. The query scheduler distributes those across queriers. Each querier, for its slice, asks the index gateway “which chunks match this stream selector?”, fetches those chunks (cache → S3), decompresses, and runs the line filter / parser / label filter pipeline. Results stream back up to the frontend, which merges them.
What this predicts:
- More queriers = more parallelism = faster wide queries. The biggest performance lever after caching is querier count and split_queries_by_interval (default 15 min). Set the latter too high and you under-parallelize; set it too low and you over-schedule and lose to overhead. 15m is a good starting point.
- The selector controls the cost. {cluster="prod"} over 24 hours might be thousands of chunks per shard. {cluster="prod", namespace="payments", app="api"} over 24 hours might be a few chunks per shard. Same wall-clock duration; orders of magnitude difference in work.
- Line filters are free; parsers are expensive. A line filter is a byte-level substring or regex check that short-circuits the rest of the pipeline. A parser has to JSON-decode every line that passed the filter. Order matters: {...} |= "error" | json is much faster than {...} | json |= "error".
- Recording rules are the right answer for repeated expensive queries. If a dashboard runs the same quantile_over_time query every 30 seconds, that’s wasted work. Compute it once via a recording rule, store the result as a Prometheus metric, and query that.
6. The Architecture in Plain English
Walk through a write, then walk through a read.
A log line, end to end on the write path
Your application logs a line. An Alloy agent on the same node tails the file (or scrapes the container stdout), enriches it with labels from Kubernetes metadata (namespace, pod, container, app), batches it with other lines, and POSTs a protobuf-encoded push request to a distributor over HTTP/gRPC.
The distributor accepts the request and does a few things in sequence. First, validation: are the labels well-formed? Is the line within size limits? Is the timestamp within the acceptable window (not too old, not in the future)? Is the tenant within its rate limits? If anything fails, the line is dropped with a structured error and an incrementing loki_discarded_samples_total counter.
If validation passes, the distributor normalizes the label set (sorting it deterministically) and hashes the combination of (tenant_id, sorted_label_set). It uses that hash to look up the responsible ingesters in the hash ring — specifically, replication_factor ingesters (usually 3) by walking clockwise around the ring. It forks the request to all RF ingesters in parallel.
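The ring lookup described here can be sketched in Python. This is a simplified model under stated assumptions: SHA-256 stands in for whatever hash Loki actually uses, the token count per ingester and the seed are arbitrary, and the function names are hypothetical. What it does preserve is the mechanism: sort the labels, hash tenant plus labels to a 32-bit position, then walk clockwise collecting distinct ingesters.

```python
import bisect
import hashlib
import random


def stream_token(tenant_id, labels):
    """Hash (tenant, sorted label set) to a 32-bit ring position."""
    key = tenant_id + "|" + ",".join(f"{k}={v}" for k, v in sorted(labels.items()))
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:4], "big")


def build_ring(ingesters, tokens_each=64, seed=42):
    """Each ingester registers random 32-bit tokens; the ring is the sorted list."""
    rng = random.Random(seed)
    return sorted(
        (rng.getrandbits(32), name) for name in ingesters for _ in range(tokens_each)
    )


def owners(ring, token, replication_factor=3):
    """Walk clockwise from the token, collecting distinct ingesters."""
    positions = [t for t, _ in ring]
    start = bisect.bisect_left(positions, token)
    chosen = []
    for step in range(len(ring)):
        name = ring[(start + step) % len(ring)][1]   # wrap around the ring
        if name not in chosen:
            chosen.append(name)
        if len(chosen) == replication_factor:
            break
    return chosen


ring = build_ring(["ingester-0", "ingester-1", "ingester-2", "ingester-3"])
a = owners(ring, stream_token("tenant-a", {"app": "payments", "namespace": "prod"}))
b = owners(ring, stream_token("tenant-a", {"namespace": "prod", "app": "payments"}))
print(a == b, len(set(a)))  # True 3
```

Sorting the labels before hashing is what makes the lookup stable: the same stream lands on the same ingesters regardless of the order in which an agent happens to send its labels.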
The ingesters receive the line. Each ingester finds the in-memory stream object matching the label set (or creates a new one — this is when stream count goes up), appends the line to the stream’s active chunk, and writes an entry to the WAL on local disk. Once the WAL fsync (or batched fsync) returns, the ingester acks the push to the distributor.
The distributor waits for a quorum of acks (floor(RF/2)+1 — so 2 of 3 for RF=3) and returns success to Alloy. If quorum fails, the distributor returns an error and Alloy retries with backoff.
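The quorum rule is plain integer arithmetic, and a short Python sketch shows what it implies for a few replication factors. The function names are hypothetical; only the formula floor(RF/2)+1 comes from the text.

```python
def quorum(replication_factor):
    """Minimum number of ingester acks for a push to succeed: floor(RF/2) + 1."""
    return replication_factor // 2 + 1


def push_acknowledged(acks, replication_factor):
    """acks is one boolean per ingester the distributor wrote to."""
    return sum(1 for ok in acks if ok) >= quorum(replication_factor)


for rf in (1, 2, 3, 5):
    # Columns: RF, quorum, failed writes tolerated under this formula.
    print(rf, quorum(rf), rf - quorum(rf))

print(push_acknowledged([True, True, False], replication_factor=3))   # True
print(push_acknowledged([True, False, False], replication_factor=3))  # False
```

Under this formula RF=3 is the smallest setting where a write still succeeds with one ingester unavailable.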
In the background, ingesters run a flush loop. Every chunk is checked: has it reached its target size (~1.5 MB compressed, controlled by chunk_target_size)? Has it been idle for chunk_idle_period (default 30 min, often reduced to 2-5 min in production)? Has it exceeded max_chunk_age (default 2h, often set to 30 min)? If yes to any, the chunk is finalized, compressed (gzip by default; snappy is a common production choice), uploaded to object storage as an immutable blob keyed by (tenant, stream_hash, time_range), and the corresponding TSDB index file is written. The in-memory chunk is retained for chunk_retain_period (0s by default, commonly set to around a minute) so in-flight queries can still hit it directly.
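The three flush conditions can be written as a single decision function. The Python sketch below is a simplified model: the Chunk fields and the function name flush_reason are hypothetical, times are plain seconds, and the defaults mirror the values quoted above for chunk_target_size, chunk_idle_period, and max_chunk_age.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    compressed_bytes: int   # current compressed size
    first_append: float     # time of the first line, in seconds
    last_append: float      # time of the most recent line, in seconds


def flush_reason(chunk, now,
                 target_size=1_572_864,      # ~1.5 MB
                 idle_period=30 * 60,        # 30 minutes
                 max_age=2 * 60 * 60):       # 2 hours
    """Return why the chunk should be flushed, or None to keep building it."""
    if chunk.compressed_bytes >= target_size:
        return "full"
    if now - chunk.last_append >= idle_period:
        return "idle"
    if now - chunk.first_append >= max_age:
        return "max_age"
    return None


print(flush_reason(Chunk(2_000_000, 0, 10), now=20))      # full
print(flush_reason(Chunk(100, 0, 0), now=1800))           # idle
print(flush_reason(Chunk(100, 0, 7000), now=7200))        # max_age
print(flush_reason(Chunk(100, 0, 50), now=60))            # None
```

The second and third cases are the ones high cardinality produces in bulk: streams that never receive enough data to fill a chunk get flushed on a timer, as many small objects.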
Periodically, the compactor wakes up, downloads all the per-ingester TSDB index files for a given day and tenant, merges them into one canonical index file, uploads it back, and deletes the originals. It also processes retention — for each tenant, it walks the index, identifies chunks past the retention horizon, and deletes them from object storage.
That’s the write path. The state lives in the ingesters until flushed; from there, it lives in object storage forever (or until retention deletes it).
A query, end to end on the read path
A user opens Grafana and runs {app="payments", namespace="prod"} |= "error" | json | duration > 1s over the last hour. Grafana hits the query frontend.
The query frontend does its splitting trick: the 1-hour range becomes four 15-minute subqueries (using the default split_queries_by_interval). It also checks its result cache — if any of these subquery ranges have been answered recently with stale-tolerant results, it serves them from cache and skips them. The remaining subqueries get put on a queue.
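The splitting step is simple enough to sketch directly. The Python function below cuts a time range into consecutive pieces of at most one interval each. It is a simplification with a hypothetical name: the real frontend also aligns pieces to interval boundaries and consults the result cache, neither of which is modeled.

```python
from datetime import datetime, timedelta, timezone


def split_by_interval(start, end, interval):
    """Cut [start, end) into consecutive pieces no longer than interval."""
    pieces = []
    cursor = start
    while cursor < end:
        upper = min(cursor + interval, end)
        pieces.append((cursor, upper))
        cursor = upper
    return pieces


end = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
last_hour = split_by_interval(end - timedelta(hours=1), end, timedelta(minutes=15))
last_day = split_by_interval(end - timedelta(hours=24), end, timedelta(minutes=15))
print(len(last_hour))  # 4
print(len(last_day))   # 96
```

Each piece is an independent unit of work, which is what lets the scheduler spread one user query across many queriers.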
The query scheduler (if present) pulls subqueries off the queue and hands them out to queriers as workers free up. The scheduler enforces tenant-level fairness so one user’s expensive 24-hour query can’t starve everyone else’s 5-minute one.
A querier picks up a subquery. First it asks the index gateway for the chunk references matching the stream selector {app="payments", namespace="prod"} in that subquery’s time range. The index gateway has the TSDB index files cached in memory (or on its local disk); it returns a list of chunk IDs in milliseconds. The querier also makes a gRPC call to the ingesters directly, asking for any in-memory chunks for that selector that haven’t been flushed yet.
For each chunk ID, the querier checks the chunk cache (memcached). On a hit, it gets the chunk bytes back fast and only has to decompress them. On a miss, it fetches the chunk from object storage, writes it back to memcached for the next reader, and decompresses it.
Now the querier has a stream of log lines from each chunk. It runs the pipeline:
- Line filter |= "error": byte-level substring match on each line. Lines without “error” are discarded immediately. This is typically a 10-100x reduction in line count, and the parser never sees the discarded lines.
- Parser | json: for each surviving line, parse as JSON and extract its fields as temporary labels (nested keys are flattened with underscores).
- Label filter | duration > 1s: filter on the extracted duration label. Loki parses “1s” → 1 second, parses each line’s duration value, compares. Lines that fail are discarded; lines where the conversion errors out get a __error__ label and are kept (so you can see what failed).
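The three stages above can be emulated in Python to show the order of work and the error behavior. This is a sketch under assumptions: the duration parser handles only ms, s, m, and h suffixes, nested JSON values are skipped rather than flattened, and the sample lines are invented. The __error__ values used here (JSONParserErr, LabelFilterErr) follow the names Loki reports.

```python
import json
import re

UNIT_SECONDS = {"ms": 0.001, "s": 1.0, "m": 60.0, "h": 3600.0}


def parse_duration(text):
    """Parse a simple duration such as "250ms" or "2.5s" into seconds."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)(ms|s|m|h)", text)
    if match is None:
        raise ValueError(f"not a duration: {text!r}")
    return float(match.group(1)) * UNIT_SECONDS[match.group(2)]


def run_pipeline(lines, threshold="1s"):
    """Emulate: |= "error" | json | duration > 1s"""
    limit = parse_duration(threshold)
    for line in lines:
        if "error" not in line:                 # 1. line filter (cheap)
            continue
        try:                                    # 2. parser (expensive)
            doc = json.loads(line)
            if not isinstance(doc, dict):
                raise ValueError("not a JSON object")
        except ValueError:
            yield line, {"__error__": "JSONParserErr"}
            continue
        labels = {k: str(v) for k, v in doc.items() if not isinstance(v, (dict, list))}
        if "duration" not in labels:            # 3. label filter
            continue
        try:
            keep = parse_duration(labels["duration"]) > limit
        except ValueError:
            labels["__error__"] = "LabelFilterErr"
            keep = True                         # kept so the failure is visible
        if keep:
            yield line, labels


sample = [
    '{"msg": "error talking to db", "duration": "2.5s"}',
    '{"msg": "error", "duration": "250ms"}',
    '{"msg": "ok", "duration": "9s"}',
    '{"msg": "error", "duration": "slow"}',
    'plain text error line',
]
results = list(run_pipeline(sample))
for line, labels in results:
    print(labels.get("__error__", "ok"), line)
```

Of the five sample lines, the third never reaches the JSON parser at all, which is the saving the line filter buys.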
Surviving lines, with their original labels (and any extracted ones if you used | label_format to promote them), are streamed back to the query frontend.
The query frontend collects all subquery results, merges them in order (deduplicating any lines that came from both an ingester and a flushed chunk — this is why every line has a stream ID and nanosecond timestamp), and returns the result to Grafana.
For a metric query (one with an aggregation like rate() or count_over_time()), the same pipeline runs but each querier returns aggregated numeric samples per subquery range instead of raw lines, and the frontend stitches the samples into a Prometheus-style time series.
That’s the read path. State lives in object storage; queriers are stateless workers; caching is what makes it fast.
7. The Things That Bite You
These will surprise you in your first 6-12 months. Each one connects to the mental model.
Gotcha 1: A “harmless-looking” label silently 10x’d your stream count.
You added level because dashboards wanted to color-code by log level. Now every stream is split four ways (info/warn/error/debug). You added pod because you wanted to query a specific pod. Now every pod restart creates a new stream (and pods restart constantly in Kubernetes). You added version because release tracking. Now every deploy doubles your active streams for a few hours.
What you expected: Adding a label is a query-time convenience.
What actually happens: Adding a label is a cardinality multiplier. Loki has to track every unique combination forever (until idle). This connects directly to Core Idea 2 (stream count is everything).
How to handle it: Be parsimonious with labels at ingest. Use |= "level=error" as a line filter instead of level=error as a label. Use | json | pod="foo-7f8b9c" as a parsed filter for one-off queries. Reserve labels for the 5-10 dimensions you query frequently and that have bounded value sets.
Gotcha 2: Out-of-order writes used to be rejected; now they aren’t (but only sort of).
Before Loki 2.4, every log line had to arrive in strictly increasing timestamp order per stream. Networks being networks, this caused real pain: Promtail had to buffer and reorder, agents had to be careful, late-arriving data was dropped. As of Loki 2.4, out-of-order writes are accepted by default within a window of max_chunk_age / 2, which is 1 hour with the default 2-hour max_chunk_age.
What you expected: Out-of-order is now a non-issue.
What actually happens: Within the window, it works fine. Outside the window, lines are still rejected with too far behind. The “too far behind” threshold is per-stream, measured against the highest timestamp Loki has seen for that stream. If a stream is very fast and one log line is delayed by more than the window, it’s dropped silently (well, with a discarded counter increment).
How to handle it: Watch loki_discarded_samples_total{reason="too_far_behind"} and out_of_order. If non-zero, investigate clock skew, log shipper buffering, or batch jobs replaying old data. For batch ingestion of historical data, either disable the check temporarily or split into streams with separate ingestion order via labels (e.g., {...} → {..., batch="historical-2024-08"}).
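The per-stream nature of the check is the part that surprises people, so here is a minimal Python model of it. The class name StreamClock and the 3600-second window are assumptions chosen for illustration; substitute whatever window your deployment actually allows.

```python
class StreamClock:
    """Tracks the highest timestamp seen for one stream and applies the window check."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.highest_seen = None

    def offer(self, ts):
        """Return True if a line with timestamp ts would be accepted."""
        if self.highest_seen is not None and ts < self.highest_seen - self.window_seconds:
            return False                      # too far behind: rejected
        if self.highest_seen is None or ts > self.highest_seen:
            self.highest_seen = ts            # newest line moves the window forward
        return True


clock = StreamClock(window_seconds=3600)      # assumed window for illustration
print(clock.offer(10_000))   # True: first line
print(clock.offer(9_000))    # True: out of order, but inside the window
print(clock.offer(6_399))    # False: more than 3600s behind 10_000
print(clock.offer(20_000))   # True: a fast writer advances the stream
print(clock.offer(10_000))   # False: the same timestamp is now too old
```

The last two calls show why replaying historical data into a live stream fails: one fresh line moves the threshold, and everything older than the window is rejected from then on. Giving the replay its own label, as suggested above, gives it its own clock.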
Gotcha 3: A wide-open selector tanks the cluster.
A new engineer writes {cluster=~".+"} |= "error" to “find all errors everywhere.” This selects every stream in the cluster. The query frontend dutifully splits it into shards. Each querier loads chunks from hundreds of streams. The cache thrashes. Memory spikes. Other tenants’ queries queue up behind this one. You get paged.
What you expected: Loki is “designed to be cheap” and handles wide queries.
What actually happens: Wide selectors defeat the whole point of Loki’s design — they turn every query into “scan everything.” Loki’s costs scale with chunks loaded, not lines returned. A selector matching 10,000 streams over 24 hours might touch millions of chunks. Connects to Core Idea 4 (the selector controls the cost).
How to handle it: Set max_query_series and max_query_length per tenant. Educate users to always anchor on namespace or app first. Use shuffle sharding and per-tenant query parallelism limits to contain blast radius from one user’s bad query.
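To make "costs scale with chunks loaded" concrete, here is a back-of-the-envelope estimate. It is a rough sketch that assumes every matched stream writes continuously and that chunks cover a uniform time span; the function name and the 10-minute figure are assumptions, not measurements.

```python
from math import ceil


def estimate_chunks(streams: int, range_hours: int, avg_chunk_span_minutes: int) -> int:
    """Rough count of chunks a query must fetch: streams x chunks per stream in the range."""
    chunks_per_stream = ceil(range_hours * 60 / avg_chunk_span_minutes)
    return streams * chunks_per_stream


# Chunks that each live for a full 2h max_chunk_age:
print(estimate_chunks(10_000, 24, 120))  # 120000
# Chunks cut every ~10 minutes (hypothetical, e.g. busy streams hitting the size target):
print(estimate_chunks(10_000, 24, 10))   # 1440000
```

Narrowing the selector from 10,000 streams to 100 cuts both numbers by 100x, which is why anchoring on namespace or app matters more than any querier tuning.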
Gotcha 4: The compactor is a singleton and a single point of slowness.
The compactor is usually one pod. It downloads index files, merges them, uploads, and also enforces retention. If it falls behind — because of slow object storage, a tenant with a huge index, or a deletion request backlog — your indexes stay fragmented (slower queries) and your retention horizon drifts (logs you thought were deleted aren’t).
What you expected: Compaction is background; doesn’t affect query latency.
What actually happens: A backed-up compactor degrades query performance across the board because queries have to merge many small index files instead of one big one. And retention drift is a compliance landmine.
How to handle it: Monitor loki_compactor_oldest_pending_delete_request_age_seconds and the compactor’s loop duration. Horizontal scaling of the compactor was added recently (split work by tenant); use it for large multi-tenant clusters. Give the compactor real CPU and memory; it’s not free.
Gotcha 5: WAL on a network filesystem is a foot-cannon.
Helm chart deploys with EBS gp3 by default, which is fine. But if you accidentally end up on EFS or NFS for ingester WAL (because of “shared storage simplicity”), fsync latency goes through the roof. Push latency climbs. Ingesters fall behind on flush. Memory builds up. OOM.
What you expected: Storage is storage.
What actually happens: The WAL needs low-latency block storage. A network filesystem adds 10-100x the fsync latency.
How to handle it: Ingester WAL volumes must be SSD-backed block devices (EBS gp3/io2, GCE pd-ssd, Azure premium SSD) or local NVMe, never a shared filesystem such as EFS or NFS. Index gateway and compactor can be on shared storage (EFS) because they’re read-heavy. Don’t mix these up.
Gotcha 6: “Why are my logs missing from the last 30 minutes?”
A user runs a query that includes the current time and finds the last 30 minutes empty. They panic. The logs are actually fine — they’re just still on the ingesters, not yet flushed to object storage, and the querier has to query both the ingesters and the store. If the query path can’t reach the ingesters (network issue, ingester restart, address misconfiguration), it falls back to just the store and the recent data appears missing.
What you expected: Queriers are queriers; they read what’s there.
What actually happens: Recent data lives only on ingesters until flush. The querier must combine ingester + store results. Connects to Core Idea 3 (write path in memory; read path in object storage).
How to handle it: Watch loki_distributor_ingester_appends_total and the ingester health in the ring. If queriers can’t reach an ingester, queries silently return partial results — Loki doesn’t fail loudly here. Set up the Loki Canary (a sidecar that writes known sequences and reads them back) so you get an explicit signal when recent data goes missing.
Gotcha 7: Schema migrations are real, and they’re not zero-downtime.
You decide to migrate from boltdb-shipper to tsdb, or from schema v11 to v13. The right way: add a new schema config block with a future from: date, let writes start landing on the new index while old reads still use the old index. Until all old data has aged out of retention, queries that span the boundary have to hit both indexes.
What you expected: “Migrate to TSDB” is a setting change.
What actually happens: Each write goes to the index its schema period dictates, but both indexes have to stay readable until the old data ages out, which can take the length of your retention window. Plan for it.
How to handle it: Read the official migration guide. Set the from: date in the new schema config to the future (next day, not today). After the cutover, validate that new writes are landing on the new index and that queries spanning the boundary return data from both. Don’t try to migrate during an incident.
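The boundary behavior can be sketched as an interval-overlap check. This is a simplified model, not Loki's code: the PERIODS list, its dates, and the function name are hypothetical stand-ins for the entries in schema_config.

```python
from datetime import date

# Hypothetical schema periods: (from date, index store, schema version).
PERIODS = [
    (date(2023, 1, 1), "boltdb-shipper", "v11"),
    (date(2025, 6, 1), "tsdb", "v13"),
]


def periods_for_query(start: date, end: date, periods=PERIODS):
    """Return the schema periods a query over [start, end] has to consult."""
    hits = []
    for i, (frm, store, schema) in enumerate(periods):
        # A period runs from its own 'from' date up to the next period's 'from' date.
        until = periods[i + 1][0] if i + 1 < len(periods) else date.max
        if frm <= end and start < until:
            hits.append((store, schema))
    return hits


print(periods_for_query(date(2025, 5, 25), date(2025, 6, 5)))  # both periods
print(periods_for_query(date(2025, 6, 2), date(2025, 6, 5)))   # tsdb only
```

Once the oldest retained day is past the second from: date, no query can overlap the first period and the old index is no longer consulted.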
Gotcha 8: auth_enabled: false is multi-tenancy theater.
Many starter configs set auth_enabled: false, which routes everything to a tenant called fake. This is fine for development, dangerous in production. When you eventually decide you want multi-tenancy, you have to flip auth_enabled: true, regenerate every agent’s config to send X-Scope-OrgID, and, critically, your old fake tenant data is not migrated. It just becomes a tenant called “fake” forever, separate from all your new tenants.
What you expected: You can turn on auth later.
What actually happens: Old data lives under tenant fake, new data lives under whatever tenant you specify, and there’s no rename.
How to handle it: Decide on tenancy strategy up front. Even if you only have one tenant, set auth_enabled: true and use a meaningful tenant name from day one.
8. The Judgment Calls
These are the decisions that separate “ran the Helm chart” from “actually understands Loki.”
Judgment 1: Loki vs. Elasticsearch — which to reach for.
Loki is right when: Your queries are mostly “I know the service/cluster/namespace, show me its logs around this incident” + occasional grep. You want low operational cost. You want object-storage-only pricing economics (~10x cheaper than Elasticsearch at scale). You’re already on Prometheus + Grafana and want the same operational model. Your logs are mostly text and you don’t need to run analytics on them.
Elasticsearch is right when: You genuinely need full-text search across all logs (“which user typed DROP TABLE?”), and you need it sub-second across long time ranges. You’re doing log analytics, not log debugging — pivots, aggregations, joins on log content. You have security/SIEM requirements that need indexed-everything for investigations. You have the budget and headcount for it.
The signal: If 90% of your queries start with a label filter, Loki. If 90% start with arbitrary text, Elasticsearch. There is no shame in admitting that your team’s investigation patterns are the second kind — Loki will hurt for that workload.
Judgment 2: Monolithic vs. SSD vs. microservices.
Monolithic (-target=all): Up to ~100-200 GB/day, single team, no multi-tenancy, low operational appetite. Run two of them for HA with replication_factor: 3 and shared object store. This works for far more deployments than people admit.
Simple Scalable Deployment (SSD): The Helm chart default historically — three roles (read/write/backend). Being deprecated before Loki 4.0. If you’re starting fresh, skip SSD; go straight to microservices. If you’re already on SSD, migrate to distributed when you have the cycles.
Microservices (distributed): Above ~500 GB/day or multi-tenant. Each component as its own deployment. More moving parts, but each piece scales independently. Grafana Labs runs Loki this way internally.
The signal: Are you single-tenant, under ~200 GB/day, and your operations team is two people? Monolithic. Multi-tenant, or heading toward ~500 GB/day and beyond? Microservices. Don’t bother with SSD for new clusters.
Judgment 3: How many labels, and which ones.
Start with the bare minimum. For Kubernetes, the irreducible set is roughly: cluster, namespace, app (or service_name), container. That’s it. Maybe add environment (prod/staging/dev) and region. That’s six labels, max.
Resist adding labels you “might want to query someday.” Each label multiplies your stream count by its cardinality. Each label adds latency to the index lookup. The right place for high-cardinality fields is structured metadata (Loki 3.0+) or parsed-at-query-time.
The signal: If your loki_ingester_memory_streams is more than ~10,000 per ingester, you have too many labels or one with too much cardinality. Hunt it down with logcli series '{}' --analyze-labels, which prints the number of unique values per label, and consider whether the worst offender can become structured metadata instead.
Judgment 4: Retention period.
The temptation is to set retention to “forever” or “as long as possible.” Resist.
Short retention (3-7 days): Loki as a debugging tool. Everything older goes to cold storage (S3 lifecycle to Glacier) or another system. Queries stay fast, indexes stay small, costs stay predictable. This is the right default.
Medium retention (30-90 days): Compliance starts mattering, on-call wants historical context, occasional “this regression happened over the last month” investigations. Costs go up sub-linearly because object storage is cheap, but queries over the whole range get slow.
Long retention (1+ year): Compliance, security forensics, billing investigations. Loki can do this, but you’re paying for it in storage AND in slow queries across that range. Consider a separate “archive” tier or a different system entirely.
The signal: If most queries are within the last 7 days but you’re paying to keep 365 days hot, you’re wasting money. Two-tier with a shorter retention window for the active store and S3 lifecycle for archives is usually the right answer.
Judgment 5: Replication factor.
RF=1: Fine for single-binary monolithic. No HA, but the WAL gives you durability for ingester crashes (you’ll lose ~30s of in-flight data on restart). Not suitable for distributed setups.
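RF=2: The pragmatic choice. Data survives one ingester loss because a second copy exists. Needs a third less ingester memory than RF=3. The dev.to production post calls this the “sweet spot” and they’re not wrong: for most non-critical workloads, this is enough. One caveat: the write quorum is floor(RF/2) + 1, which is 2 at RF=2, so writes to streams owned by a downed ingester can fail until it recovers or leaves the ring.
RF=3: Production default for critical workloads. Data survives two simultaneous ingester losses, and the write quorum of 2 is still met with one ingester down. 50% more ingester memory than RF=2 for the additional safety.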
The signal: If logs being missing for 5 minutes after an ingester crash is acceptable to your incident response process, RF=2. If logs are part of your compliance audit trail or your SLO, RF=3.
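The trade-offs between replication factors come down to a little arithmetic. The sketch below assumes the quorum rule floor(RF/2) + 1 and treats ingester memory as proportional to RF; the function names are hypothetical. It models write availability only, which is a separate question from whether a copy of the data survives.

```python
def write_quorum(replication_factor: int) -> int:
    """Replicas that must acknowledge a write: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def tolerated_write_failures(replication_factor: int) -> int:
    """Replicas that can be down while writes still reach quorum."""
    return replication_factor - write_quorum(replication_factor)


def relative_memory(replication_factor: int, baseline_rf: int) -> float:
    """Ingester memory relative to a baseline, assuming memory scales with RF."""
    return replication_factor / baseline_rf


for rf in (1, 2, 3):
    print(rf, write_quorum(rf), tolerated_write_failures(rf))
# 1 1 0
# 2 2 0
# 3 2 1

print(relative_memory(3, 2))  # 1.5
```

The table shows why RF=3 is the first setting where a single ingester can be down without any write failing quorum.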
Judgment 6: Which agent — Alloy, Promtail, Fluent Bit, or OTel Collector.
Grafana Alloy: The current recommendation. Successor to Promtail. Supports both Loki push and native OTLP, plus Prometheus scraping, plus Pyroscope profiles. If you’re starting fresh in 2025+, this is the answer.
Promtail: Still very widely deployed. Still works. In maintenance mode. Migrate to Alloy when convenient, not urgent.
Fluent Bit: Lower memory footprint than Alloy, broad ecosystem (not just Grafana). Good for resource-constrained edge or embedded scenarios. The community Loki plugin works; the official one is mature.
OTel Collector: Right if you’re committed to OpenTelemetry across the org for everything (traces, metrics, logs). Loki 3.0+’s native OTLP endpoint means you don’t need a Loki-specific exporter anymore.
The signal: Are you on Grafana’s stack with no strong opinions? Alloy. Already deployed Promtail and it works? Don’t migrate yet. Standardizing on OTLP org-wide? OTel Collector. Need the smallest possible agent? Fluent Bit.
Judgment 7: Labels vs. structured metadata vs. parse-at-query-time.
For any given field in your logs, three places it can live:
Index labels: Static, bounded cardinality, queried frequently. app, namespace, level if you really query by level a lot. Goes into the TSDB index.
Structured metadata: Per-line, possibly high-cardinality, queryable but not indexed. trace_id, request_id, user_id if you frequently search for specific values. Stored inside the chunk. Bloom filters can accelerate lookups (in distributed mode at large scale).
Parse-at-query-time: Fields embedded in the log line that you parse with | json or | logfmt when querying. Cheap to add (zero ingest cost), expensive per query (re-parsed every time). The default for anything you don’t query often.
The signal: “I want to query by this field every five minutes” → label (if low-cardinality) or structured metadata (if high-cardinality). “I rarely query by this field but need to occasionally” → parse-at-query-time. The cost of a label is paid forever; the cost of parsing is paid only on the queries that need it.
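The decision rule above can be written down as a tiny function. This is a sketch of the heuristic only; the cutoff of 100 distinct values is a made-up placeholder for "bounded cardinality", and place_field is a hypothetical name.

```python
def place_field(distinct_values: int, queried_often: bool, max_label_cardinality: int = 100) -> str:
    """Suggest where a log field should live, following the signal described above."""
    if not queried_often:
        return "parse-at-query-time"
    if distinct_values <= max_label_cardinality:
        return "label"
    return "structured metadata"


print(place_field(distinct_values=3, queried_often=True))           # label
print(place_field(distinct_values=5_000_000, queried_often=True))   # structured metadata
print(place_field(distinct_values=5_000_000, queried_often=False))  # parse-at-query-time
```

The order of the checks matters: query frequency is tested first, because a rarely queried field should not pay any ingest-time cost regardless of its cardinality.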
Judgment 8: Use bloom filters or not?
Bloom filters (Loki 3.0+, experimental, requires distributed mode) accelerate “needle in a haystack” queries on structured metadata. They’re a separate component layer (Bloom Planner, Builder, Gateway) that builds probabilistic data structures over your high-cardinality fields.
Use them when: You’re ingesting >50-75 TB/month, you have trace IDs / request IDs / customer IDs as structured metadata, and your dominant query pattern is “find this one needle across a huge time range.” Grafana Cloud uses this for its OTel customers.
Don’t bother when: You’re under that scale. The build cost (CPU, storage for the bloom blocks) is significant; the payoff only kicks in when you’re trying to scan terabytes of chunks per query. For most deployments, label discipline and good caching are way more impactful.
The signal: Read the bloom filter docs only if you’ve already optimized labels, caching, and query patterns and you’re still seeing multi-minute queries on structured-metadata lookups.
Judgment 9: Single tenant vs. multi-tenant.
Single tenant: Simpler config (auth_enabled: false works), no header propagation needed at agents, no per-tenant limits to manage. All operations apply globally. Right for one-team or one-environment Loki instances.
Multi-tenant: Required for any shared platform. Each team gets a tenant ID, agents must send X-Scope-OrgID, per-tenant limits enforced (so one bad team can’t take down the cluster). Multiplies operational complexity — every limit has a per-tenant override file, every metric is sliced by tenant, every dashboard needs a tenant variable. But it’s the only way to safely give multiple teams access to one Loki cluster.
The signal: Will more than one team write to this Loki? Multi-tenant from day one. The migration from “fake tenant” later is genuinely painful.
Judgment 10: When to leave Loki entirely.
Loki is wrong for you if:
- You need sub-second arbitrary text search across petabytes (Elasticsearch).
- Your logs are your primary product surface (Datadog, Sumo Logic, Splunk — pay the money, get the UX).
- You need to join logs with other data sources at query time as a primary workflow (ClickHouse, Honeycomb).
- Your team genuinely can’t operate distributed systems on Kubernetes and the cost of someone learning is higher than just paying Datadog.
- You have no Prometheus / Grafana investment and adding both plus Loki is more change than the team can absorb.
Loki is the right answer for most engineering orgs that already run Prometheus + Grafana, but it’s not the right answer for all of them. Be honest about the workload.
9. The Commands/APIs That Actually Matter
The 20% of LogQL and operational interfaces you’ll use 80% of the time, grouped by task.
Building queries
Simple selection:
{namespace="prod", app="payments"}
With a line filter (the most common query in real life):
{namespace="prod", app="payments"} |= "error"
{namespace="prod"} |= "ERROR" != "expected" |~ "timeout|deadline"
Parsing JSON and filtering on extracted fields:
{app="api"} | json | status >= 500 | duration > 1s
Parsing logfmt (common for Go services):
{app="api"} | logfmt | level="error" | duration > 1s
Pattern parser (much faster than regex when the format is fixed):
{app="nginx"} | pattern `<ip> - - <_> "<method> <path> <_>" <status> <size>` | status="500"
Reformatting output to focus on what matters:
{app="api"} | json | line_format "{{.timestamp}} {{.user_id}} {{.duration}}ms {{.status}}"
Metric queries (for dashboards and alerts)
Rate of errors per service:
sum by (app) (rate({namespace="prod"} |= "ERROR" [5m]))
p95 latency from logged durations:
quantile_over_time(0.95,
{app="api"} | json | unwrap duration_ms [5m]
) by (route)
Number of active streams per app (cardinality check):
count by (app) (count_over_time({namespace="prod"}[1h]))
Top-K offenders:
topk(10, sum by (pod) (rate({app="api"} |= "panic" [1h])))
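Any of the queries above can be sent to Loki's HTTP API programmatically. The sketch below only builds the URL for the query_range endpoint, so it runs without a Loki instance; the helper name query_range_url and the timestamps are hypothetical, and http://loki:3100 is just the example address used in this guide.

```python
from urllib.parse import urlencode


def query_range_url(base_url: str, logql: str, start_ns: int, end_ns: int, limit: int = 100) -> str:
    """Build a GET URL for /loki/api/v1/query_range with a URL-encoded LogQL query."""
    params = {"query": logql, "start": start_ns, "end": end_ns, "limit": limit}
    return f"{base_url}/loki/api/v1/query_range?{urlencode(params)}"


url = query_range_url(
    "http://loki:3100",
    '{namespace="prod", app="payments"} |= "error"',
    start_ns=1_700_000_000_000_000_000,  # nanosecond Unix timestamps
    end_ns=1_700_003_600_000_000_000,
)
print(url)
```

Encoding the query with urlencode avoids the most common mistake when calling the API by hand: unescaped braces, quotes, and pipes in the LogQL string.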
LogCLI
export LOKI_ADDR=http://loki:3100
export LOKI_ORG_ID=tenant-1 # If auth_enabled: true
# Tail logs live (like `tail -f` for a service)
logcli query '{app="api"}' --tail
# Last hour of errors, capped at 5000
logcli query '{app="api"} |= "error"' --since=1h --limit=5000
# Pipe to standard Unix tools: emit JSON lines, then count errors per minute
logcli query '{app="api"} |= "error"' --since=1h --limit=10000 --output=jsonl --quiet \
| jq -r '.timestamp[0:16]' | sort | uniq -c
# Show what labels and values exist (cardinality discovery)
logcli labels
logcli labels app
Operations: useful HTTP endpoints
# Ready check (use for liveness/readiness probes)
GET /ready
# Current configuration (sanitized — secrets redacted)
GET /config
# Label names seen in the queried time range (cardinality discovery)
GET /loki/api/v1/labels
# Push a log line directly (good for testing)
POST /loki/api/v1/push
{
"streams": [{
"stream": { "app": "test" },
"values": [["1700000000000000000", "hello from curl"]]
}]
}
# Get the ring state (which ingesters are healthy)
GET /ring
# Get distributor's view of its rate limits per tenant
GET /distributor/all_user_stats
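The push payload shown above, together with the tenant header, can be assembled with the standard library alone. This is a minimal sketch: build_push_request is a hypothetical helper, and the address and tenant are the example values used earlier in this section. The request is built but deliberately not sent.

```python
import json
import urllib.request


def build_push_request(base_url: str, tenant: str, labels: dict, line: str, ts_ns: int) -> urllib.request.Request:
    """Build (but do not send) a push request carrying one log line for one stream."""
    body = {"streams": [{"stream": labels, "values": [[str(ts_ns), line]]}]}
    return urllib.request.Request(
        url=f"{base_url}/loki/api/v1/push",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "X-Scope-OrgID": tenant},
        method="POST",
    )


req = build_push_request(
    "http://loki:3100", "tenant-1", {"app": "test"}, "hello from Python", 1_700_000_000_000_000_000
)
print(req.get_method(), req.full_url)
# urllib.request.urlopen(req) would send it; omitted so the sketch runs without a Loki instance.
```

The timestamp goes in as a string of nanoseconds, matching the JSON example above, and the X-Scope-OrgID header only matters when auth_enabled is true.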
Operations: must-watch metrics
# Cardinality canary — the single most important Loki metric.
sum(loki_ingester_memory_streams)
# Are we dropping logs? Group by reason to know why.
sum by (reason) (rate(loki_discarded_samples_total[5m]))
# Ingest rate, headline number.
sum(rate(loki_distributor_bytes_received_total[5m]))
# Write path latency. >500ms is concerning.
histogram_quantile(0.99,
sum(rate(loki_request_duration_seconds_bucket{route="loki_api_v1_push"}[5m])) by (le))
# Read path latency. >5s is concerning for interactive use.
histogram_quantile(0.99,
sum(rate(loki_request_duration_seconds_bucket{route=~"loki_api_v1_query.*"}[5m])) by (le))
# Chunk cache hit rate. Below 90% = pain.
sum(rate(loki_cache_hits{cache="chunks"}[5m]))
/ sum(rate(loki_cache_fetched_keys{cache="chunks"}[5m]))
# Ingester health from distributor's view.
sum(rate(loki_distributor_ingester_append_failures_total[5m])) # should be 0
# WAL corruption. Should always be 0.
sum(rate(loki_ingester_wal_corruptions_total[5m]))
The configuration knobs that actually matter
Twenty-plus pages of YAML reference, but these are the ones you’ll touch:
limits_config:
  ingestion_rate_mb: 10 # Per-tenant MB/s. Set per workload.
  ingestion_burst_size_mb: 20 # Burst allowance.
  max_global_streams_per_user: 50000 # Cardinality ceiling. THE limit.
  max_label_names_per_series: 15 # Enforce label parsimony.
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  retention_period: 168h # 7 days. Adjust.
  split_queries_by_interval: 15m # Query parallelism granularity.
  max_query_parallelism: 32 # Per-tenant query fan-out cap.
  max_query_series: 500 # Cap series in a single query (cardinality safety).
ingester:
  chunk_target_size: 1572864 # 1.5MB compressed. The default for a reason.
  chunk_idle_period: 30m # Flush after N min idle.
  max_chunk_age: 2h # Hard cap on chunk age.
  chunk_encoding: snappy # Fast. Use this unless you're disk-bound.
  wal:
    enabled: true # Always.
    checkpoint_duration: 5m
schema_config:
  configs:
    - from: "2024-01-01"
      store: tsdb # Always.
      object_store: s3
      schema: v13 # Always for new clusters.
      index:
        prefix: loki_index_
        period: 24h
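The chunk settings interact with per-stream throughput, and a quick calculation shows how. The sketch below is a simplified model that assumes a stream writes at a steady compressed rate and ignores the idle timeout; flush_trigger is a hypothetical name, and the defaults mirror the chunk_target_size and max_chunk_age values in the config above.

```python
def flush_trigger(
    compressed_bytes_per_sec: float,
    chunk_target_size: int = 1_572_864,  # matches chunk_target_size above
    max_chunk_age_s: int = 7200,         # matches max_chunk_age: 2h
) -> tuple[str, int]:
    """Return which limit a steady stream hits first and the resulting chunk size in bytes."""
    seconds_to_fill = chunk_target_size / compressed_bytes_per_sec
    if seconds_to_fill <= max_chunk_age_s:
        return ("size", chunk_target_size)
    return ("age", int(compressed_bytes_per_sec * max_chunk_age_s))


print(flush_trigger(1000))  # ('size', 1572864): fills in about 26 minutes
print(flush_trigger(10))    # ('age', 72000): flushed at 2h, about 70 KB
```

This is the other half of the cardinality story: splitting one busy stream into many quiet ones pushes each of them into the "age" case, producing many small chunks that are less efficient to store and query.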
10. How It Breaks
The dominant failure modes, what they look like, and what to do.
Failure 1: Ingester OOM
Symptoms: Ingester pods OOMKilled by Kubernetes. loki_ingester_memory_streams was climbing. Push latency spiked. Some logs may be lost (anything not yet written to the WAL at the moment of death).
Root cause: Almost always cardinality explosion. Someone added a label with unbounded values. Connects to Core Idea 2 (stream count is everything).
Diagnose:
# Find which tenant exploded
topk(5, sum by (tenant) (loki_ingester_memory_streams))
# Then find which labels in that tenant have high cardinality
# (logcli prints the number of unique values for each label)
logcli series '{}' --analyze-labels --org-id=problem-tenant --since=1h
Fix: Identify the offending label. Either remove it from the agent’s label set, or move the value into structured metadata. Restart the agents. Eventually the old stream entries time out (chunk_idle_period) and ingester memory recovers. In the meantime, give ingesters more RAM as a band-aid.
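If you have a list of stream label sets in hand (for example, exported from the series API), finding the offending label is a matter of counting distinct values per label name. The sketch below does that on hypothetical sample data; label_cardinality is an illustrative helper, not part of any Loki tooling.

```python
from collections import defaultdict


def label_cardinality(streams: list[dict[str, str]]) -> list[tuple[str, int]]:
    """Count distinct values per label name, highest cardinality first."""
    values = defaultdict(set)
    for labels in streams:
        for name, value in labels.items():
            values[name].add(value)
    return sorted(((name, len(vals)) for name, vals in values.items()),
                  key=lambda item: (-item[1], item[0]))


# Hypothetical stream label sets, for illustration only.
streams = [
    {"app": "api", "namespace": "prod", "pod": "api-7f8b9c"},
    {"app": "api", "namespace": "prod", "pod": "api-5d2e1a"},
    {"app": "api", "namespace": "prod", "pod": "api-9c3f4b"},
    {"app": "web", "namespace": "prod", "pod": "web-1a2b3c"},
]

print(label_cardinality(streams))  # [('pod', 4), ('app', 2), ('namespace', 1)]
```

The label at the top of the list is the first candidate for removal or for moving into structured metadata.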
Failure 2: “Some of my logs are missing”
Symptoms: Recent logs (last few minutes) show in queries but disappear after a while. Or: logs from one pod aren’t appearing at all.
Root cause: Several possibilities, in rough order of likelihood:
- The agent isn’t sending: check Alloy/Promtail logs.
- The distributor rejected the lines: check loki_discarded_samples_total by reason.
- An ingester is in an unhealthy state but still in the ring; the distributor wrote to it but the data was lost (only matters with RF=1).
- A querier can’t reach the ingester holding the data, so reads return partial results.
Diagnose:
# Are agents sending? Look at Alloy's metrics on its own /metrics endpoint.
loki_write_dropped_bytes_total
loki_write_sent_bytes_total
# Are distributors accepting?
rate(loki_distributor_bytes_received_total[5m])
# Are samples being dropped, and why?
sum by (reason) (rate(loki_discarded_samples_total[5m]))
# Is the ingester ring healthy?
curl http://loki/ring # All ingesters should be ACTIVE.
Fix: Depends on the cause. Use the Loki Canary sidecar to get a continuous signal — it writes known sequences and reads them back, alerting if any are missing or delayed.
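The canary idea (write known sequences, read them back, flag anything missing or late) reduces to a small comparison. The sketch below models only that comparison and is not the Loki Canary's implementation; the audit function, the timestamps, and the 10-second threshold are all hypothetical.

```python
def audit(sent: dict[int, float], received: dict[int, float], max_delay_s: float) -> tuple[list[int], list[int]]:
    """Compare written and read-back entries, keyed by sequence number.

    sent maps sequence number -> time written; received maps sequence number -> time read back.
    Returns (missing sequence numbers, sequence numbers that arrived later than max_delay_s).
    """
    missing = sorted(seq for seq in sent if seq not in received)
    delayed = sorted(
        seq for seq, read_at in received.items()
        if seq in sent and read_at - sent[seq] > max_delay_s
    )
    return missing, delayed


# Hypothetical timings in seconds, for illustration only.
sent = {1: 0.0, 2: 1.0, 3: 2.0, 4: 3.0}
received = {1: 0.5, 2: 1.4, 4: 45.0}

print(audit(sent, received, max_delay_s=10.0))  # ([3], [4])
```

An alert on either list being non-empty gives the explicit "recent data is missing" signal that Loki's own query path does not raise.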
Failure 3: Queries are slow (or timing out)
Symptoms: Grafana panels take forever. Queries time out. Users complain.
Root cause: Generally one of:
- The selector is too wide (touches too many streams).
- The time range is too long with no caching benefit.
- Cache hit rate has dropped (memcached pod restart, sizing issue).
- Queriers are under-provisioned.
- The compactor is behind, so the index is fragmented.
- Object storage is slow today.
Diagnose:
# Cache hit rate — should be > 95%
sum(rate(loki_cache_hits{cache="chunks"}[5m]))
/ sum(rate(loki_cache_fetched_keys{cache="chunks"}[5m]))
# Query queueing — are queriers saturated?
loki_query_scheduler_queue_length
# Per-query stats (Loki returns these in the response)
# Look for chunks_downloaded, bytes_processed, chunks_processed
# A query touching 10000s of chunks is suspect.
Fix:
- For wide-selector queries: educate users, set max_query_series, add max_query_parallelism per tenant.
- For under-cached: scale memcached, check hit rates per cache tier.
- For under-provisioned: scale queriers, decrease split_queries_by_interval to parallelize more.
- For compactor-behind: scale compactor, give it more CPU.
Failure 4: “I deleted logs but they’re still there” (and vice versa)
Symptoms: Retention policy says 7 days but you’re seeing 14-day-old data. Or: you deleted a tenant’s data but storage costs aren’t dropping.
Root cause: Loki’s deletions are logical. The compactor marks chunks as deleted in its tracking; the actual object-storage deletion happens on a separate schedule. If the compactor is behind, both retention and deletion lag.
Diagnose:
loki_compactor_oldest_pending_delete_request_age_seconds
loki_compactor_pending_delete_requests_count
# How long since retention was last applied successfully?
time() - loki_compactor_apply_retention_last_successful_run_timestamp_seconds
Fix: Scale the compactor. Check its logs for errors against object storage. If compaction is fundamentally too slow, look at horizontal compactor scaling (newer Loki feature) to split by tenant.
Failure 5: WAL replay takes forever on restart
Symptoms: Ingester restarts and takes 20+ minutes to come back. During that time, the ring shows the ingester as UNHEALTHY and writes are routed to others.
Root cause: WAL too large — too much data accumulated since last checkpoint. Either checkpoints are too infrequent (checkpoint_duration too high) or the previous shutdown was abnormal.
Fix: Set checkpoint_duration to 1-5 minutes. Provision enough WAL disk space (10s of GB per ingester). Keep ingesters from being killed mid-write: size memory limits so they are not OOMKilled, and set terminationGracePeriodSeconds generously so they can flush gracefully.
The debugging workflow when you don’t know what’s wrong
A general triage sequence:
- /ring: Are all ingesters ACTIVE? An ingester in LEAVING or UNHEALTHY state is your prime suspect.
- loki_panic_total: Anything non-zero is “go read the logs of that pod immediately.”
- loki_distributor_bytes_received_total: Is ingest still happening at expected rate?
- loki_discarded_samples_total by reason: Is anything being dropped silently?
- loki_ingester_memory_streams: Is cardinality stable or climbing?
- Cache hit rates: Did caching collapse?
- Query queue length: Are queries piling up?
- Object storage latency (your cloud provider’s metric, not Loki’s): Is S3 slow today?
This sequence catches >90% of real Loki incidents in ten minutes.
11. The Downsides / Disadvantages
Loki is a good system. It is also a trade — the elegance of “index labels, grep chunks” comes with structural costs that don’t go away with experience or better config. Here’s the honest accounting.
Downside 1: Arbitrary text search is fundamentally slow.
The downside: If you don’t know which labels to filter on, Loki is a much worse experience than Elasticsearch. “Find every log containing this string anywhere in the last 7 days” is a query Loki can answer, but it has to load and decompress every chunk in the time range and grep through it. Hours of compute for what Elasticsearch returns in seconds.
Where it comes from: Connects directly to Core Idea 1 — the deliberate choice not to index log content. The label index is tiny because the content index doesn’t exist. You can’t have both.
What it costs you in practice: Wide investigations require either patience (multi-minute queries), money (more queriers and cache to parallelize harder), or behavior change (educating engineers to always start with a label filter). Security teams often hate Loki because their workflows are inherently text-first — “show me every login attempt from this IP.”
Dealbreaker when: Your dominant query workflow is genuinely “I don’t know which service or namespace, just find the text.” If this describes you, you’ll fight Loki forever. Use Elasticsearch or ClickHouse.
Common workaround that doesn’t help: Adding more labels to “make queries selective.” This is just trading the original problem (slow grep) for a new one (cardinality explosion). It feels productive; it isn’t.
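A toy model makes the cost structure visible. The sketch below is not how Loki stores data; it is a minimal in-memory stand-in (hypothetical STORE and query names, gzip in place of Loki's chunk encodings) showing that the label selector decides how many chunks get decompressed and scanned.

```python
import gzip

# Hypothetical toy store: each stream is a label set plus a list of compressed chunks.
STORE = [
    ({"namespace": "prod", "app": "payments"}, [gzip.compress(b"ok\nconnection_refused to db\n")]),
    ({"namespace": "prod", "app": "api"}, [gzip.compress(b"ok\nok\n")]),
    ({"namespace": "dev", "app": "api"}, [gzip.compress(b"connection_refused to cache\n")]),
]


def query(selector: dict[str, str], needle: str) -> tuple[list[str], int]:
    """Select streams by exact label match, then decompress and grep every selected chunk."""
    lines, chunks_scanned = [], 0
    for labels, chunks in STORE:
        if all(labels.get(name) == value for name, value in selector.items()):
            for chunk in chunks:
                chunks_scanned += 1
                for line in gzip.decompress(chunk).decode().splitlines():
                    if needle in line:
                        lines.append(line)
    return lines, chunks_scanned


print(query({"namespace": "prod", "app": "payments"}, "connection_refused"))
# (['connection_refused to db'], 1)
print(query({}, "connection_refused"))
# (['connection_refused to db', 'connection_refused to cache'], 3)
```

With a selective label filter, one chunk is scanned; with no filter, every chunk in the store is decompressed to answer the same text search. Scale the store up by a few orders of magnitude and that difference is the whole downside.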
Downside 2: Cardinality is a sharp edge that cuts users repeatedly.
The downside: The boundary between “good label” and “cardinality bomb” is invisible from outside the system and easy to cross by accident. A pod label in a non-restarting StatefulSet is fine; the same label in a Kubernetes Deployment is not. A version label is fine if you deploy weekly, bad if you deploy hourly. There’s no compile-time check, no Helm chart guard, just operational pain.
Where it comes from: Stream count is the central scaling axis of the system (Core Idea 2), but the label set is configured at agents — sometimes by people who didn’t build the Loki cluster. The blast radius lives downstream from the decision.
What it costs you in practice: Real production teams (the dev.to post is a great example) actively police labels via reviews, dashboards, alerts on memory_streams, and runbooks. The cognitive load is permanent.
Dealbreaker when: Your org can’t enforce labeling discipline across all teams that ship logs. With three teams and a strong platform owner, you’ll manage. With twenty product teams autonomously configuring their own agents, expect chronic problems.
Downside 3: The operational surface is wide.
The downside: A production-grade Loki has at least 7 distinct components (distributor, ingester, querier, query frontend, query scheduler, index gateway, compactor), each with its own scaling characteristics, failure modes, and config sections. Plus memcached (three tiers). Plus the object storage backend. Plus the agent fleet. Plus the metric pipeline you need to monitor Loki itself (Prometheus, Grafana, dashboards). At meaningful scale, the dev.to post documents ~30 PromQL queries you need to monitor regularly, plus 7+ alert rules they consider essential.
Where it comes from: Microservices for scaling-each-bottleneck-independently. The architecture is correct, but it’s a lot.
What it costs you in practice: A platform team to operate it. Estimates vary, but most orgs running Loki at TB/day scale have at least one full-time engineer who can pattern-match Loki failure modes, plus on-call rotation that includes it. The “easy to operate” marketing is relative to Elasticsearch, not absolute.
Dealbreaker when: You don’t have a dedicated platform/SRE team. You can run monolithic Loki without much overhead (the docs even recommend this for small setups), but you’ll grow out of it before you want to.
Downside 4: You can’t reformulate questions you didn’t anticipate.
The downside: Because labels are baked in at ingest, the queries you can run fast are limited to the labels you chose two months ago. If your team realizes “we should have been labeling by customer_tier,” you can start now — but you can’t retroactively label historical data. The only retroactive option is parse-at-query-time, which is slow.
Where it comes from: The label index is computed at write time, not query time (Core Idea 1 again). It’s append-only.
What it costs you in practice: Your debugging vocabulary is constrained by past architectural choices. Compared to Elasticsearch (where every field is queryable as long as it was logged) or ClickHouse (where you can re-index), Loki is rigid.
Dealbreaker when: Your team’s queries are highly exploratory and unpredictable. You’re discovering new dimensions to slice by every week. Loki will frustrate you.
Downside 5: The “free” version has steep hidden costs.
The downside: Loki is AGPLv3 open source. There’s no license fee. But running it in production requires: a multi-AZ object storage budget, a memcached fleet (multiple tiers), enough EC2/GKE compute for 7+ components, a Prometheus + Grafana stack to monitor it, and engineering time. For most orgs at meaningful scale, the all-in cost is in the high five to mid six figures per year.
Where it comes from: Distributed systems have distributed-system costs, regardless of license.
What it costs you in practice: People reach for self-hosted Loki to escape Datadog bills and then realize the savings are smaller than expected once they account for the platform team’s time. Grafana Cloud Logs (the managed offering) often makes more economic sense than self-hosting up to mid-scale.
Dealbreaker when: Honestly, never — self-hosting can be cheaper than Datadog/Splunk. But the savings need to be modeled with full operational cost included. “It’s free” is a lie that’s bitten many teams.
Downside 6: Multi-tenant fairness is opt-in and partial.
The downside: One tenant can hurt others. Yes, there’s per-tenant rate limiting, per-tenant query parallelism, the query scheduler does per-tenant fair queueing. But it’s still possible for a single tenant to issue a query so expensive that it consumes significant cache space, evicts other tenants’ hot chunks, and degrades everyone’s latency. Shuffle sharding helps but doesn’t eliminate this.
Where it comes from: Memory caches don’t have hard tenant isolation. Bloom filters, chunk caches, and result caches are shared across tenants.
What it costs you in practice: Platform teams report ongoing noisy-neighbor incidents at scale. You’ll need quotas, education, and the willingness to enforce limits even when product teams complain.
Dealbreaker when: You have hostile or wildly heterogeneous tenants where one tenant’s misuse hurting others is unacceptable. Consider physical isolation (separate clusters per major tenant) over logical multi-tenancy.
Downside 7: Schema migrations are sticky decisions.
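The downside: Choosing the wrong schema (boltdb-shipper vs tsdb, v11 vs v13) early on creates a long migration tail later. The migration path works: you add a new schema period with a future start date, and Loki writes new data under it while continuing to read old data under the old one. But both index formats stay live until the old data ages out of your retention window, and the cutover takes careful planning and operational care. Many teams put this off for years.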
Where it comes from: Index files are written at ingest, immutably. You can’t “re-index” old data without effectively re-ingesting it.
What it costs you in practice: Real engineering time. A schema migration on a TB/day cluster is a multi-week project with risk.
Dealbreaker when: Never, but pick the right schema up front (TSDB + v13 + structured metadata enabled) so you never face this.
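A sketch of why the choice is sticky, assuming the usual behavior that each schema entry applies from its start date onward. The period list and the `period_for` helper are hypothetical; they model the lookup only, not Loki's code.

```python
from datetime import date

# Hypothetical schema history: an old period and a newer TSDB period.
SCHEMA_PERIODS = [
    {"from": date(2022, 1, 1), "store": "boltdb-shipper", "schema": "v11"},
    {"from": date(2024, 6, 1), "store": "tsdb", "schema": "v13"},
]


def period_for(day: date, periods=SCHEMA_PERIODS) -> dict:
    """Return the latest period whose start date is on or before `day`."""
    active = [p for p in periods if p["from"] <= day]
    if not active:
        raise ValueError(f"no schema period covers {day}")
    return max(active, key=lambda p: p["from"])


print(period_for(date(2024, 5, 31))["schema"])  # data from before the cutover
print(period_for(date(2024, 6, 1))["schema"])   # data from the cutover onward
```

Data ingested before the cutover keeps resolving to the old entry, so the old index format remains part of the system until that data is deleted by retention.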
Downside 8: Native OTLP support is real but the model translation isn’t clean.
The downside: Loki 3.0+ accepts OpenTelemetry Protocol natively, which sounds great. But OTLP’s data model (resource attributes, scope attributes, log record attributes) doesn’t map cleanly to Loki’s (labels + structured metadata + line). The translation is configurable, but every team that adopts OTLP has to think carefully about which attributes become labels (cardinality risk) vs. structured metadata vs. parts of the log body.
Where it comes from: Two systems with different mental models being bridged. OTLP was designed for traces first; logs are an afterthought there. Loki was designed for the Prometheus model; OTLP is grafted on.
What it costs you in practice: Onboarding new services takes thought. The wrong default (every OTLP attribute as a label) is a cardinality bomb waiting to happen. Grafana ships a sensible default list, but it’s not universal — k8s.pod.name and service.instance.id are explicitly called out as risky-by-default.
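The decision each team has to make can be sketched as an allowlist. The attribute set below is a hypothetical example, not Grafana's default list, and `split_attributes` is an illustrative helper rather than Loki's translation code. The dot-to-underscore renaming is an assumption about how attribute names are normalized.

```python
# Hypothetical allowlist of resource attributes promoted to index labels.
INDEX_LABEL_ATTRIBUTES = {
    "service.name",
    "service.namespace",
    "k8s.namespace.name",
    "k8s.cluster.name",
}


def split_attributes(resource_attrs, record_attrs, allowlist=INDEX_LABEL_ATTRIBUTES):
    """Route OTLP attributes to index labels or structured metadata."""
    labels, metadata = {}, {}
    for key, value in resource_attrs.items():
        target = labels if key in allowlist else metadata
        target[key.replace(".", "_")] = str(value)
    # Per-record attributes are unbounded by nature, so none become labels.
    for key, value in record_attrs.items():
        metadata[key.replace(".", "_")] = str(value)
    return labels, metadata


labels, metadata = split_attributes(
    {"service.name": "api", "k8s.pod.name": "api-7d9f"},
    {"user.id": 42},
)
print(labels)    # bounded values only
print(metadata)  # everything high-cardinality
```

The risky default described above is equivalent to an allowlist that contains every attribute key.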
Downside 9: When it breaks, the failure modes are subtle.
The downside: Loki doesn’t have many “the whole thing is down” failures (kudos to its design). What it does have is gradual degradation: queries get slower; some logs go missing; ingester memory creeps up. These are harder to alert on than hard failures. You need a real metrics + alerting + canary investment to catch problems before users do.
Where it comes from: Distributed systems with replicas and fallbacks tend to degrade gracefully — which is good, except that “graceful degradation” can also mean “silently degraded for two weeks before anyone noticed.”
What it costs you in practice: You need the Loki Canary (or equivalent), the recommended Loki mixin dashboards, and explicit SLOs (e.g., “P99 query latency under 5s; missing-data canary lag under 30s”). If you don’t have these, your Loki will be subtly broken for periods you can’t measure.
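As a rough illustration of turning those example SLOs into a check, here is a sketch that evaluates a window of samples against the two thresholds quoted above. The function names and the input shape (plain lists of seconds) are assumptions; in practice this logic would live in alerting rules over the canary and query metrics.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile; pct is an integer in 1..100."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def slo_breaches(query_latencies_s, canary_lags_s, max_p99_s=5.0, max_lag_s=30.0):
    """Return a human-readable list of SLO breaches for one window."""
    breaches = []
    p99 = percentile(query_latencies_s, 99)
    if p99 > max_p99_s:
        breaches.append(f"p99 query latency {p99:.1f}s exceeds {max_p99_s:.1f}s")
    worst_lag = max(canary_lags_s)
    if worst_lag > max_lag_s:
        breaches.append(f"canary lag {worst_lag:.1f}s exceeds {max_lag_s:.1f}s")
    return breaches


# Mostly fast queries with a slow tail, and a canary that is keeping up.
latencies = [0.2] * 98 + [6.0, 7.0]
lags = [2.0, 3.5, 4.0]
print(slo_breaches(latencies, lags))
```

The median of this window looks healthy; only the tail percentile exposes the degradation, which is why a dashboard of averages misses it.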
12. The Taste Test
What experienced Loki usage looks like, vs. what tutorials produce.
Labeling strategy
Beginner:
loki.write "default" {
  endpoint { url = "..." }
  external_labels = {
    cluster = "prod",
    pod = constants.hostname,
    container = "api",
    user_id = "unknown", // placeholder; will be set by app
    region = "us-east-1",
    version = env("VERSION"),
    level = "info", // default; pipeline will override
    trace_id = "", // placeholder
    request_id = "",
    customer = "",
  }
}
Ten labels. Several of them dynamic (user_id, trace_id, request_id, customer). Stream count will be in the millions within a week. This will fail.
Experienced:
loki.write "default" {
  endpoint { url = "..." }
  external_labels = {
    cluster = "prod",
    namespace = "payments",
    app = "api",
    region = "us-east-1",
  }
}
// user_id, request_id, trace_id are attached as structured metadata in the pipeline,
// not as index labels. Customer is queried via | json | customer="acme".
Four labels. All static. All bounded. All things users actually filter by. High-cardinality stuff goes to structured metadata or parse-at-query-time.
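The arithmetic behind the two configurations above: the worst-case stream count is the product of the number of distinct values each label can take. The cardinalities below are hypothetical fleet-wide numbers chosen for illustration.

```python
from math import prod


def max_streams(label_cardinalities: dict) -> int:
    """Upper bound on streams: product of distinct values per label."""
    return prod(label_cardinalities.values())


# Hypothetical distinct-value counts across a whole fleet.
bounded = {"cluster": 3, "namespace": 20, "app": 50, "region": 2}
with_user_id = {**bounded, "user_id": 10_000}

print(max_streams(bounded))       # static, bounded labels
print(max_streams(with_user_id))  # one unbounded label added
```

Real stream counts are lower than this bound because labels are correlated (an app usually lives in one namespace), but one unbounded label still multiplies whatever the real number is.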
Query patterns
Beginner:
{cluster=~".+"} |= "error"
Every stream. Probably scans most of the cluster’s storage for a 1h query. Will be slow and expensive.
{app="api"} | json
Parses every line for fields you might not even use. Slow.
{app="api"} | json | line_format "{{.message}}" |= "error"
Parses first, filters second. The filter could have happened before the parse step and eliminated 99% of the lines.
Experienced:
{cluster="prod", namespace="payments", app="api"} |= "error" != "expected_error"
Tight selector, line filter early, exclusion to skip known-noise. Fast.
{cluster="prod", app="api"} |= "ERROR" | json | duration > 1s | line_format "{{.timestamp}} {{.route}} {{.duration}} {{.message}}"
Filter on text first (cheap), then parse the survivors, then filter on a structured field, then reformat. This is the canonical fast-Loki query.
quantile_over_time(0.99,
  {cluster="prod", app="api"} |= "request" | logfmt | unwrap duration_ms [5m]
) by (route)
Realistic p99 latency from logs. Wrapped in quantile_over_time for use in a dashboard or alert.
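The filter-before-parse ordering in the queries above can be sketched outside LogQL. This is an analogy in plain Python, not how queriers are implemented: it counts how many lines pay the JSON parsing cost under each ordering. The sample lines are made up.

```python
import json


def parse_then_filter(lines, needle):
    """Parse every line, then filter on the parsed message field."""
    parsed, matches = 0, []
    for line in lines:
        record = json.loads(line)  # every line pays the parse cost
        parsed += 1
        if needle in record.get("message", ""):
            matches.append(record)
    return matches, parsed


def filter_then_parse(lines, needle):
    """Substring-check the raw line first, parse only the survivors."""
    parsed, matches = 0, []
    for line in lines:
        if needle not in line:     # cheap check on the raw text
            continue
        matches.append(json.loads(line))
        parsed += 1
    return matches, parsed


lines = [json.dumps({"message": "ok", "route": "/pay"})] * 99
lines.append(json.dumps({"message": "error: timeout", "route": "/pay"}))

slow_matches, slow_parsed = parse_then_filter(lines, "error")
fast_matches, fast_parsed = filter_then_parse(lines, "error")
print(slow_parsed, fast_parsed)  # lines parsed under each ordering
```

Both orderings return the same match here, but the second parses only the lines that survive the text filter. The raw-text filter can also match the needle in other fields, which is why a structured-field filter after the parse is still useful.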
Config
Beginner Helm values:
loki:
  schemaConfig:
    configs:
      - from: 2020-01-01
        store: boltdb-shipper
        schema: v11
  storage:
    type: filesystem # because the tutorial said so
BoltDB shipper. Schema v11. Local filesystem. Will fail to scale, won’t support structured metadata, will need to be migrated.
Experienced Helm values:
loki:
  schemaConfig:
    configs:
      - from: "2024-01-01"
        store: tsdb
        object_store: s3
        schema: v13
        index:
          prefix: index_
          period: 24h
  storage:
    type: s3
    s3:
      endpoint: s3.us-east-1.amazonaws.com
      region: us-east-1
      bucketnames: my-loki-chunks
  limits_config:
    allow_structured_metadata: true
    max_global_streams_per_user: 50000
    max_label_names_per_series: 15
    retention_period: 168h
    reject_old_samples: true
    reject_old_samples_max_age: 168h
    split_queries_by_interval: 15m
TSDB. Schema v13. S3. Structured metadata enabled. Stream cap. Label name cap. Retention set (it is only enforced if the compactor runs with retention enabled). Old samples rejected. Queries parallelized at 15-minute granularity.
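To make the 15-minute split concrete, here is a back-of-the-envelope sketch of how many subqueries one request fans out into. It ignores boundary alignment and any further sharding, so treat the result as an approximate lower bound; `subquery_count` is an illustrative helper, not a Loki function.

```python
import math
from datetime import timedelta


def subquery_count(query_range: timedelta, split_interval: timedelta) -> int:
    """Approximate number of time-split subqueries for one request."""
    if split_interval <= timedelta(0):
        raise ValueError("split interval must be positive")
    return math.ceil(query_range / split_interval)


interval = timedelta(minutes=15)
print(subquery_count(timedelta(hours=1), interval))   # a 1h dashboard panel
print(subquery_count(timedelta(hours=24), interval))  # a 24h investigation
print(subquery_count(timedelta(days=7), interval))    # the full retention window
```

A smaller interval buys more parallelism per query at the price of more scheduling overhead and more cache entries, which is the trade-off behind choosing the value.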
Operational signals
Beginner dashboard: A graph of “logs per second received.” That’s it.
Experienced dashboard: Eight to twelve panels per logical area:
- Write path: ingest rate (bytes + lines), push P99 latency, discarded samples by reason, distributor → ingester failures.
- Cardinality: total streams, streams per tenant, stream creation rate (anomaly indicator).
- Ingester health: memory chunks, chunk flush rate, chunk age at flush, WAL bytes in use, WAL corruption (alert on any).
- Read path: query latency P50/P99, queries per second by status code, query queue length.
- Cache: hit rate per cache tier (chunks, results, index), memcached pod count.
- Compactor: pending delete requests, time since last successful retention apply.
With alerts wired up for: any loki_panic_total, any WAL corruption, sustained discarded samples, ingester append failures, P99 push latency > 1s, chunk cache hit rate < 90%, abnormally high stream count, compactor stalled.
The mark of someone who actually runs Loki
- They can quote their cluster’s active stream count from memory.
- They’ve written at least one recording rule because the same dashboard query was running too often.
- They have an opinion about chunk_target_size, chunk_idle_period, and max_chunk_age and can defend their values with reference to ingest rate.
- They’ve shipped at least one PR or runbook entry about “do not add this label” to their org.
- They’ve successfully restored from a failed ingester without data loss because their WAL + RF were configured correctly.
- They know which agent their org uses (Alloy/Promtail/Fluent Bit/OTel) and have at least a directional opinion about whether to migrate.
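Defending chunk settings with reference to ingest rate comes down to arithmetic like the sketch below. It assumes a stream that writes steadily (so chunk_idle_period never fires), a hypothetical 8x compression ratio, and defaults of roughly 1.5 MB for chunk_target_size and 2h for max_chunk_age; check the values for your Loki version. `flush_estimate` is an illustrative helper.

```python
def flush_estimate(raw_bytes_per_sec, compression_ratio=8.0,
                   chunk_target_size=1_572_864, max_chunk_age_s=2 * 3600):
    """Return (flush reason, seconds until flush, compressed chunk bytes)."""
    if raw_bytes_per_sec <= 0 or compression_ratio <= 0:
        raise ValueError("rate and compression ratio must be positive")
    compressed_rate = raw_bytes_per_sec / compression_ratio
    seconds_to_fill = chunk_target_size / compressed_rate
    if seconds_to_fill <= max_chunk_age_s:
        return "target_size", seconds_to_fill, chunk_target_size
    # The chunk ages out before it fills, so it is flushed undersized.
    return "max_chunk_age", max_chunk_age_s, compressed_rate * max_chunk_age_s


quiet = flush_estimate(100)     # a stream writing about 100 B/s of raw logs
busy = flush_estimate(10_000)   # a stream writing about 10 KB/s of raw logs
print(quiet)
print(busy)
```

Under these assumptions the quiet stream flushes chunks far below the target size, so a fleet of quiet streams produces many small objects; that is the kind of reasoning the settings should be defended with.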
13. Where to Go Deeper
A curated, opinionated list. Read these in roughly this order.
1. The official architecture page.
grafana.com/docs/loki/latest/get-started/architecture
The single best primary source. Re-read it after this document — you’ll get more out of it now.
2. The official label best practices page.
grafana.com/docs/loki/latest/get-started/labels/bp-labels
The exact thing to hand to a colleague who just added request_id as a label. Short and direct.
3. “Running Grafana Loki in Production: What We Actually Learned” (Sriram Rajendran, dev.to, 2026).
The most useful operational write-up published in the last year. Real numbers, real configs, real PromQL. Reference it whenever you’re sizing a Loki cluster.
4. The Grafana Labs debugging blog: “A (de)bug’s life: Diagnosing and fixing performance issues in Grafana Loki’s read path.”
The maintainers describing how they diagnose Loki problems in production. Worth re-reading every time you start an investigation.
5. The Loki 3.0 release blog.
grafana.com/blog/grafana-loki-3-0-release-all-the-new-features
Bloom filters, structured metadata, native OTLP. The shape of where Loki is going.
6. The LogQL query reference.
grafana.com/docs/loki/latest/query/log_queries
Bookmark this. You’ll come back to it monthly.
7. The Loki Improvement Documents (LIDs).
grafana.com/docs/loki/latest/community/lids
Like RFCs for the project. Reading these (especially LID 0002 on remote rule evaluation and 0003 on query fairness) gives you the maintainers’ mental model.
8. Hands-on: run the Loki Canary in your cluster.
grafana.com/docs/loki/latest/operations/loki-canary
Not a reading recommendation — a doing recommendation. Deploy the canary, generate some load, and observe what its metrics tell you. There’s no better way to internalize the read/write/ingest interactions.
14. The Final Verdict
Loki is a tool with a strong, narrow opinion: most log queries are filter-by-source-then-grep, and we will be cheaper than everything else for that workload. If your queries match that shape — and for the typical Kubernetes-and-Prometheus shop, they do — Loki is genuinely the right answer. The economics are real (10x cheaper than Elasticsearch is not marketing, it’s structurally true once you’ve absorbed the architecture), the operational model fits naturally alongside Prometheus, and the LogQL query language, while initially awkward if you’re coming from Lucene, becomes muscle memory within a week.
What it gets profoundly right: the decision to throw away the inverted index. This is not a small thing — it took courage to say “the dominant log query is label-filter-then-grep, and that’s enough.” Most systems try to be everything to everyone, and the price is operational complexity that compounds forever. Loki picked a side. The metrics-and-logs unification around labels and LogQL/PromQL is the other thing it nails — a junior engineer who knows Prometheus can be productive in Loki in a day.
What it costs you: cardinality discipline as a permanent organizational concern, slow arbitrary-text-search, and an operational surface that’s substantial despite the marketing. The “easy to run” claim is comparative, not absolute. If you don’t have a platform team that can keep ingesters from OOMing and educate product teams about labels, Loki will fight you. And there’s a real ceiling on how exploratory your investigations can be — if your team often asks “what if we sliced this by an attribute we never indexed?”, the answer in Loki is “scan everything, and wait a while.” Elasticsearch shrugs at that question; Loki sweats.
Who should reach for this: organizations on Kubernetes + Prometheus + Grafana whose dominant log workflow is targeted debugging (“I know which service had problems; show me its logs around the incident”). Cost-conscious teams with engineering capacity to operate distributed systems. Teams that value the same operational model across metrics, logs, and traces. Most modern observability shops at most modern companies.
Who shouldn’t: security teams who need full-text search across everything as their primary workflow (use Elasticsearch). Data-analysis teams treating logs as a queryable dataset for business questions (use ClickHouse, BigQuery, or Snowflake). Small organizations without platform engineering capacity who would be better served by a SaaS option (Grafana Cloud Logs is, ironically, often the right call here). Anyone whose query pattern is genuinely “I don’t know which service, just find the text.”
What to believe and what not to believe:
- Believe that labels are the most important decision you’ll make. Decide carefully and revisit rarely.
- Believe that caching is the entire performance story. Memcached sizing matters more than querier count.
- Don’t believe that you can “just add more labels” to make queries faster. You can’t. You’ll make everything slower and more expensive.
- Don’t believe that Loki is “operationally simple.” It’s simpler than Elasticsearch. It’s not simple. Plan accordingly.
The hard-won line: Loki rewards teams that take a strong position on what they will and won’t query, and punishes teams that want flexibility for its own sake. If you can be opinionated about your observability data, Loki will be a partner. If you can’t, it’ll be a series of escalating incidents that look like cardinality explosions but are really an unwillingness to choose.
Pick what you’ll query. Label that, only that, and nothing else. Grep the rest. That is the whole game.
The ideas are mine. The writing is AI-assisted.