Log Aggregation Architecture for Distributed Systems
Design a log aggregation system for distributed applications — collection, transport, storage, indexing, and building dashboards that help you find problems fast.
Strategic Systems Architect & Enterprise Software Developer
When your application runs on a single server, logs are simple — they are in a file, you tail it, you find the problem. When your application runs across ten services on fifty containers, logs are scattered. The request that failed touched four services, and the relevant log lines are in four different containers that might have been replaced since the error occurred. Without aggregation, debugging distributed systems is archaeology — piecing together fragments from dig sites you may no longer have access to.
Log aggregation collects logs from every service and container into a centralized, searchable system. The architecture of that system determines whether you can find the needle in the haystack within minutes or spend hours correlating timestamps across terminals.
The Collection Layer
Log collection starts at the source. Each application writes structured logs — JSON, not free-form text — with consistent fields that make searching possible:
// Winston-style structured logger: every line is JSON with shared metadata
const { createLogger, format, transports } = require('winston')

const logger = createLogger({
  format: format.json(),
  defaultMeta: {
    service: 'api-gateway',
    version: process.env.APP_VERSION,
    environment: process.env.NODE_ENV,
  },
  transports: [new transports.Console()],
})

// Inside a request handler: per-request fields ride along with the defaults
logger.info('Request processed', {
  requestId: req.id,
  method: req.method,
  path: req.path,
  statusCode: res.statusCode,
  duration: elapsed,
  userId: req.user?.id,
})
The requestId is the most important field for distributed tracing. When a request enters your system, assign it a unique ID and propagate that ID through every service it touches. Searching for a request ID returns every log line from every service related to that request — this is the difference between "I can debug this" and "I have no idea what happened."
Collection agents run on each host or as sidecar containers. Fluentd, Fluent Bit, and the OpenTelemetry Collector are the standard choices. They read logs from stdout (for containers), files (for traditional deployments), or direct API submission, then forward them to the aggregation layer.
# Fluent Bit configuration for Kubernetes
[INPUT]
    Name              tail
    Path              /var/log/containers/*.log
    Parser            docker
    Tag               kube.*
    Refresh_Interval  5

[FILTER]
    Name                kubernetes
    Match               kube.*
    Merge_Log           On
    K8S-Logging.Parser  On

[OUTPUT]
    Name   es
    Match  *
    Host   elasticsearch
    Port   9200
    Index  logs
    Type   _doc
Fluent Bit is lighter than Fluentd and handles the collection-and-forwarding role well for most deployments. If you need complex log transformation or routing, Fluentd's plugin ecosystem is broader. The OpenTelemetry Collector merges logs with traces and metrics into a single pipeline, which simplifies the infrastructure monitoring stack.
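For comparison, an OpenTelemetry Collector pipeline covering the same tail-and-forward role might look like this. This is a hedged sketch: the `filelog` receiver ships in the Collector's contrib distribution, and the backend endpoint is a placeholder.

```yaml
# Minimal OpenTelemetry Collector config: tail container logs,
# export over OTLP to a backend (endpoint is illustrative).
receivers:
  filelog:
    include: [/var/log/containers/*.log]

exporters:
  otlp:
    endpoint: backend:4317

service:
  pipelines:
    logs:
      receivers: [filelog]
      exporters: [otlp]
```

The same `service.pipelines` block can carry `traces` and `metrics` pipelines alongside `logs`, which is what makes the single-pipeline consolidation possible.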
Storage and Indexing
The aggregation backend stores logs and makes them searchable. The two dominant approaches are:
Elasticsearch (or OpenSearch) — full-text search engine that indexes log fields for fast querying. Elasticsearch handles billions of log lines and returns results in seconds. The operational complexity is its downside — managing cluster health, shard allocation, index lifecycle, and storage costs requires ongoing attention.
Loki — a newer approach from Grafana Labs that stores log lines as compressed chunks and indexes only the metadata labels (service name, environment, pod name). Queries that filter by labels are fast; queries that search within log text are slower. Loki is dramatically cheaper to operate than Elasticsearch because it does not build full-text indexes.
For most teams, Loki provides the right balance. You search by service, time range, and severity level 90% of the time — these are label queries that Loki handles well. The 10% of cases where you need full-text search are slower but still functional.
Retention policies matter for cost. Storing every log line forever is expensive and unnecessary. A common approach: keep the last 7 days at full resolution, aggregate to summary metrics for 30 days, and archive to cold storage for compliance needs. Define the retention policy before you have a storage cost crisis, not after.
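In Loki, the full-resolution retention window from that policy can be expressed with compactor-based retention. A sketch, with illustrative values (verify option names against your Loki version):

```yaml
# Loki retention: compactor deletes chunks older than the limit
compactor:
  retention_enabled: true

limits_config:
  retention_period: 168h   # 7 days at full resolution
```

The 30-day summary metrics and cold-storage archival steps live outside Loki, in your metrics system and object-storage lifecycle rules respectively.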
Structured Logging Standards
The value of aggregated logs depends entirely on their structure. Unstructured log lines like "User 12345 logged in at 2025-09-15" are human-readable but machine-hostile. Structured logs with consistent field names enable filtering, aggregation, and alerting:
{
  "timestamp": "2025-09-15T14:30:00Z",
  "level": "info",
  "service": "auth",
  "message": "User authenticated",
  "userId": "12345",
  "method": "password",
  "duration": 142,
  "requestId": "req_abc123"
}
Establish a logging standard across all services. At minimum, every log line should include: timestamp, level, service, message, and requestId. Beyond that, each service adds domain-specific fields relevant to its operations.
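A standard is only useful if it is enforced. One lightweight approach, sketched here with the field names from the example schema above (the helper is illustrative), is a check you can run in tests or CI against sample log output:

```javascript
// The minimum fields required by the logging standard above.
const REQUIRED_FIELDS = ['timestamp', 'level', 'service', 'message', 'requestId'];

// Returns the names of any required fields missing from a log entry.
// An empty result means the entry conforms to the standard.
function missingFields(entry) {
  return REQUIRED_FIELDS.filter((f) => entry[f] === undefined);
}
```

Wiring this into a unit test for each service's logger catches schema drift before it reaches the aggregation backend.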
Log levels should be consistent and meaningful. error means something failed and needs attention. warn means something unexpected happened but was handled. info means a significant business or operational event occurred. debug is disabled in production unless you are actively investigating an issue.
Do not log sensitive data. User passwords, API keys, credit card numbers, and personal information should never appear in logs. This is a security requirement and often a legal requirement under GDPR or HIPAA. Implement a log sanitizer that strips known sensitive fields before logs leave the application, and review log output during code review. The environment variable discipline that keeps secrets out of code should extend to keeping them out of logs.
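A minimal sanitizer sketch, assuming a deny-list of field names (the list here is illustrative; extend it for your domain, and treat it as a backstop rather than your only defense):

```javascript
// Field names whose values must never reach the log pipeline.
const SENSITIVE_KEYS = new Set(['password', 'apiKey', 'authorization', 'cardNumber', 'ssn']);

// Recursively redact sensitive fields from a log entry before it is
// serialized. Arrays and nested objects are walked; primitives pass through.
function sanitize(value) {
  if (Array.isArray(value)) return value.map(sanitize);
  if (value !== null && typeof value === 'object') {
    const out = {};
    for (const [key, val] of Object.entries(value)) {
      out[key] = SENSITIVE_KEYS.has(key) ? '[REDACTED]' : sanitize(val);
    }
    return out;
  }
  return value;
}
```

Install this as a formatter step in the logger itself, not as an afterthought in each call site, so nothing can bypass it.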
Dashboards and Alerts
Aggregated logs are raw material. Dashboards transform them into operational awareness. The minimum set of log-based dashboards:
Error rate by service — a time series showing error log volume per service. This is your primary alert source. A sudden increase in errors from any service triggers an investigation.
Latency distribution — if you log request duration, plot the p50, p95, and p99 over time. Latency regressions often appear in p99 before they affect p50, giving you early warning.
Top errors — group error logs by message (or error code) and show the most frequent. This identifies recurring issues and helps prioritize fixes.
# Loki query: error rate by service over 5 minutes
sum by (service) (rate({level="error"} [5m]))
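The latency dashboard can be driven from logs the same way, provided requests log a numeric `duration` field as in the schema above. A hedged LogQL sketch (the service label is illustrative):

```
# Loki query: p99 request duration per service, from the duration field
quantile_over_time(0.99,
  {service="api-gateway"} | json | unwrap duration [5m]
) by (service)
```

Swap 0.99 for 0.5 or 0.95 to plot the other percentiles on the same panel.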
Alerts should fire on meaningful thresholds, not on individual log lines. "Error rate exceeds 5% for 3 consecutive minutes" is actionable. "An error log was written" is not — every production system produces some errors. Set alert thresholds based on historical baselines and adjust them as you learn what is normal for your system.
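The "5% for 3 consecutive minutes" rule above can be written as a Loki ruler rule in Prometheus-compatible format. A sketch, assuming the `service` and `level` labels from this article's schema:

```yaml
groups:
  - name: log-alerts
    rules:
      - alert: HighErrorRate
        # error lines as a fraction of all log lines, per service
        expr: |
          sum by (service) (rate({level="error"} [5m]))
            /
          sum by (service) (rate({level=~".+"} [5m])) > 0.05
        for: 3m
        labels:
          severity: page
```

The `for: 3m` clause is what turns a momentary spike into a sustained-threshold alert.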
Connect your log aggregation to your incident response process. When an alert fires, the responder should be able to click through from the alert to the relevant logs, filtered to the time window and service in question. Every click between the alert and the root cause adds response time. The goal is a single click from "something is wrong" to "here are the logs that explain what."