DevOps · 7 min read · October 1, 2025

Infrastructure Monitoring: What to Watch and Why

Set up effective infrastructure monitoring — the key metrics that matter, alerting that does not cause fatigue, dashboards that answer questions, and tool selection.

James Ross Jr.

Strategic Systems Architect & Enterprise Software Developer

Monitoring is one of those areas where more is not better. I have seen teams with 200 alerts that ignore all of them and teams with 8 alerts that catch every real incident. The difference is not tooling — it is knowing what matters. Most infrastructure metrics are noise. A small set of signals tells you whether your system is healthy, degrading, or failing. Everything else is context that helps diagnose problems after you know they exist.

Here is what to monitor, how to alert on it, and how to build dashboards that you actually look at.

The Four Golden Signals

Google's Site Reliability Engineering book identified four signals that capture the health of any service. This framework has held up because it is complete without being overwhelming:

Latency — how long requests take. Track the full distribution, not just the average. An average latency of 200ms can hide a p99 of 5 seconds that affects 1% of users. Monitor p50 (median), p95, and p99 at minimum.
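If you are using Prometheus, these percentiles come from a histogram metric via `histogram_quantile`. A sketch using recording rules, assuming a hypothetical histogram named `http_duration_seconds_bucket` (substitute your own metric and rule names):

```yaml
# Precompute p50/p95/p99 latency as recording rules so dashboards
# and alerts query a cheap, pre-aggregated series.
# Metric and rule names here are placeholders.
groups:
  - name: latency_percentiles
    rules:
      - record: job:http_duration_seconds:p50
        expr: histogram_quantile(0.50, sum(rate(http_duration_seconds_bucket[5m])) by (le))
      - record: job:http_duration_seconds:p95
        expr: histogram_quantile(0.95, sum(rate(http_duration_seconds_bucket[5m])) by (le))
      - record: job:http_duration_seconds:p99
        expr: histogram_quantile(0.99, sum(rate(http_duration_seconds_bucket[5m])) by (le))
```

Aggregating with `sum(...) by (le)` before applying `histogram_quantile` keeps the bucket labels intact while collapsing other dimensions.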

Traffic — how much demand the system is handling. Requests per second for web services, messages per second for queues, queries per second for databases. Traffic gives you context for the other signals — high latency during a traffic spike means something different than high latency during normal load.

Errors — the rate of failed requests. Track both explicit errors (5xx responses, exception counts) and implicit errors (successful responses with incorrect content, timeouts counted as successes). An error rate of 0.1% is typically normal; 1% is concerning; 5% is an incident.

Saturation — how full the system is. CPU utilization, memory usage, disk I/O, open connections, thread pool occupancy. Saturation signals predict problems before they cause failures — 90% CPU utilization is not failing yet but will be soon.

# Prometheus alert rules for the four signals
groups:
  - name: golden_signals
    rules:
      - alert: HighLatency
        # p99 latency above 2 seconds for 5 continuous minutes
        expr: histogram_quantile(0.99, sum(rate(http_duration_seconds_bucket[5m])) by (le)) > 2
        for: 5m
        labels:
          severity: warning

      - alert: HighErrorRate
        # 5xx responses exceed 5% of all requests
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 3m
        labels:
          severity: critical

      - alert: HighCPU
        # node_exporter: fraction of CPU time not spent idle
        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.85
        for: 10m
        labels:
          severity: warning

Every infrastructure alert you create should map to one of these four signals. If it does not, question whether it belongs in alerting at all.

Alert Design That Prevents Fatigue

Alert fatigue kills monitoring effectiveness faster than anything else. When the team receives 50 notifications a day, they stop reading them. When they stop reading them, the critical alert that matters gets ignored along with the noise.

Rules for sustainable alerting:

Alert on symptoms, not causes. Alert on "error rate is above 5%" (symptom), not "CPU is above 70%" (cause). High CPU is not always a problem. High error rate always is. You investigate causes after the symptom alert fires.

Use severity levels consistently. Critical means "someone needs to respond now — users are affected." Warning means "something is degrading and will become critical if unaddressed." Info means "noteworthy but not actionable right now." Critical alerts page on-call. Warnings go to a channel. Info goes to a dashboard.
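The routing side of this policy lives in Alertmanager. A minimal sketch, assuming PagerDuty for paging and Slack for the warnings channel — the receiver names are placeholders you would wire to your own integrations:

```yaml
# Alertmanager routing: criticals page on-call, warnings go to chat.
# Receiver names are placeholders; attach pagerduty_configs /
# slack_configs to them in your real config.
route:
  receiver: default
  routes:
    - matchers:
        - severity="critical"
      receiver: pagerduty-oncall
    - matchers:
        - severity="warning"
      receiver: slack-alerts
receivers:
  - name: default
  - name: pagerduty-oncall
  - name: slack-alerts
```

Routing on the `severity` label means the alert rules themselves never need to know where notifications go — that decision stays in one place.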

Require the alert to persist. The for: 5m clause in Prometheus means the condition must be true for five continuous minutes before firing. This eliminates transient spikes that resolve on their own. A brief CPU spike during a garbage collection cycle is not an incident.

Every alert must have a runbook. When the alert fires, the responder should know what to check first, what to look at second, and when to escalate. An alert without a runbook is a puzzle, and puzzles are slower to solve at 3 AM. Link the runbook directly in the alert notification.
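In Prometheus, the runbook link belongs in the alert's annotations, where Alertmanager templates can surface it in the notification. A sketch (the runbook URL is a placeholder):

```yaml
# Attach the runbook directly to the alert so the notification
# carries it. The runbook_url value is a placeholder.
- alert: HighErrorRate
  expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
  for: 3m
  labels:
    severity: critical
  annotations:
    summary: "Error rate above 5% for 3 minutes"
    runbook_url: "https://wiki.example.com/runbooks/high-error-rate"
```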

If an alert fires and the correct response is always "ignore it," delete the alert. If an alert fires and the correct response is always the same remediation, automate the remediation and downgrade the alert to info.

Dashboard Design

Dashboards are for continuous awareness, not for incident response. The team should glance at the dashboard during the day and immediately understand whether things are healthy. This means the dashboard must be scannable in under 10 seconds.

The recommended layout for a service dashboard:

Top row — key business metrics: active users, request rate, revenue (if applicable). These provide context for everything below.

Second row — the four golden signals for the primary service. Color-coded thresholds: green is healthy, yellow is warning, red is critical.

Third row — infrastructure saturation: CPU, memory, disk, connections. These are the leading indicators of future problems.

Bottom — recent deployments overlaid on the metric graphs. Correlating metric changes with deployments is the fastest way to identify deployment-related issues.

┌─────────────────────────────────────────┐
│ Active Users: 1,234 │ RPS: 450 │ ...│
├─────────────────────────────────────────┤
│ Latency p99: 180ms │ Error: 0.1% │
│ [graph over time] │ [graph] │
├─────────────────────────────────────────┤
│ CPU: 45% │ Memory: 62% │ Disk: 30% │
│ [graph] │ [graph] │ [graph] │
├─────────────────────────────────────────┤
│ Deployment markers on timeline │
└─────────────────────────────────────────┘

Avoid dashboards with 30 panels. They become wallpaper — always visible, never read. Five to eight panels per dashboard, focused on one service or one user journey. Create separate dashboards for different concerns rather than one dashboard that covers everything.

Tool Selection

The monitoring ecosystem has consolidated around a few stacks:

Prometheus + Grafana — the open-source standard. Prometheus scrapes metrics from your services, stores them as time series, and evaluates alert rules. Grafana visualizes the data and provides dashboards. This stack is free, widely supported, and runs on your own infrastructure.
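Getting started takes little configuration. A minimal `prometheus.yml` sketch — the job name, target address, and rule file name are placeholders:

```yaml
# Minimal Prometheus config: scrape one service's /metrics endpoint
# every 15 seconds and load alert rules from a file.
# Target address and file names are placeholders.
global:
  scrape_interval: 15s
  evaluation_interval: 15s
rule_files:
  - alert_rules.yml
scrape_configs:
  - job_name: my-service
    static_configs:
      - targets: ["my-service:8080"]
```

Grafana then points at Prometheus as a data source and builds dashboards on top of the same queries the alert rules use.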

Datadog / New Relic / Dynatrace — commercial platforms that provide metrics, logs, traces, and alerting in a single product. Higher cost, lower operational burden. The value is in the correlation features — clicking from a metric anomaly to the related logs to the distributed trace that explains the root cause.

Cloud-native tools — CloudWatch (AWS), Cloud Monitoring (GCP), Azure Monitor. Tight integration with their respective cloud services, limited cross-cloud support. Good enough for teams running entirely on one cloud provider.

For most teams, Prometheus + Grafana is the right starting point. It is free, the community knowledge base is extensive, and it integrates with every major container orchestration platform. Migrate to a commercial platform when the operational burden of self-hosting monitoring becomes a significant time drain, or when correlation features would meaningfully reduce incident response time.

The best monitoring setup is the one your team actually uses. A simple dashboard that gets checked daily is worth more than an elaborate monitoring system that nobody looks at. Start with the four golden signals, alert on real problems, and add complexity only when existing monitoring fails to catch an issue you should have seen coming.