DevOps · 7 min read · March 3, 2026

Production Monitoring: The Metrics That Actually Tell You Something Is Wrong

Cut through monitoring noise with metrics that matter — error rates, latency percentiles, saturation, and traffic patterns that surface real production problems.

James Ross Jr.

Strategic Systems Architect & Enterprise Software Developer

Most teams I work with are over-monitored and under-observant. They have dashboards full of metrics — CPU usage, memory consumption, disk I/O, network bytes — and then something catastrophic happens and none of those metrics told them anything useful beforehand. The database connection pool saturated silently. The background job queue backed up for six hours. Users were getting 503 errors while every server health check showed green.

The problem is not the absence of monitoring. It is monitoring the wrong things. Let me tell you what I actually watch in production and why.

The Four Golden Signals

Google's Site Reliability Engineering book defined four signals worth measuring for every production service. A decade on, this framework is still the best starting point I know.

Latency — how long requests take to process. The crucial detail is measuring this correctly: track the latency of successful requests separately from failed requests. A spike in error rate with fast error responses can make your average latency look healthy while users are experiencing failures. Percentiles matter more than averages — p99 latency tells you what your slowest 1% of users experience. That is often the number that predicts support tickets.
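
To make the percentile point concrete, here is a minimal sketch of the nearest-rank percentile calculation. It is illustrative only: production systems use histogram buckets rather than storing raw samples, and the numbers here are made up.

```typescript
// Nearest-rank percentile over raw latency samples. Illustrative only:
// real metrics pipelines use histogram buckets, not stored samples.
function percentile(samplesMs: number[], p: number): number {
  if (samplesMs.length === 0) throw new Error("no samples");
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, rank)];
}

// 98 fast requests and 2 slow outliers: the average hides the outliers,
// p99 surfaces them.
const samples = [...Array(98).fill(20), 1500, 2000];
const avg = samples.reduce((a, b) => a + b, 0) / samples.length;
// avg is about 55 ms and looks healthy; p99 is 1500 ms, which is what
// the slowest users (and your support queue) actually experience.
```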

Traffic — how much demand your service is handling. Requests per second for HTTP services, messages per second for queues, queries per second for databases. Traffic is the demand signal. When combined with latency and errors, traffic tells you whether a degradation is correlated with load or happening regardless of load.

Errors — the rate of requests that fail. Track explicit failures (5xx HTTP responses) separately from implicit failures (200 responses with error payloads, timeouts that resolve with empty data). Many error conditions masquerade as successes at the protocol level.
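
The explicit/implicit split can be made mechanical. A hedged sketch, assuming a JSON payload with an optional error field (your payload shape will differ):

```typescript
// Classify a response: a 200 carrying an error payload is still a failure.
// The body shape here is an assumption for illustration.
interface ApiResult {
  status: number;
  body: { error?: string };
}

function classifyResult(
  r: ApiResult
): "success" | "explicit_failure" | "implicit_failure" {
  if (r.status >= 500) return "explicit_failure"; // 5xx: count directly
  if (r.status < 400 && r.body.error) return "implicit_failure"; // masquerading success
  return "success";
}
```

Count both failure kinds in your error-rate metric, tagged so you can tell them apart.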

Saturation — how full your service is. CPU and memory are the obvious ones, but more important for most application servers are: database connection pool utilization, open file descriptor count, thread pool queue depth. A service operating at 70% of its connection pool limit needs attention before the pool exhausts.
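
As a sketch of that saturation rule, with a hypothetical pool-stats shape (map it from your driver; node-postgres, for example, exposes totalCount and waitingCount on its Pool):

```typescript
// Saturation check for a connection pool. PoolStats is a hypothetical shape;
// the 70% and 90% thresholds are illustrative.
interface PoolStats {
  open: number;    // connections currently open
  max: number;     // configured pool limit
  waiting: number; // callers queued waiting for a connection
}

function poolSaturation(s: PoolStats): "ok" | "warn" | "critical" {
  const utilization = s.open / s.max;
  if (s.waiting > 0 || utilization >= 0.9) return "critical"; // exhaustion imminent
  if (utilization >= 0.7) return "warn"; // needs attention before the pool exhausts
  return "ok";
}
```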

Alert on golden signals, not infrastructure metrics. CPU usage is a poor predictor of user-visible problems. Error rate is an excellent predictor.

Setting Alert Thresholds That Mean Something

Bad alerting is worse than no alerting. Alert fatigue — where your on-call rotation ignores alerts because they fire constantly and are almost always false positives — is a genuine organizational problem. It means the alert that matters gets ignored along with the noise.

Alert on symptoms, not causes. "Error rate above 1% for 5 minutes" is a symptom alert. "CPU above 80%" is a cause alert. Cause alerts require you to make a judgment about whether this CPU spike will cause user-visible problems. Symptom alerts tell you user-visible problems are already happening.

Set your alert thresholds based on observed baselines, not guesses. Instrument your system for a week, establish what normal looks like, and set alerts at meaningful deviations. A p99 latency spike to 3 seconds means something very different for a batch processing service than for a payment API.
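
One hedged way to turn a week of observations into a threshold is mean plus a few standard deviations. The multiplier k is a judgment call, not a standard; tune it against your false-positive tolerance:

```typescript
// Derive an alert threshold from an observed baseline: mean + k std devs.
// k = 3 is illustrative, not a recommendation.
function baselineThreshold(samples: number[], k = 3): number {
  const mean = samples.reduce((a, b) => a + b, 0) / samples.length;
  const variance =
    samples.reduce((a, b) => a + (b - mean) ** 2, 0) / samples.length;
  return mean + k * Math.sqrt(variance);
}
```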

Use multi-condition alerts where appropriate. A single machine showing high CPU is probably fine. All machines showing high CPU simultaneously is a serious event. Your alerting system should be able to express this distinction.
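
A sketch of that fleet-wide distinction in code; the 80% threshold and the return labels are illustrative:

```typescript
// One hot host is noise; every host hot at once is an incident.
interface HostSample {
  host: string;
  cpuPercent: number;
}

function fleetCpuAlert(
  samples: HostSample[],
  threshold = 80
): "none" | "investigate" | "page" {
  const hot = samples.filter((s) => s.cpuPercent > threshold).length;
  if (hot === 0) return "none";
  if (hot === samples.length) return "page"; // fleet-wide: load, deploy, or dependency
  return "investigate"; // isolated host: look later, do not wake anyone
}
```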

What to Actually Instrument

Every API endpoint needs latency and status code tracking. In Node.js with Express, a middleware handles this:

import { Request, Response, NextFunction } from "express";
import { metrics } from "./metrics"; // your metrics client

export function httpMetricsMiddleware(
  req: Request,
  res: Response,
  next: NextFunction
): void {
  const start = process.hrtime.bigint();

  res.on("finish", () => {
    const duration = Number(process.hrtime.bigint() - start) / 1e6; // ms
    metrics.histogram("http.request.duration", duration, {
      method: req.method,
      route: req.route?.path ?? "unknown",
      status: String(res.statusCode),
    });
    metrics.increment("http.requests.total", {
      method: req.method,
      route: req.route?.path ?? "unknown",
      status: String(res.statusCode),
    });
  });

  next();
}

Tag every metric with route and status code. This lets you identify which specific endpoints are slow or erroring, not just that something is wrong somewhere in your service.

For database queries, track query duration and error rate at the per-query level. Most ORM-level slow query logs capture this, but streaming it to your metrics system lets you correlate database degradation with API latency spikes.
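
A sketch of per-query instrumentation, reusing the histogram shape from the middleware above; the wrapper name and tag names are assumptions:

```typescript
// Wrap any query runner with duration and error tracking. The histogram
// callback mirrors the metrics client used in the HTTP middleware above.
type Histogram = (
  name: string,
  value: number,
  tags: Record<string, string>
) => void;

async function timedQuery<T>(
  queryName: string,
  run: () => Promise<T>,
  histogram: Histogram
): Promise<T> {
  const start = process.hrtime.bigint();
  let status = "ok";
  try {
    return await run();
  } catch (err) {
    status = "error";
    throw err; // rethrow: instrumentation must never swallow failures
  } finally {
    const ms = Number(process.hrtime.bigint() - start) / 1e6;
    histogram("db.query.duration", ms, { query: queryName, status });
  }
}
```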

For background jobs, track queue depth, job processing time, and job failure rate. A job queue that is growing is a latent problem. A job queue that is growing while failure rate is climbing is an active incident.
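
That latent-versus-active distinction can be expressed directly. The snapshot shape and the 5% failure threshold are illustrative:

```typescript
// Compare two queue snapshots: growth alone is latent, growth plus
// failures is an active incident. Thresholds are illustrative.
interface QueueSnapshot {
  depth: number;       // jobs waiting
  failureRate: number; // 0..1 over the sample window
}

function queueStatus(
  prev: QueueSnapshot,
  curr: QueueSnapshot
): "healthy" | "latent" | "incident" {
  const growing = curr.depth > prev.depth;
  if (growing && curr.failureRate > 0.05) return "incident";
  if (growing) return "latent"; // demand is outpacing workers
  return "healthy";
}
```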

Synthetic Monitoring for External Validation

Internal metrics tell you what your servers observe. Synthetic monitoring tells you what users experience. These are different things.

A synthetic monitor makes real HTTP requests to your production endpoints from external locations on a schedule — every minute or every five minutes. When the request fails or takes longer than your threshold, you alert. This catches situations your internal monitoring misses: your application is healthy but a DNS failure is preventing users from reaching it, your CDN is serving a cached error page, your TLS certificate expired.

For simple HTTP checks, services like Checkly, Better Uptime, or Freshping cost under $30/month and provide meaningful coverage. Set up checks for your homepage, your most critical API endpoints, and your health check endpoint. Verify response content, not just HTTP status — a 200 response with "Service Unavailable" in the body has happened to me.
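
The content-verification step is worth spelling out. This sketch separates the verdict logic from the fetch so the "200 with an error body" case is explicit; parameter names and thresholds are illustrative:

```typescript
// Verdict for one synthetic probe. A real monitor would fetch() the URL on
// a schedule and feed status, body, and elapsed time in here.
function syntheticVerdict(
  status: number,
  body: string,
  mustContain: string,
  elapsedMs: number,
  slaMs: number
): "pass" | "fail" {
  if (status !== 200) return "fail";
  if (!body.includes(mustContain)) return "fail"; // catches cached error pages
  if (elapsedMs > slaMs) return "fail";           // slow enough to count as down
  return "pass";
}
```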

Log-Based Alerting

Metrics are great for quantitative signals. Logs are essential for qualitative ones. Some error conditions only become visible when you look at what is being logged.

Structure your logs (covered in depth in the structured logging article) and ship them to a searchable log management tool: Datadog, Grafana Loki, Axiom, or Elasticsearch. Create alerts based on log patterns:

  • More than 10 occurrences of "payment failed" in 5 minutes
  • Any occurrence of "database connection refused"
  • Authentication failure rate above a baseline (potential credential stuffing attack)

Log-based alerts catch the errors your code logs but does not count as HTTP errors. Your application might return a 200 with an empty dataset when the database query fails silently. The log line "Query returned zero results: expected non-empty" is the signal.
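
The "N occurrences in M minutes" rule is evaluated by the log platform, but the logic itself is simple, as this sketch shows:

```typescript
// "More than `threshold` matching log lines in the trailing window."
// The log platform runs this continuously; shown here for clarity.
function patternAlertFiring(
  matchTimestampsMs: number[],
  nowMs: number,
  windowMs: number,
  threshold: number
): boolean {
  const inWindow = matchTimestampsMs.filter(
    (t) => nowMs - t <= windowMs
  ).length;
  return inWindow > threshold;
}
```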

Dashboards That Communicate at a Glance

A good monitoring dashboard answers "is my service healthy right now" in under three seconds. If you need to study the dashboard to determine system health, the dashboard is too complex.

My standard production dashboard has four panels at the top: current error rate, p99 latency (past hour), current requests per second, and current active connections or thread pool utilization. Below that, breakdowns by endpoint. Together, those top-line numbers tell me whether I have a problem and roughly where it is.

Avoid dashboards that show metrics without context. A CPU graph means nothing without a historical baseline. Show the current value alongside a 24-hour sparkline. Show alert thresholds as horizontal lines on the chart. Make normal obvious so abnormal is immediately recognizable.

The On-Call Reality

Monitoring is only useful if someone acts on it. Set up your alerting so that pages go to the person who can actually address them, at times when they can actually address them. Schedule-based on-call rotation, clear escalation paths, and documented runbooks for common alerts are as important as the metrics themselves.

Review your alert history monthly. Alerts that fired but required no action are candidates for tuning. Incidents that happened without an alert firing indicate monitoring gaps. Continuous improvement of your monitoring coverage is ongoing operational work, not a one-time setup task.


Struggling to make sense of your monitoring setup or alert on what actually matters? Let's build a monitoring strategy that fits your system. Book a call at https://calendly.com/jamesrossjr.

