Error Handling Patterns for Production Applications

Why Error Handling Is an Architecture Problem

Error handling is one of those concerns that every developer acknowledges as important and few approach systematically. The default pattern is reactive: something breaks in production, the team adds a try/catch, maybe some logging, and moves on. Over time, this produces a codebase with inconsistent error handling where some errors are caught and silently swallowed, others crash the process, and most land somewhere in between — logged with insufficient context and surfaced to users with generic messages that help no one.

The problem is that error handling is not a local concern that can be addressed function by function. It's an architectural concern that spans the entire application. How errors are categorized, how they propagate between layers, what information gets logged, what the user sees, and how the system recovers — these are system-level decisions that require system-level design.

Production applications need an error handling strategy: a set of conventions that every developer on the team follows, producing consistent behavior that operators can predict and users can understand.

Categorizing Errors by Response

Not all errors require the same response, and the biggest mistake in error handling is treating them as if they do. A useful categorization separates errors into three groups based on what should happen when they occur.

Operational errors are expected failures that occur during normal operation. A database connection timeout, a failed HTTP request to a third-party service, user input that fails validation, a file that doesn't exist — these are not bugs. They're conditions the application should anticipate and handle gracefully. The response is typically to retry, fall back to a default, return a meaningful error message to the user, or some combination.
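As an illustration of the retry response, here is a minimal sketch of a retry helper. The name `retryOperational`, the `isRetryable` predicate, and the linear backoff are assumptions made for this example, not a prescribed design. Non-retryable errors, such as bugs, are rethrown immediately rather than retried.

```typescript
// Retry an operation that can fail for operational reasons (timeouts, flaky networks).
// Hypothetical helper: the name, signature, and linear backoff are illustrative choices.
async function retryOperational<T>(
  operation: () => Promise<T>,
  isRetryable: (err: unknown) => boolean,
  maxAttempts = 3,
  baseDelayMs = 100,
): Promise<T> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (err) {
      // Non-retryable failures and the final failed attempt propagate to the caller.
      if (!isRetryable(err) || attempt === maxAttempts) {
        throw err;
      }
      // Wait a little longer after each failed attempt.
      await new Promise<void>((resolve) => setTimeout(resolve, baseDelayMs * attempt));
    }
  }
  // Only reachable when maxAttempts is less than 1.
  throw new Error("maxAttempts must be at least 1");
}
```

The helper never swallows the final failure: once the attempts are exhausted, the caller still receives the original error and can fall back to a default or report it.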

Programmer errors are bugs. A null reference, an index out of bounds, a type assertion that fails — these indicate a mistake in the code that needs to be fixed. The response is typically to fail fast, log comprehensive diagnostic information, and alert the development team. Attempting to recover from programmer errors is usually counterproductive because the application is in an unexpected state.

Infrastructure errors are environmental failures — disk full, out of memory, network partition. These require operational response rather than code-level handling. The application should detect them, report them clearly, and either degrade gracefully or shut down cleanly. Attempting to handle out-of-memory errors in application code is generally futile.

This categorization matters because each category requires different handling. Wrapping everything in a generic try/catch that logs an error and returns a 500 response treats a validation failure the same as a null pointer exception, which helps neither the user nor the developer.
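One way to make the three categories explicit in code is a small classifier that maps an error to a category, and a category to a response. This is a sketch only: `OperationalError`, `categorize`, and `respond` are hypothetical names, and detecting infrastructure failures through Node-style `code` values such as `ENOSPC` is an assumption about the runtime.

```typescript
type ErrorCategory = "operational" | "programmer" | "infrastructure";

// Hypothetical marker class for expected failures (timeouts, invalid input, missing files).
class OperationalError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "OperationalError";
    // Keeps instanceof working when compiling to older JavaScript targets.
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

// Assumed set of system error codes that signal an environmental failure.
const INFRASTRUCTURE_CODES = new Set(["ENOSPC", "EMFILE", "ENOMEM"]);

function categorize(err: unknown): ErrorCategory {
  if (err instanceof OperationalError) return "operational";
  const code = (err as { code?: unknown } | null)?.code;
  if (typeof code === "string" && INFRASTRUCTURE_CODES.has(code)) {
    return "infrastructure";
  }
  // Anything unrecognized (TypeError, RangeError, ...) is treated as a bug.
  return "programmer";
}

function respond(category: ErrorCategory): string {
  switch (category) {
    case "operational":
      return "retry, fall back, or return a specific message";
    case "programmer":
      return "fail fast, log diagnostics, alert the developers";
    case "infrastructure":
      return "report clearly, then degrade or shut down";
  }
}
```

Treating unknown errors as programmer errors is a deliberately conservative default: an error nobody anticipated is more likely a bug than a condition the code was designed to handle.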

Error Propagation Strategies

How errors travel from the point of occurrence to the point of handling determines the quality of your error handling. Two anti-patterns dominate.

Swallowing errors — catching an exception and doing nothing with it — is the worst anti-pattern. The operation failed, but nothing in the system knows about it. The calling code assumes success and continues with invalid state. The user sees confusing behavior with no error message. The logs show nothing. I've debugged production systems where critical failures were invisible for weeks because someone wrote catch (e) {} during development and never revisited it.

Over-catching (wrapping every function call in try/catch) creates noise and obscures the actual error flow. When errors are caught and re-thrown at every layer with a different message, the original error and its stack trace end up buried under generic "something went wrong" wrappers, and the one log entry that would help diagnose the problem is hard to find.

The effective pattern is to let errors propagate naturally to the boundary where they can be handled appropriately. In a web application, that typically means validation errors are handled in the controller layer (returning 400 responses with specific messages), business logic errors are handled in the service layer (returning domain-specific errors), and unexpected errors are caught by a global error handler that logs the full context and returns a safe response to the user.
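The flow can be sketched without any web framework. In the example below the service and controller functions contain no try/catch; errors travel up to a single boundary function that turns them into responses. For brevity the sketch does all the translation in that one function. All names (`ValidationError`, `DomainError`, `handleAtBoundary`) and the choice of status 409 for the domain error are assumptions for illustration.

```typescript
interface HttpResponse {
  status: number;
  body: unknown;
}

abstract class BaseError extends Error {
  constructor(message: string) {
    super(message);
    this.name = new.target.name;
    Object.setPrototypeOf(this, new.target.prototype);
  }
}
class ValidationError extends BaseError {}
class DomainError extends BaseError {}

// Service layer: raises a domain-specific error and catches nothing.
function reserveStock(available: number, requested: number): number {
  if (requested > available) {
    throw new DomainError(`Only ${available} items in stock`);
  }
  return available - requested;
}

// Controller layer: validates input, then calls the service without try/catch.
function createOrder(quantity: unknown): { remaining: number } {
  if (typeof quantity !== "number" || !Number.isInteger(quantity) || quantity < 1) {
    throw new ValidationError("quantity must be a positive integer");
  }
  return { remaining: reserveStock(5, quantity) };
}

// Boundary: the single place where errors become responses.
function handleAtBoundary(action: () => unknown, log: (err: unknown) => void): HttpResponse {
  try {
    return { status: 200, body: action() };
  } catch (err) {
    if (err instanceof ValidationError) {
      return { status: 400, body: { error: err.message } };
    }
    if (err instanceof DomainError) {
      return { status: 409, body: { error: err.message } };
    }
    // Unexpected error: keep the details in the logs, not in the response.
    log(err);
    return { status: 500, body: { error: "Internal server error" } };
  }
}
```

Because nothing below the boundary catches and re-wraps, the error that reaches the log is the original one, with its original stack trace.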

Define custom error types for your domain. Instead of throwing generic errors with string messages, create error classes that carry structured information: error code, user-facing message, internal diagnostic details, and HTTP status code. This makes error handling in consuming code predictable and type-safe. That matters especially in TypeScript's strict mode, where a caught value is typed as unknown and has to be narrowed (for example with instanceof) before any of its fields can be read.
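A minimal sketch of such an error type might look like the following. The class names, the field names, and the `ORDER_NOT_FOUND` code are hypothetical; the point is only that each error carries the four pieces of information listed above, and that consuming code narrows with `instanceof` before reading them.

```typescript
interface AppErrorOptions {
  code: string;
  userMessage: string;
  httpStatus: number;
  details?: Record<string, unknown>;
}

// Hypothetical base class for all domain errors in the application.
class AppError extends Error {
  readonly code: string;
  readonly userMessage: string;
  readonly httpStatus: number;
  readonly details: Record<string, unknown>;

  constructor(internalMessage: string, options: AppErrorOptions) {
    super(internalMessage);
    this.name = new.target.name;
    // Keeps instanceof working when compiling to older JavaScript targets.
    Object.setPrototypeOf(this, new.target.prototype);
    this.code = options.code;
    this.userMessage = options.userMessage;
    this.httpStatus = options.httpStatus;
    this.details = options.details ?? {};
  }
}

class OrderNotFoundError extends AppError {
  constructor(orderId: string) {
    super(`Order ${orderId} not found`, {
      code: "ORDER_NOT_FOUND",
      userMessage: "We could not find that order.",
      httpStatus: 404,
      details: { orderId },
    });
  }
}

// Consuming code narrows the unknown value before touching any field.
function toClientPayload(err: unknown): { status: number; code: string; message: string } {
  if (err instanceof AppError) {
    return { status: err.httpStatus, code: err.code, message: err.userMessage };
  }
  return { status: 500, code: "INTERNAL", message: "Something went wrong on our side." };
}
```

The internal message and `details` stay on the error object for logging, while only `code` and `userMessage` are exposed to the client.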

Logging Errors for Debuggability

The purpose of error logging is to give the person investigating a production issue enough information to understand what happened without needing to reproduce it. This requires more context than most developers provide.

Log the error message and stack trace, obviously. But also log the input that triggered the error, the state of relevant variables, the user or request that was being processed, and the operation that was being attempted. The difference between "TypeError: Cannot read property 'id' of undefined" and "TypeError while processing order creation for user 12345: Cannot read property 'id' of undefined (payload: {items: ..., shipping: null})" is the difference between a mystery and a diagnosis.

Use structured logging (JSON) rather than free-text log messages. Structured logs can be searched, filtered, and aggregated by machines. A structured log entry with fields for level, error_code, user_id, request_id, and stack_trace is far more useful for investigation than a string that concatenates those values with spaces, because each field can be queried on its own.
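To make the last two points concrete, here is a sketch of a function that combines an error with its surrounding context into one JSON log line. `formatErrorLog` and the `ErrorContext` shape are assumptions for this example. A real system would likely redact sensitive input before logging it, and `JSON.stringify` throws on circular structures, so the `input` field is assumed to be plain data.

```typescript
interface ErrorContext {
  operation: string;
  errorCode?: string;
  userId?: string;
  requestId?: string;
  input?: unknown; // assumed to be plain, non-sensitive, JSON-serializable data
}

function formatErrorLog(err: unknown, context: ErrorContext): string {
  // Non-Error values can be thrown too, so normalize before reading fields.
  const error = err instanceof Error ? err : new Error(String(err));
  return JSON.stringify({
    level: "error",
    timestamp: new Date().toISOString(),
    error_code: context.errorCode ?? "UNKNOWN",
    error_name: error.name,
    message: error.message,
    stack_trace: error.stack,
    operation: context.operation,
    user_id: context.userId,
    request_id: context.requestId,
    input: context.input,
  });
}

// Usage: the order-creation failure from the text, with its context attached.
const entry = formatErrorLog(
  new TypeError("Cannot read property 'id' of undefined"),
  {
    operation: "order.create",
    userId: "12345",
    input: { items: [], shipping: null },
  },
);
```

Fields that are not supplied, such as `request_id` in the usage above, are simply omitted from the output, because `JSON.stringify` skips properties whose value is `undefined`.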

Correlate logs across the request lifecycle using a request ID. When a single user request touches multiple services or passes through multiple middleware layers, a shared request ID lets you trace the complete journey and see exactly where the error originated. This is essential for debugging distributed systems and becomes non-negotiable as your architecture grows beyond a single process.
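A simple way to get this within one process is a per-request logger that stamps every entry with the same ID. The sketch below assumes the ID arrives in an `x-request-id` header, which is a common convention rather than a standard. `resolveRequestId`, `createRequestLogger`, and the injected `generateId` function are hypothetical names.

```typescript
type LogFields = Record<string, unknown>;
type Sink = (line: string) => void;

// Reuse the caller's ID when present so the trace continues across services.
function resolveRequestId(
  headers: Record<string, string | undefined>,
  generateId: () => string,
): string {
  return headers["x-request-id"] || generateId();
}

// Every entry written through this logger carries the same request_id.
function createRequestLogger(requestId: string, sink: Sink) {
  const write = (level: string, message: string, fields: LogFields = {}): void => {
    // request_id comes last so that caller-supplied fields cannot overwrite it.
    sink(JSON.stringify({ level, message, ...fields, request_id: requestId }));
  };
  return {
    info: (message: string, fields?: LogFields) => write("info", message, fields),
    error: (message: string, fields?: LogFields) => write("error", message, fields),
  };
}
```

The logger is created once at the edge of the request and passed to each layer. When calling another service, forwarding the same ID in the outgoing request lets that service's logs join the trace. In Node.js, `AsyncLocalStorage` from `node:async_hooks` is an alternative that carries the ID implicitly instead of threading a logger through every call.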

Recovery and User Communication

When an error occurs, the user needs to know three things: that something went wrong, whether their action succeeded or failed, and what they should do next. Most error messages answer none of these usefully: they show either a stack trace (too much information, none of it actionable for the user) or "An error occurred" (too little to say what failed or what to do about it).

Design error messages for the user's context. A payment processing error should tell the user whether they were charged. A form submission error should preserve the form data so the user doesn't have to re-enter it. A file upload error should explain whether they should retry or contact support.
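The payment case can be sketched as a message type that forces all three answers to be present. The failure stages, field names, and wording below are invented for illustration; a real payment integration would define its own.

```typescript
// Hypothetical failure stages for a payment flow.
type PaymentFailureStage = "declined" | "charged_but_order_failed" | "gateway_timeout";

// Each message must state what happened, the outcome, and the next step.
interface UserFacingError {
  whatWentWrong: string;
  wereYouCharged: "no" | "yes" | "unknown";
  nextStep: string;
}

function describePaymentFailure(stage: PaymentFailureStage): UserFacingError {
  switch (stage) {
    case "declined":
      return {
        whatWentWrong: "Your card was declined.",
        wereYouCharged: "no",
        nextStep: "Check your card details or try a different payment method.",
      };
    case "charged_but_order_failed":
      return {
        whatWentWrong: "We received your payment but could not create the order.",
        wereYouCharged: "yes",
        nextStep: "Do not pay again. Contact support to complete or refund the order.",
      };
    case "gateway_timeout":
      return {
        whatWentWrong: "The payment provider did not respond in time.",
        wereYouCharged: "unknown",
        nextStep: "Check your order history in a few minutes before retrying.",
      };
  }
}
```

Because the three fields are required by the type, a new failure stage cannot be added without someone deciding what the user should be told about it.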

Implement graceful degradation where possible. If a non-critical feature fails — a recommendation engine, an analytics tracker, a social login provider — the core application should continue functioning. This requires designing your error handling to distinguish between critical and non-critical failures and respond proportionally. A production application that crashes because the analytics service is down is not resilient; it's fragile in ways that compound under real-world conditions.
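One way to express the critical versus non-critical distinction is a wrapper that is applied only to non-critical calls. In this sketch, `withFallback`, `loadProductPage`, and the `report` callback are hypothetical names. The product lookup is treated as critical and left unwrapped, while the recommendation lookup falls back to an empty list.

```typescript
type Report = (feature: string, err: unknown) => void;

// Run a non-critical task; on failure, report the error and return the fallback.
async function withFallback<T>(
  feature: string,
  task: () => Promise<T>,
  fallback: T,
  report: Report,
): Promise<T> {
  try {
    return await task();
  } catch (err) {
    // The failure is still reported, so it is degraded rather than swallowed.
    report(feature, err);
    return fallback;
  }
}

interface ProductPage {
  product: { name: string };
  recommendations: string[];
}

async function loadProductPage(
  fetchProduct: () => Promise<{ name: string }>,
  fetchRecommendations: () => Promise<string[]>,
  report: Report,
): Promise<ProductPage> {
  // Critical: no wrapper, so a failure here propagates to the caller.
  const product = await fetchProduct();
  // Non-critical: the page still renders without recommendations.
  const recommendations = await withFallback<string[]>(
    "recommendations",
    fetchRecommendations,
    [],
    report,
  );
  return { product, recommendations };
}
```

The decision about what counts as critical is made visibly at the call site, which keeps it reviewable instead of hiding it inside a generic catch block.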