Architecture · 8 min read · September 22, 2025

Batch Processing Architecture for Large-Scale Data

Real-time isn't always the answer. Here's how to design batch processing systems that handle large data volumes reliably, with patterns for recovery, monitoring, and scale.

James Ross Jr.

Strategic Systems Architect & Enterprise Software Developer

Not Everything Needs to Be Real-Time

The industry has a real-time bias. Stream processing, event-driven architectures, WebSocket updates, sub-second latency targets. These are powerful patterns for the problems they solve. But a surprising number of business-critical operations don't need real-time processing and are actually better served by well-designed batch systems.

Payroll runs. End-of-day financial reconciliation. Report generation. Data warehouse loading. Bulk notifications. Invoice generation. These are operations that process large volumes of data on a schedule, where throughput matters more than latency and reliability matters more than speed.

The problem is that batch processing doesn't get the same architectural attention as real-time systems. Teams often implement batch jobs as cron-triggered scripts with minimal error handling, no monitoring, and no recovery mechanism. When these jobs fail at 2 AM processing 500,000 records, the on-call engineer is left reverse-engineering a script to figure out where it stopped and how to resume.

Good batch architecture is boring on purpose. It's predictable, observable, recoverable, and testable.


Core Patterns for Reliable Batch Processing

Chunk-based processing. Never process an entire dataset as a single unit of work. Break the input into chunks — 100 records, 1,000 records, whatever size allows each chunk to complete in a reasonable time and be committed independently. If a batch job processing 200,000 invoices fails at record 150,001, chunk-based processing means you've already committed the first 150,000 and only need to retry the current chunk.

The chunk size involves a tradeoff. Smaller chunks mean more frequent commits and finer-grained recovery, but higher overhead from transaction management and progress tracking. Larger chunks mean less overhead but coarser recovery. For most enterprise workloads, chunks of 500 to 2,000 records hit the sweet spot.
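A minimal sketch of chunked processing, assuming list-shaped input; the `process_chunk` callback stands in for whatever per-chunk work and commit your job does:

```python
def chunked(records, size=1000):
    """Yield successive fixed-size chunks from a list of records."""
    for start in range(0, len(records), size):
        yield records[start:start + size]

def run_job(records, process_chunk, size=1000):
    """Process records chunk by chunk; each chunk is committed independently,
    so a failure only loses progress on the current chunk."""
    completed = 0
    for chunk in chunked(records, size):
        process_chunk(chunk)  # real jobs commit inside this call, per chunk
        completed += len(chunk)
    return completed
```

A failure partway through leaves `completed` chunks durably committed; only the in-flight chunk needs to be retried.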

Idempotent operations. Every operation in a batch job should be safe to retry. If you're generating invoices, running the job twice for the same input should not create duplicate invoices. This means either checking for existing output before creating new records, or using deterministic identifiers that make duplicate writes a no-op.

Idempotency is what makes recovery simple. If a job fails and you restart it, idempotent operations mean you can re-process records that may have already been processed without corrupting data.
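One way to get the deterministic-identifier flavor of idempotency, sketched against an in-memory dict standing in for the database (the `invoice_id` scheme and field names are illustrative):

```python
import hashlib

def invoice_id(customer_id, billing_period):
    """Deterministic ID: the same input always yields the same ID, so a
    re-run attempts the same key instead of minting a new one."""
    key = f"{customer_id}:{billing_period}".encode()
    return hashlib.sha256(key).hexdigest()[:16]

def upsert_invoice(store, customer_id, billing_period, amount):
    """Insert-if-absent keyed on the deterministic ID; retries are a no-op."""
    iid = invoice_id(customer_id, billing_period)
    if iid not in store:  # check-before-create makes the write idempotent
        store[iid] = {"customer": customer_id,
                      "period": billing_period,
                      "amount": amount}
    return iid
```

In a real database, the same effect usually comes from a unique constraint on the deterministic key plus an insert-if-not-exists.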

Progress tracking and checkpointing. The batch system should persistently track which chunks have been completed. When a job restarts after failure, it reads the checkpoint and resumes from where it left off. This tracking belongs in a database, not in memory or log files.

A simple checkpoint table works well: job ID, chunk identifier, status (pending, processing, completed, failed), started_at, completed_at, error message if failed. This table is also your monitoring dashboard.
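A sketch of that checkpoint table using SQLite for brevity; the table and function names are illustrative, and a production system would use your primary database:

```python
import sqlite3

def init_checkpoints(conn):
    conn.execute("""
        CREATE TABLE IF NOT EXISTS checkpoints (
            job_id       TEXT,
            chunk_id     INTEGER,
            status       TEXT DEFAULT 'pending',  -- pending/processing/completed/failed
            started_at   TEXT,
            completed_at TEXT,
            error        TEXT,
            PRIMARY KEY (job_id, chunk_id)
        )""")

def pending_chunks(conn, job_id, total_chunks):
    """Register chunks on first run, then return whatever is not completed,
    so a restarted job resumes exactly where it left off."""
    for i in range(total_chunks):
        conn.execute(
            "INSERT OR IGNORE INTO checkpoints (job_id, chunk_id) VALUES (?, ?)",
            (job_id, i))
    rows = conn.execute(
        "SELECT chunk_id FROM checkpoints "
        "WHERE job_id = ? AND status != 'completed' ORDER BY chunk_id",
        (job_id,)).fetchall()
    return [r[0] for r in rows]

def mark_completed(conn, job_id, chunk_id):
    conn.execute(
        "UPDATE checkpoints SET status = 'completed', "
        "completed_at = datetime('now') WHERE job_id = ? AND chunk_id = ?",
        (job_id, chunk_id))
```

Querying this table by status is exactly the monitoring view the paragraph describes: pending counts show backlog, failed rows show what needs attention.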


Architecture for Scale

When batch volumes grow beyond what a single process can handle in the available time window, you need parallel processing. The architecture for parallel batch processing has a few established patterns.

Partitioned processing. Divide the input dataset into partitions — by customer ID range, by date, by geographic region — and process each partition independently. Partitions can run on different servers or in different processes on the same server. The key constraint is that partitions must be independent: no partition should need to read or write data that belongs to another partition.

This maps naturally to the shared-nothing principle from distributed systems fundamentals. Each partition owns its data, does its work, and reports its status.
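A minimal sketch of hash-based partitioning by customer ID; the record shape and partition count are assumptions for illustration:

```python
def partition_key(customer_id, num_partitions=4):
    """Assign a record to a partition by its customer ID; records in
    different partitions never share data, so partitions are independent."""
    return customer_id % num_partitions

def partition(records, num_partitions=4):
    """Split records into independent partitions that can each run on a
    separate worker or server."""
    parts = [[] for _ in range(num_partitions)]
    for rec in records:
        parts[partition_key(rec["customer_id"], num_partitions)].append(rec)
    return parts
```

Range-based keys (ID ranges, dates, regions) work the same way; the only requirement is that the key function sends every record to exactly one partition.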

Leader-worker coordination. A leader process scans the input, creates work items, and writes them to a queue. Worker processes pull items from the queue and process them independently. This decouples the rate of work discovery from the rate of work execution and lets you scale workers horizontally.

The queue provides natural backpressure and load balancing. If one worker is slow (maybe it's processing a particularly complex record), the other workers pick up the slack. If a worker crashes, its in-progress items time out and become available for another worker to pick up.
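The leader-worker shape can be sketched with a thread-safe queue; real systems would use a durable queue with visibility timeouts rather than in-process threads, and the doubling "work" is a stand-in:

```python
import queue
import threading

def leader(work_queue, items):
    """Leader scans the input and enqueues one work item per record."""
    for item in items:
        work_queue.put(item)

def worker(work_queue, results, lock):
    """Workers pull items until the queue drains, processing independently."""
    while True:
        try:
            item = work_queue.get_nowait()
        except queue.Empty:
            return
        with lock:
            results.append(item * 2)  # stand-in for real per-item processing
        work_queue.task_done()

def run(items, num_workers=3):
    q, results, lock = queue.Queue(), [], threading.Lock()
    leader(q, items)
    threads = [threading.Thread(target=worker, args=(q, results, lock))
               for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Because workers only pull when free, a slow worker naturally takes fewer items and the fast ones absorb the rest, which is the backpressure and load balancing the paragraph describes.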

Time window management. Most batch jobs have a time window — the nightly job must complete before business hours, the monthly close must finish before the reporting deadline. Monitor your batch execution times and alert when they approach the window boundary. A job that takes 4 hours today in a 6-hour window will take 8 hours after your data doubles if you don't plan for it.
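The paragraph's arithmetic assumes runtime scales roughly linearly with data volume, which is a reasonable first approximation for chunked jobs; a tiny projection check makes the alert concrete:

```python
def projected_runtime(current_hours, data_growth_factor):
    """Project runtime after data growth, assuming roughly linear scaling
    (an approximation; real jobs can scale worse)."""
    return current_hours * data_growth_factor

def window_at_risk(current_hours, window_hours, data_growth_factor,
                   headroom=0.8):
    """Alert when projected runtime exceeds a fraction of the window,
    not just when it blows past it."""
    return projected_runtime(current_hours, data_growth_factor) \
        > window_hours * headroom
```

With the article's numbers, a 4-hour job in a 6-hour window projects to 8 hours after data doubles, well past the window.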


Error Handling and Recovery

Batch jobs fail. Records have bad data. External services are unavailable. Disk fills up. The quality of a batch system is measured by how gracefully it handles failure, not by whether it fails.

Record-level error isolation. A single bad record should not fail the entire batch. Isolate processing errors to the individual record: log the error, mark the record as failed with the reason, and continue processing the rest of the chunk. After the batch completes, you have a clear list of failed records that can be investigated and reprocessed.
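Record-level isolation is a try/except around the individual record, not the chunk; a sketch with an assumed `handle_record` callback:

```python
def process_chunk(records, handle_record):
    """Process each record individually: a bad record is captured with its
    error and skipped, never allowed to fail the whole chunk."""
    succeeded, failed = [], []
    for rec in records:
        try:
            handle_record(rec)
            succeeded.append(rec)
        except Exception as exc:
            failed.append({"record": rec, "error": str(exc)})
    return succeeded, failed
```

The `failed` list is exactly the post-batch report the paragraph describes: each entry names the record and the reason it failed.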

Retry with backoff. For transient errors — network timeouts, database connection drops, rate-limited API calls — implement automatic retry with exponential backoff at the chunk level. Three retries with increasing delays handles most transient issues. After the retry limit, mark the chunk as failed and move on.
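A minimal exponential-backoff wrapper; the base delay is kept tiny here for illustration, where production values would be seconds:

```python
import time

def with_retry(operation, max_attempts=3, base_delay=0.01):
    """Retry a transient-failure-prone operation with exponential backoff:
    delays of base_delay, 2x, 4x, ... between attempts."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # retries exhausted: caller marks the chunk failed
            time.sleep(base_delay * (2 ** attempt))
```

Adding random jitter to the delay is a common refinement so that many failed workers don't all retry at the same instant.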

Dead letter handling. Records that fail repeatedly after retries need to go somewhere for human review. A dead letter table or queue collects these permanently-failed records with their error details. This is essential for operations teams who need to understand why records are failing and fix the upstream data.
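Routing between retry and the dead letter store can be sketched as a simple threshold check; the record shape and attempt counter are assumptions:

```python
def route_failures(failed_records, dead_letters, max_attempts=3):
    """Move records that have exhausted their retries into the dead letter
    store with their error details; return the rest for requeueing."""
    requeue = []
    for item in failed_records:
        if item["attempts"] >= max_attempts:
            dead_letters.append(item)  # permanently failed: human review
        else:
            requeue.append(item)
    return requeue
```

The dead letter store keeps the original record alongside its error, so operations teams can trace the failure back to the upstream data.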

Compensation and rollback. Some batch operations need the ability to undo their work. If you're posting journal entries and the batch fails halfway through, can you reverse the posted entries? Design compensation operations upfront for any batch that modifies financial or compliance-sensitive data.

The patterns here overlap significantly with what you'd apply in event-driven architecture — the difference is that batch processing applies them in scheduled bursts rather than continuous streams.


Monitoring and Observability

A batch system without monitoring is a time bomb. You need visibility into three things.

Job-level metrics. Did the job start? Did it finish? How long did it take? How many records were processed? How many failed? These go into your monitoring dashboard and your alerting rules.

Trend analysis. Is the job taking longer each week? Is the failure rate increasing? Batch jobs that gradually slow down are signaling that your data volume is outgrowing your processing capacity or that a dependency is degrading.

Business-level validation. After a batch completes, validate the output against business expectations. If your nightly invoice generation usually produces 800-1,200 invoices and tonight it produced 12, something is wrong even though the job technically succeeded. Anomaly detection on batch output catches problems that technical monitoring misses.
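The simplest form of that validation is a range check against the historical output band; the bounds here are the article's illustrative numbers:

```python
def validate_output(count, expected_min, expected_max):
    """Flag a batch whose output falls outside the expected range, even
    though the job itself reported success."""
    if count < expected_min:
        return f"anomaly: produced {count}, expected at least {expected_min}"
    if count > expected_max:
        return f"anomaly: produced {count}, expected at most {expected_max}"
    return None  # output within the normal band
```

More sophisticated setups derive the band from a rolling history (for example, mean plus or minus a few standard deviations) instead of hard-coding it.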

Batch processing isn't glamorous, but it's the backbone of most enterprise data operations. Getting the architecture right means the difference between systems that run unattended for years and systems that wake someone up every week.

If you're designing a batch processing system or scaling an existing one, let's talk through the architecture.

