Architecture · 8 min read · September 5, 2025

Enterprise Data Pipeline Architecture: Moving Data Reliably at Scale

Data pipelines are the plumbing of enterprise systems. Here's how to design pipelines that move data reliably, handle failures gracefully, and scale with your business.

James Ross Jr.

Strategic Systems Architect & Enterprise Software Developer

Data Pipelines Are Infrastructure, Not Projects

Every enterprise has data moving between systems. Sales data flows from the CRM to the data warehouse. Order data flows from the ERP to the accounting system. Customer data flows from the website to the marketing platform. Inventory levels flow from the warehouse management system to the e-commerce storefront.

When these flows are handled by manual exports, scheduled email reports, or ad-hoc scripts, they work until they don't. A script fails silently on a Friday night and Monday morning starts with incorrect inventory counts. A format change in the source system breaks the CSV parser and nobody notices until the monthly financial close is wrong.

Data pipeline architecture replaces these fragile ad-hoc flows with reliable, monitored, recoverable infrastructure for moving data between systems. It's not glamorous work, but it's the foundation that makes enterprise reporting and analytics possible.


ETL vs. ELT: The Architecture Decision

The traditional data pipeline pattern is ETL — Extract, Transform, Load. Data is extracted from source systems, transformed into the target format (cleaned, enriched, aggregated), and loaded into the destination. The transformation happens in the pipeline before the data reaches the target.

The modern alternative is ELT — Extract, Load, Transform. Data is extracted from source systems and loaded into the destination (typically a data warehouse) in its raw form. Transformation happens inside the data warehouse using SQL or a transformation framework. The warehouse's compute resources handle the transformation rather than a separate processing layer.

ETL makes sense when the target system has limited storage or compute (you want to load only clean, aggregated data), when transformations require business logic that's better expressed in application code than SQL, or when the pipeline needs to enrich data from multiple sources before loading.

ELT makes sense when the target is a modern data warehouse with abundant compute (BigQuery, Snowflake, Redshift), when keeping raw data preserves optionality for future analysis, or when transformations are primarily relational operations that SQL handles naturally.

For most enterprise data pipelines today, ELT is the more practical choice. Modern warehouses are designed for exactly this workload, and keeping raw data in the warehouse means you can add new transformations without re-extracting from source systems.


Pipeline Architecture Patterns

Source connectors abstract the details of extracting data from each source system. A connector for a REST API handles pagination, authentication, and rate limiting. A connector for a database handles connection management, query execution, and incremental extraction using timestamps or change data capture. Each connector produces a stream of records in a standardized internal format.
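As a minimal sketch of the connector idea, the generator below pages through a REST-style source until a short page signals the end of the data. The `fetch_page` callable and the simulated 250-record source are illustrative stand-ins; a real connector would also handle authentication, rate-limit backoff, and error retries.

```python
from typing import Callable, Iterator

def paginate(fetch_page: Callable[[int], dict], page_size: int = 100) -> Iterator[dict]:
    """Yield records from a paginated source until a short page signals the end."""
    offset = 0
    while True:
        page = fetch_page(offset)      # stands in for e.g. GET /records?offset=N&limit=page_size
        records = page["records"]
        yield from records
        if len(records) < page_size:   # a short page means no more data
            break
        offset += page_size

# Simulated source with 250 records, served 100 at a time.
DATA = [{"id": i} for i in range(250)]

def fake_fetch(offset: int) -> dict:
    return {"records": DATA[offset:offset + 100]}

extracted = list(paginate(fake_fetch))
```

Because the connector yields a plain stream of records, downstream stages don't need to know whether the source was an API, a database, or a file.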

Incremental extraction is critical for performance and scalability. Full extractions — pulling all data from the source on every run — work for small datasets but become impractical as data grows. Incremental extraction tracks the last successfully extracted record (using a timestamp, a sequence number, or a change log) and extracts only new or modified records on each run.
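A timestamp watermark, the simplest form of this tracking, can be sketched as follows. The toy rows and ISO-8601 string comparison are illustrative; a real pipeline would persist the watermark in a state store so the next run resumes where this one left off.

```python
# Toy source table: each row carries an updated_at timestamp (ISO-8601 strings
# compare correctly as plain strings).
ROWS = [
    {"id": 1, "updated_at": "2025-09-01T10:00:00"},
    {"id": 2, "updated_at": "2025-09-02T11:30:00"},
    {"id": 3, "updated_at": "2025-09-03T09:15:00"},
]

def extract_incremental(rows, watermark: str):
    """Return rows modified after the stored watermark, plus the new watermark."""
    new_rows = [r for r in rows if r["updated_at"] > watermark]
    new_watermark = max((r["updated_at"] for r in new_rows), default=watermark)
    return new_rows, new_watermark

# Only rows 2 and 3 were modified after the last run's watermark.
batch, wm = extract_incremental(ROWS, "2025-09-01T23:59:59")
```

Note the weakness the next paragraph addresses: a deleted row never appears in `batch` at all, which is why CDC is preferred where available.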

Change Data Capture (CDC) is the gold standard for incremental extraction from databases. CDC captures the stream of changes (inserts, updates, deletes) from the database's transaction log and feeds them into the pipeline. This is more reliable than timestamp-based extraction because it captures deletes and doesn't miss records that were modified between extraction runs. PostgreSQL's logical replication and tools like Debezium provide CDC capabilities.

Transformation layers clean, validate, enrich, and reshape data. In an ELT architecture, these are SQL-based transformations that run in the data warehouse. Tools like dbt (data build tool) provide a framework for defining transformations as SQL models with dependency management, testing, and documentation. Each transformation is versioned, tested, and repeatable.

Orchestration coordinates the execution of pipeline stages. A pipeline that extracts from three source systems, loads into the warehouse, and then runs five transformation models has dependencies: transformations can't run until loads complete, loads can't run until extractions complete. An orchestration layer (Airflow, Dagster, Prefect, or even well-structured cron jobs for simple cases) manages this dependency graph, handles retries on failure, and provides visibility into pipeline status.
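The dependency graph described above can be sketched in a few lines using the standard library's `graphlib`. The stage names are hypothetical, and the runner is deliberately minimal — a real orchestrator adds retries, parallelism, and per-stage status tracking.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: three extracts feed one load, which feeds two transforms.
# Each key maps to the set of stages it depends on.
DAG = {
    "extract_crm": set(),
    "extract_erp": set(),
    "extract_web": set(),
    "load_warehouse": {"extract_crm", "extract_erp", "extract_web"},
    "transform_orders": {"load_warehouse"},
    "transform_customers": {"load_warehouse"},
}

def run_pipeline(dag, run_stage):
    """Execute stages in dependency order and return the order used."""
    order = list(TopologicalSorter(dag).static_order())
    for stage in order:
        run_stage(stage)
    return order

executed = run_pipeline(DAG, run_stage=lambda stage: None)
```

Tools like Airflow and Dagster express the same idea: declare the graph, and let the framework schedule, retry, and report on it.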


Error Handling and Data Quality

Data pipelines fail. Sources go offline. Schemas change without notice. Records have unexpected formats. The quality of a pipeline is measured by how it handles these failures.

Retry with idempotency is foundational. When a pipeline stage fails, the orchestrator should retry it. For retries to be safe, every stage must be idempotent — running it twice with the same input produces the same result without duplication. This means either using upsert operations in the load stage or tracking processed records to skip duplicates.
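The upsert version of this idea can be shown with a dictionary standing in for the target table, keyed by primary key. Replaying the same batch leaves the target unchanged, which is exactly what makes retries safe.

```python
def upsert(target: dict, records):
    """Merge records into the target keyed by id; re-running the same batch
    is a no-op rather than a source of duplicates."""
    for rec in records:
        target[rec["id"]] = rec
    return target

warehouse = {}
batch = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}]

upsert(warehouse, batch)
upsert(warehouse, batch)  # retry after a simulated failure: no duplication
```

In a real warehouse the same property comes from `MERGE`/`INSERT ... ON CONFLICT` statements rather than an in-memory dict.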

Dead letter queues collect records that fail validation or transformation. Rather than failing the entire pipeline for a single bad record, move the problematic record to a dead letter queue with the error details, and continue processing. Operations teams can review and remediate dead letter records independently of the pipeline's normal operation.
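A minimal sketch of the pattern: each record is transformed independently, and failures are captured with their error details instead of aborting the batch. The list-based queue is illustrative; production systems typically use a durable queue or an error table.

```python
def process_with_dlq(records, transform):
    """Apply transform per record; failed records land in a dead letter
    queue with error details while the rest of the batch continues."""
    output, dead_letters = [], []
    for rec in records:
        try:
            output.append(transform(rec))
        except Exception as exc:
            dead_letters.append({"record": rec, "error": str(exc)})
    return output, dead_letters

records = [{"amount": "10"}, {"amount": "oops"}, {"amount": "30"}]
good, dlq = process_with_dlq(records, lambda r: {"amount": int(r["amount"])})
```

One bad record out of three ends up in `dlq` with its error message; the other two flow through normally.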

Schema validation at extraction time catches format changes before they propagate through the pipeline. When the source system changes a column type or adds a required field, the pipeline should detect this mismatch, alert the operations team, and either handle it gracefully or stop processing rather than loading corrupt data.
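A simple field-and-type check illustrates the idea. The expected schema below is a made-up example; real pipelines often use a schema registry or a validation library, but the principle is the same — detect the mismatch at the boundary, before bad data propagates.

```python
# Hypothetical expected schema for an orders feed.
EXPECTED_SCHEMA = {"order_id": int, "total": float, "currency": str}

def validate_schema(record: dict, schema: dict):
    """Return a list of mismatches; an empty list means the record conforms."""
    problems = []
    for field, expected_type in schema.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"{field}: expected {expected_type.__name__}, "
                            f"got {type(record[field]).__name__}")
    return problems

# The source started sending totals as strings — caught before load.
issues = validate_schema({"order_id": 7, "total": "19.99", "currency": "USD"},
                         EXPECTED_SCHEMA)
```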

Data quality checks run after transformation to validate that the output meets expectations. Row counts should be within expected ranges. Aggregate totals should be consistent with source systems. Null rates for required fields should be zero. These checks catch logic errors in transformations and data quality issues in source systems. The patterns here overlap with the distributed systems fundamentals of building reliable systems from unreliable components.
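The row-count and null-rate checks above can be sketched directly; the thresholds and field names here are illustrative. Frameworks like dbt express the same checks declaratively as tests on models.

```python
def quality_checks(rows, min_rows, max_rows, required_fields):
    """Post-transformation checks: row count within an expected range and
    zero nulls in required fields. Returns a list of failure descriptions."""
    failures = []
    if not (min_rows <= len(rows) <= max_rows):
        failures.append(f"row count {len(rows)} outside [{min_rows}, {max_rows}]")
    for field in required_fields:
        nulls = sum(1 for r in rows if r.get(field) is None)
        if nulls:
            failures.append(f"{field}: {nulls} null(s) in required field")
    return failures

rows = [{"id": 1, "email": "a@example.com"}, {"id": 2, "email": None}]
failures = quality_checks(rows, min_rows=1, max_rows=100,
                          required_fields=["email"])
```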


Monitoring and Observability

Pipeline monitoring needs to answer three questions at all times: Is the pipeline running? Is it running correctly? And is it running on time?

Execution monitoring tracks whether each pipeline run started, completed, or failed. Alerts fire on failures. Dashboards show the status of each pipeline and its stages.

Data freshness monitoring tracks the lag between when data was created in the source system and when it becomes available in the destination. If your pipeline runs every hour but the data in the warehouse is 6 hours stale, something is wrong even if every run reports success — the pipeline may be completing cleanly while processing old data.
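Computing the lag is a one-line subtraction; the timestamps and the one-hour cadence below are illustrative. The interesting part is the alert condition: compare the lag against the pipeline's expected cadence, not against the run's success status.

```python
from datetime import datetime, timedelta, timezone

def freshness_lag(latest_source_ts: datetime, now: datetime) -> timedelta:
    """Lag between the newest record visible in the warehouse and now."""
    return now - latest_source_ts

# Hypothetical check for an hourly pipeline.
now = datetime(2025, 9, 5, 12, 0, tzinfo=timezone.utc)
latest = datetime(2025, 9, 5, 6, 0, tzinfo=timezone.utc)

lag = freshness_lag(latest, now)
stale = lag > timedelta(hours=1)  # hourly cadence, so 6 hours of lag is an alert
```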

Volume monitoring tracks the number of records processed in each run. A sudden drop in volume — the pipeline that usually processes 10,000 records processed 100 — signals a source system issue even though the pipeline itself succeeded. A sudden spike might indicate duplicate extraction or a source system backfill that needs special handling.
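A simple tolerance band around a baseline captures both failure modes. The baseline and tolerance values are illustrative; in practice the baseline might be a rolling average of recent runs rather than a fixed number.

```python
def volume_anomaly(count: int, baseline: int, tolerance: float = 0.5):
    """Flag runs whose record count deviates from the baseline by more than
    the tolerance fraction: 'drop', 'spike', or None if within range."""
    low, high = baseline * (1 - tolerance), baseline * (1 + tolerance)
    if count < low:
        return "drop"
    if count > high:
        return "spike"
    return None

# The run that usually processes 10,000 records processed 100.
alert = volume_anomaly(100, baseline=10_000)
```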

Cost monitoring matters for cloud-based pipelines where compute and storage are billed per use. A transformation query that scans the entire warehouse on every run might work functionally but cost 10x what an incremental approach would cost.

Data pipelines are the connective tissue of enterprise data architecture. Build them with the same rigor you'd apply to any production system: tested, monitored, documented, and designed for failure recovery.

If you're designing data pipeline architecture, let's discuss the right approach for your systems.
