Building a reliable data pipeline involves a blend of engineering discipline, architectural foresight, and operational rigor. The goal of a data pipeline is to squeeze more value from a company’s data once most of it has been digitized. Data pipelines make integrated and verified data available for one or more of the following uses:
- Operational applications such as manufacturing planning or control.
- Data analysis and visualizations, including dashboards, for problem analysis and forecasting.
- AI applications for insights into manufacturing troubleshooting, optimization, or innovation.
A data pipeline is also known as an extract, transform, load (ETL) application. While every engineer’s implementation is shaped by their business objectives, data source technology, and governance requirements, most pipeline projects share a consistent sequence of steps. A clear understanding of these steps enables engineers to deliver more predictable outcomes, better maintainability, and smoother scaling as data volumes and use cases expand.
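As a point of reference, the sketch below shows the ETL pattern at its smallest: an extract step, a transform step, and a load step chained together. The file names, column name, and cleaning rule are illustrative assumptions, not part of any specific project.

```python
# Minimal ETL sketch: extract from a source, transform, then load to a staging file.
# The file paths, the plant_code column, and the cleaning rule are illustrative assumptions.
import csv


def extract(path):
    """Read raw rows from a source CSV file."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))


def transform(rows):
    """Standardize one illustrative field: strip whitespace and upper-case plant codes."""
    for row in rows:
        row["plant_code"] = row.get("plant_code", "").strip().upper()
    return rows


def load(rows, path):
    """Write transformed rows to a staging file."""
    if not rows:
        return
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)


if __name__ == "__main__":
    load(transform(extract("source_orders.csv")), "staging_orders.csv")
```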
1. Project organization
A data pipeline project, like all projects, will be more successful with a reasonable amount of initial planning. The likely steps include:
- Identifying an executive champion with a relevant data problem.
- Writing a project charter that describes the project, including resources, likely budget and approximate schedule.
- Seeking approval for the project charter from the executive champion and the steering committee.
- Assembling a project team that includes multiple disciplines.
2. Requirements definition
Successful data pipelines begin with a detailed project requirements definition that articulates:
- Business objectives that focus on expected analytics and insights.
- The list of data sources. A simple data pipeline will have fewer than 5 data sources.
- The list of data consumers. It often includes engineers, business analysts, and data scientists.
- The list of consuming applications, data visualizations, dashboards and reports.
- The expected service level. A frequent expectation is that a data pipeline will be back in operation after an outage within 1 day.
- The acceptable data quality standard.
“When building data pipelines, many organizations fail to include sufficient functionality for data pipeline monitoring,” says Nelson Petracek, formerly the Chief Technology & Product Officer at Board International, a leading business intelligence and performance management software vendor. “Monitoring ensures that the expected data actually arrives where it needs to be in the correct time frame, and with the right level of accuracy. Without monitoring, the organization cannot depend on their data. Lack of confidence eliminates most of the benefits associated with deploying a data pipeline.”
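A monitoring check can start small. The sketch below, which assumes a 24-hour freshness window and a minimum row count chosen purely for illustration, verifies that the expected data arrived recently and in plausible volume.

```python
# Minimal monitoring sketch: verify that a dataset arrived on time and with a
# plausible row count. The freshness window and minimum row count are assumptions.
from datetime import datetime, timedelta, timezone


def check_freshness_and_volume(last_load_time, row_count,
                               max_age=timedelta(hours=24), min_rows=1000):
    """Return a list of alert messages; an empty list means the checks passed."""
    alerts = []
    if datetime.now(timezone.utc) - last_load_time > max_age:
        alerts.append(f"Data is stale: last load at {last_load_time.isoformat()}")
    if row_count < min_rows:
        alerts.append(f"Row count {row_count} is below the expected minimum {min_rows}")
    return alerts


# Example call; in practice these values would come from pipeline metadata, not literals.
print(check_freshness_and_volume(
    last_load_time=datetime.now(timezone.utc) - timedelta(hours=30),
    row_count=250,
))
```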
The cost and elapsed time of a data pipeline project are largely proportional to the number of data sources, the complexity of the data transformation and the number of artifacts to be produced.
3. Data source inventory
Teams determine what data the pipeline needs, where it originates, and how it will be used. The data storage technology of the data sources often varies significantly. Examples include databases, files, exports from SaaS applications, real-time records from Industrial Internet of Things (IIoT) devices, and data from external vendors.
When engineers implement data pipelines, they often start by bringing together manufacturing and financial data. Later, they may add real-time IIoT data from the production process, along with performance metrics such as quality, yield, and schedule.
This step sets the foundation for design decisions regarding data transformation logic, latency, and storage choices.
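One lightweight way to keep the inventory usable is to record it as structured data rather than prose. The entries below are illustrative assumptions about source names, owners, refresh patterns, and volumes.

```python
# Illustrative data source inventory entries; the fields and values are assumptions,
# not a prescribed schema.
data_source_inventory = [
    {
        "name": "erp_orders",
        "technology": "relational database",
        "owner": "manufacturing planning",
        "refresh": "nightly batch",
        "expected_volume": "about 50k rows/day",
    },
    {
        "name": "line3_sensors",
        "technology": "IIoT event stream",
        "owner": "production engineering",
        "refresh": "near real time",
        "expected_volume": "about 200 events/second",
    },
]

# A quick summary view of the inventory.
for source in data_source_inventory:
    print(f"{source['name']}: {source['technology']} ({source['refresh']})")
```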
4. Data ingestion and connectivity design
With requirements and a data source inventory in place, attention turns to how the pipeline will bring data into the staging environment. The two connectivity strategies are:
- Batch pipelines run scheduled extracts, often overnight, using connectors supplied by software vendors or API calls.
- Streaming pipelines use message brokers or event capture services to ingest data in near real time.
Building streaming pipelines requires more specialized skills than building batch pipelines.
The data ingestion design should also address authentication, in-transit encryption, and the avoidance of data duplication.
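The sketch below illustrates the batch approach against a hypothetical REST endpoint: it authenticates with a token, relies on HTTPS for in-transit encryption, and carries a high-water mark forward to avoid re-ingesting the same records. The endpoint, token handling, and watermark column are assumptions for illustration.

```python
# Batch ingestion sketch: pull new records from a hypothetical REST API and land them
# in staging. The endpoint, token handling, and watermark column are assumptions.
import json
import os
import urllib.request


def extract_batch(since_timestamp, staging_path):
    """Fetch records changed since the last run and write them to a staging file."""
    url = f"https://example.com/api/orders?updated_after={since_timestamp}"  # HTTPS provides in-transit encryption
    request = urllib.request.Request(
        url, headers={"Authorization": f"Bearer {os.environ['API_TOKEN']}"}
    )
    with urllib.request.urlopen(request) as response:
        records = json.load(response)

    with open(staging_path, "w") as f:
        json.dump(records, f)

    # Persisting the high-water mark prevents re-ingesting the same rows on the next run.
    new_watermark = max((r["updated_at"] for r in records), default=since_timestamp)
    return new_watermark
```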
5. Source data management
The next step in building a data pipeline is to copy data from all the data sources to the staging environment, often a data lake. Preserving the original data supports auditability, replay, and recovery.
Teams typically collect metadata, including file sizes, schema definitions, load times, and quality indicators, as they copy source data into the data lake.
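A minimal landing routine might copy each source file into a dated folder in the data lake and write a small metadata record alongside it. The lake layout and metadata fields below are illustrative assumptions.

```python
# Sketch of metadata capture while landing a source file in the data lake.
# The lake layout and metadata fields are illustrative assumptions.
import csv
import json
import shutil
from datetime import datetime, timezone
from pathlib import Path


def land_file(source_file, lake_dir="datalake/raw"):
    """Copy a source file unchanged into the lake and record basic metadata beside it."""
    source = Path(source_file)
    target_dir = Path(lake_dir) / datetime.now(timezone.utc).strftime("%Y-%m-%d")
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / source.name
    shutil.copy2(source, target)           # preserve the original, untransformed data

    with open(source, newline="") as f:
        columns = next(csv.reader(f), [])  # header row serves as a simple schema record

    metadata = {
        "file": source.name,
        "size_bytes": source.stat().st_size,
        "columns": columns,
        "loaded_at": datetime.now(timezone.utc).isoformat(),
    }
    target.with_suffix(".metadata.json").write_text(json.dumps(metadata, indent=2))
    return metadata
```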
6. Data profiling and quality assessment
Before transforming the data in the pipeline, the team examines the data from each data source to understand its structure, patterns, completeness, and anomalies. Teams use automated profiling tools to generate statistics on:
- Null distribution. Nulls indicate data quality lapses.
- Mismatches of key values. Mismatches prevent joins required for data transformation.
- Unlikely values for non-key data columns. Unlikely values indicate data quality lapses.
- Frequency of outlier values. Outlier values suggest possible data quality lapses.
- Data type inconsistencies. These prevent comparisons in queries.
These data issues manifest as misleading query results that undermine engineers’ confidence in the associated recommendations.
These assessments identify the required cleansing rules and provide an initial estimate of the effort needed to standardize and clean the data. Establishing data quality expectations early in the pipeline reduces future rework and provides transparency to end-users.
The cost and complexity of data profiling and quality assessment are largely proportional to the number of data columns of interest across all data sources.
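For tabular sources, a few lines of pandas can produce most of the statistics listed above. The file names, column names, and the 3-sigma outlier rule in this sketch are assumptions for illustration.

```python
# Profiling sketch using pandas: compute null rates, data types, outlier counts,
# and key-match rates. File names, column names, and thresholds are assumptions.
import pandas as pd

orders = pd.read_csv("staging_orders.csv")
plants = pd.read_csv("staging_plants.csv")

# Null distribution per column.
null_rates = orders.isna().mean().sort_values(ascending=False)

# Data types, to spot inconsistencies such as numeric codes read as strings.
dtypes = orders.dtypes

# Outlier frequency for one numeric column, using a simple 3-sigma rule.
qty = orders["quantity"]
outlier_count = ((qty - qty.mean()).abs() > 3 * qty.std()).sum()

# Key mismatches: order rows whose plant_code has no match in the plants table.
unmatched_keys = (~orders["plant_code"].isin(plants["plant_code"])).sum()

print(null_rates)
print(dtypes)
print(f"quantity outliers: {outlier_count}")
print(f"orders with unmatched plant_code: {unmatched_keys}")
```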
7. Data cleansing and standardization design
With data quality insights in hand, the project proceeds to shape the pipeline data for integration and analytics. The design involves defining rules for:
- Recognizing and removing duplicates.
- Harmonizing data types.
- Standardizing data values for keys, reference codes and units of measure.
- Filling in missing values and correcting unlikely values where possible, and reporting them where correction is not possible.
The design also includes reporting the number of:
- Rows processed.
- Changes made.
- Changes attempted but not made.
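The sketch below shows how such cleansing rules and the accompanying counts might look for a single tabular source; the column names, default values, and rules are illustrative assumptions.

```python
# Cleansing sketch: de-duplicate, harmonize types, standardize codes, fill missing
# values, and report counts. Column names, defaults, and rules are assumptions.
import pandas as pd

df = pd.read_csv("staging_orders.csv")
rows_processed = len(df)

# Recognize and remove duplicates.
before = len(df)
df = df.drop_duplicates()
duplicates_removed = before - len(df)

# Harmonize data types and standardize key values.
df["quantity"] = pd.to_numeric(df["quantity"], errors="coerce")
df["plant_code"] = df["plant_code"].str.strip().str.upper()

# Fill missing values where a safe default exists; report what could not be corrected.
missing_filled = df["unit_of_measure"].isna().sum()
df["unit_of_measure"] = df["unit_of_measure"].fillna("EA")
changes_not_made = df["quantity"].isna().sum()   # values that could not be corrected automatically

print(f"Rows processed: {rows_processed}")
print(f"Changes made: {duplicates_removed + missing_filled}")
print(f"Changes attempted but not made: {changes_not_made}")
```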
This sequence of steps, and the ones in the follow-on article, describes a disciplined, repeatable approach to building robust, scalable pipelines aligned with business needs.