
Best practices for developing data-integration pipelines

Oct 31, 2018 10:00:00 AM

 

Mumbai, October 31, 2018: Data-integration pipeline platforms move data from a source system to a downstream destination system. Because data pipelines can deliver mission-critical data and power important business decisions, ensuring their accuracy and performance is required whether you implement them through scripts, data-integration and ETL (extract, transform, and load) platforms, data-prep technologies, or real-time data-streaming architectures.

When you implement data-integration pipelines, you should consider several best practices early in the design phase to ensure that the data processing is robust and maintainable. Whether or not you formalize it, there's an inherent service level in these data pipelines, because they can affect whether reports are generated on schedule or whether applications have the latest data for users. There is also an ongoing need for IT to make enhancements to support new data requirements, handle increasing data volumes, and address data-quality issues.

If you’ve worked in IT long enough, you’ve probably seen the good, the bad, and the ugly when it comes to data pipelines. Figuring out why a data-pipeline job failed when it was written as a single, several-hundred-line database stored procedure with no documentation, logging, or error handling is not an easy task. So, when engineering new data pipelines, consider some of these best practices to avoid such ugly results.

Apply modular design principles to data pipelines

As a data-pipeline developer, you should consider the architecture of your pipelines so they are adaptable to future needs and easy to troubleshoot when there are issues. You can do this by modularizing the pipeline into building blocks, with each block handling one processing step and then passing processed data to the next block. ETL platforms from vendors such as Informatica, Talend, and IBM provide visual programming paradigms that make it easy to develop building blocks into reusable modules that can then be applied to multiple data pipelines.
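
As a minimal sketch of this idea (not tied to any of the vendor tools mentioned here), a pipeline can be broken into small functions that each perform one step and hand their output to the next; the block names and the CSV source are illustrative assumptions.

```python
# A minimal sketch of a modular pipeline: each block does one processing
# step and passes its output to the next block.
import csv

def extract(path):
    """Read raw rows from a source file."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Normalize one field; each block stays small, testable, and reusable."""
    for row in rows:
        row["amount"] = float(row.get("amount") or 0)
    return rows

def load(rows, path):
    """Write processed rows to a destination file (assumes at least one row)."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)

def run_pipeline(source, destination):
    # Blocks are composed explicitly, so any one of them can be swapped,
    # reused in another pipeline, or tested in isolation.
    load(transform(extract(source)), destination)
```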

Moustafa Elshaabiny, a full-stack developer at CharityNavigator.org, has been using IBM Datastage to automate data pipelines. He says that “building our data pipeline in a modular way and parameterizing key environment variables has helped us both identify and fix issues that arise quickly and efficiently. Modularity makes narrowing down a problem much easier, and parametrization makes testing changes and rerunning ETL jobs much faster.”

Other general software development best practices are also applicable to data pipelines:

  • Environment variables and other parameters should be set in configuration files or other tooling that makes it easy to configure jobs for run-time needs (see the sketch after this list).
  • The underlying code should be versioned, ideally in a standard version control repository.
  • Separate environments for development, testing, production, and disaster recovery should be commissioned with a CI/CD pipeline to automate deployments of code changes.
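
As a minimal sketch of the configuration point above, run-time parameters might be read from a config file and overridden by environment variables; the file name, keys, and variable names are illustrative assumptions, not a specific tool's convention.

```python
# A minimal sketch of run-time configuration: defaults, then a config file,
# then environment-variable overrides.
import json
import os

DEFAULTS = {"source_path": "/data/incoming", "batch_size": 1000}

def load_config(path="pipeline_config.json"):
    config = dict(DEFAULTS)
    if os.path.exists(path):
        with open(path) as f:
            config.update(json.load(f))
    # Environment variables win, so the same job can be pointed at
    # dev, test, or production without a code change.
    if "PIPELINE_SOURCE_PATH" in os.environ:
        config["source_path"] = os.environ["PIPELINE_SOURCE_PATH"]
    if "PIPELINE_BATCH_SIZE" in os.environ:
        config["batch_size"] = int(os.environ["PIPELINE_BATCH_SIZE"])
    return config
```
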
Validate the accuracy of data throughout the pipeline

It’s not enough to process data in blocks and modules to guarantee a robust pipeline. Data sources may change, and the underlying data may have quality issues that surface only at runtime. To make the pipeline robust, you should implement a mix of logging, exception handling, and data validation at every block.
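
A minimal sketch of what per-block logging and exception handling might look like, assuming each block is a Python function operating on a list of rows; the block names and the retry-or-halt decision are placeholders.

```python
# A minimal sketch of wrapping each pipeline block with logging and
# exception handling.
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("pipeline")

def run_block(name, func, data):
    """Run one pipeline block, logging row counts and surfacing failures."""
    logger.info("Starting block %s with %d rows", name, len(data))
    try:
        result = func(data)
    except Exception:
        logger.exception("Block %s failed", name)
        raise  # let the scheduler or operator decide whether to retry or halt
    logger.info("Finished block %s with %d rows", name, len(result))
    return result
```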

When implementing data validation in a data pipeline, you should decide how to handle row-level data issues. How you handle a failing row of data depends on the nature of the data and how it’s used downstream. If downstream systems and their users expect a clean, fully loaded data set, then halting the pipeline until issues with one or more rows of data are resolved may be necessary. But if downstream usage is more tolerant to incremental data-cleansing efforts, the data pipeline can handle row-level issues as exceptions and continue processing the other rows that have clean data.  
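
A minimal sketch of the two row-level policies described above, assuming a simple placeholder validation rule: either halt on the first bad row, or set bad rows aside as exceptions and keep processing the clean ones.

```python
# A minimal sketch of row-level validation with two handling policies.
def is_valid(row):
    # Placeholder rule: require an id and a non-negative amount.
    return row.get("id") is not None and row.get("amount", 0) >= 0

def validate(rows, halt_on_error=False):
    clean, exceptions = [], []
    for row in rows:
        if is_valid(row):
            clean.append(row)
        elif halt_on_error:
            # Downstream users expect a clean, fully loaded data set.
            raise ValueError(f"Invalid row, halting pipeline: {row}")
        else:
            # Downstream usage tolerates incremental cleansing: set the row
            # aside (e.g., for a data steward) and keep processing.
            exceptions.append(row)
    return clean, exceptions
```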

Many data-integration technologies have add-on data-stewardship capabilities. These let you route data exceptions to someone assigned as the data steward who knows how to correct the issue. These tools then allow the fixed rows of data to reenter the data pipeline and continue processing.

If you’re working in a data-streaming architecture, you have other options to address data quality while processing real-time data. Sanjeet Banerji, executive vice president and head of artificial intelligence and cognitive sciences at Datamatics, suggests that “built-in functions in platforms like Spark Streaming provide machine learning capabilities to create a veritable set of models for data cleansing.”
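
As a simpler, rule-based illustration of in-stream cleansing (not the machine-learning models Banerji describes), a Spark Structured Streaming job might drop incomplete records and filter out-of-range values as data arrives; the schema, paths, and rules below are placeholder assumptions.

```python
# A minimal PySpark sketch of rule-based cleansing on a streaming source.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("cleansing-sketch").getOrCreate()

# Hypothetical streaming source; schema and paths are placeholders.
raw = (spark.readStream
       .schema("id STRING, amount DOUBLE, ts TIMESTAMP")
       .json("/data/incoming/"))

# Simple cleansing rules: drop incomplete rows and negative amounts.
clean = (raw.na.drop(subset=["id", "amount"])
            .filter(F.col("amount") >= 0))

# Write the cleansed stream downstream; checkpointing tracks progress.
query = (clean.writeStream
         .format("parquet")
         .option("path", "/data/clean/")
         .option("checkpointLocation", "/data/checkpoints/clean/")
         .outputMode("append")
         .start())

query.awaitTermination()
```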

Establish a testing process to validate changes

At some point, you might be called on to make an enhancement to the data pipeline, improve its robustness, or refactor it to improve its performance. You’ll implement the required changes and then need to consider how to validate the implementation before pushing it to production.

What can go wrong? Plenty: You could inadvertently change filters and process the wrong rows of data, or your logic for processing one or more columns of data may have a defect.

Think about how to test your changes. One way of doing this is to have a stable data set to run through the pipeline. In a testing environment, run that data set through the production version of your pipeline once and through your new version a second time. You can then compare the output of the two runs and validate whether any differences in rows and columns of data are expected.
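
A minimal sketch of that comparison, assuming both pipeline versions write their output to CSV, that the rows share a key column, and that pandas 1.1 or later is available; the file names and key column are placeholders.

```python
# A minimal sketch of comparing output from the production pipeline and the
# modified pipeline on the same stable test data set.
import pandas as pd

baseline = pd.read_csv("output_production.csv").set_index("id").sort_index()
candidate = pd.read_csv("output_new_version.csv").set_index("id").sort_index()

# Rows present in one run but not the other.
missing = baseline.index.symmetric_difference(candidate.index)
print("Rows only in one output:", list(missing))

# Cell-level differences for rows present in both runs.
common = baseline.index.intersection(candidate.index)
diffs = baseline.loc[common].compare(candidate.loc[common])
print(diffs if not diffs.empty else "No column-level differences")
```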

Engineer data pipelines for varying operational requirements

Data pipelines may be easy to conceive and develop, but they often require some planning to support different runtime requirements.

First, consider that the data pipeline probably requires flexibility to support full data-set runs, partial data-set runs, and incremental runs. A full run is likely needed the first time the data pipeline is used, and it may also be required if there are significant changes to the data source or downstream requirements.

Sometimes, it is useful to do a partial data run. Maybe the data pipeline is processing transaction data and you are asked to rerun a specific year’s worth of data through the pipeline. A strong data pipeline should be able to reprocess a partial data set.

The steady-state of many data pipelines is to run incrementally on any new data. This implies that the data source or the data pipeline itself can identify and run on this new data.
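
A minimal sketch of supporting all three run modes from a single entry point, assuming a transactions table with event_date and updated_at columns; the table, column names, and the way the last-run watermark is supplied are illustrative assumptions.

```python
# A minimal sketch of selecting the data slice for full, partial, and
# incremental runs.
def build_filter(mode, last_run=None, start=None, end=None):
    """Return a WHERE clause that selects the slice of data to process."""
    if mode == "full":
        return "1 = 1"                                        # reprocess everything
    if mode == "partial":
        return f"event_date BETWEEN '{start}' AND '{end}'"    # e.g., one year
    if mode == "incremental":
        return f"updated_at > '{last_run}'"                   # only new or changed rows
    raise ValueError(f"Unknown run mode: {mode}")

# Example: rerun one year's worth of transactions through the pipeline.
clause = build_filter("partial", start="2017-01-01", end="2017-12-31")
query = f"SELECT * FROM transactions WHERE {clause}"
```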

Because data pipelines may have varying data loads to process and likely have multiple jobs running in parallel, it’s important to consider the elasticity of the underlying infrastructure. Running data pipelines on cloud infrastructure provides some flexibility to ramp up resources to support multiple active jobs. If your data-pipeline technology supports job parallelization, engineer your data pipelines to leverage this capability for full and partial runs that may have larger data sets to process.
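
A minimal sketch of job parallelization at the application level, assuming the pipeline's work can be split into independent partitions; the partition scheme and worker count are illustrative and would be tuned to the workload and the elasticity of the infrastructure.

```python
# A minimal sketch of running independent pipeline partitions in parallel.
from concurrent.futures import ThreadPoolExecutor, as_completed

def process_partition(partition):
    # Placeholder for one unit of pipeline work (e.g., one month of data).
    return f"processed {partition}"

partitions = [f"2018-{month:02d}" for month in range(1, 13)]

with ThreadPoolExecutor(max_workers=4) as pool:
    futures = {pool.submit(process_partition, p): p for p in partitions}
    for future in as_completed(futures):
        print(future.result())
```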
