Pipeline Optimization Considerations
By Siddharth Panicker
In this article, we’re going to dig into what DataOps means to a data engineer as it relates to optimizing data pipelines across an enterprise.
Optimizing a data pipeline at its inception is already fairly complex. The challenge grows when these optimizations need to be updated and maintained, not just to serve the volume, velocity, and variety of data for your initial pipeline, but also for new pipelines that need to be created to provision new datasets for analytics and data science teams.
So how do you tackle these challenges? Let’s take a deeper look at some optimization techniques and examples from a DataOps mindset that you can relate to from your experience as a data engineer or from working with data engineers.
Let’s start with the optimizations you as a data engineer would need to perform for data to be read into or written by a pipeline.
Data formats, Compression and Data Type Conversions:
What are the formats of data and types of compression that the pipeline can support out of the box or with minimal code?
We often see data engineers writing manual or custom parsing routines to support common data formats and compression and to perform data type conversions. Doing this for every pipeline can be highly error-prone and time-consuming. Embracing automation for these parsing routines is critical to keep pipelines running and accurate.
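As a rough illustration of what automating these routines can look like, here is a minimal Python sketch that picks a decompressor and a parser from the file name and then applies declared type conversions. The registries, the function names (read_records, cast_types) and the supported extensions are assumptions made for this example, not part of any particular product.

```python
import bz2
import csv
import gzip
import json
from pathlib import Path

# Hypothetical registry: compression extension -> function that opens the file.
OPENERS = {".gz": gzip.open, ".bz2": bz2.open}


def parse_csv(handle):
    """Parse delimited text with a header row into a list of dicts."""
    return list(csv.DictReader(handle))


def parse_jsonl(handle):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in handle if line.strip()]


# Hypothetical registry: format extension -> parser.
PARSERS = {".csv": parse_csv, ".jsonl": parse_jsonl}


def read_records(path):
    """Read records, inferring compression and format from the file name."""
    path = Path(path)
    suffixes = path.suffixes
    opener = open
    if suffixes and suffixes[-1] in OPENERS:
        opener = OPENERS[suffixes[-1]]
        suffixes = suffixes[:-1]
    if not suffixes or suffixes[-1] not in PARSERS:
        raise ValueError(f"unsupported format: {path.name}")
    parser = PARSERS[suffixes[-1]]
    with opener(path, "rt", encoding="utf-8", newline="") as handle:
        return parser(handle)


def cast_types(records, schema):
    """Apply declared conversions (column name -> callable) to every record."""
    return [
        {name: schema[name](value) if name in schema else value
         for name, value in record.items()}
        for record in records
    ]
```

With something like this in place, a call such as cast_types(read_records("orders.csv.gz"), {"id": int, "amount": float}) replaces a hand-written parser, and supporting a new format means adding one entry to a registry rather than touching every pipeline.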
Optimize for speed of ingress/egress:
Does the pipeline utilize information about the structure and storage of the data at the source and target to optimize for speed of ingress/egress?
For example, when deciding if a record needs to be processed or not, does it analyze metadata at the source object (such as a file, table or data partition) level or at the row level? If data is being read from a proprietary system such as an RDBMS or an MPP database, is the pipeline leveraging the technical advantages provided by these systems, such as parallelism, sharding, out-of-the-box change data capture, and native connectivity? Optimizing for speed of ingress/egress often has the most impact on the overall probability of pipelines meeting SLAs for data delivery to downstream systems.
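To make the object-level versus row-level distinction concrete, the sketch below decides which source files need processing by looking only at file metadata (the modification time) rather than opening each file and inspecting its rows. The function name and the idea of keeping the last run time as an epoch timestamp are assumptions for illustration; a table or partition in a database would expose comparable metadata through its own catalog.

```python
import os


def files_to_process(directory, last_run_epoch):
    """Select files changed since the last run using object-level metadata only.

    No file is opened, so unchanged files cost one stat call each
    instead of a full row-by-row scan.
    """
    selected = []
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_file() and entry.stat().st_mtime > last_run_epoch:
                selected.append(entry.path)
    return sorted(selected)
```

Row-level checks may still be needed inside the files that are selected, but they then run over a much smaller set of objects.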
For large initial loads or backfills:
While building a pipeline, there is usually a series of transformations performed on the data such as filtering, aggregating, sampling, joining and windowing. These transformations need to be optimized for the context of the data structures and the business logic. At this step, you need to perform several tasks:
Design partitioning strategies
Optimize transformation design
Filter data early on in the pipeline to reduce overall data movement
Use the right data types for intensive operations
Project forward only the necessary columns
Redistribute data across partitions to ensure both performance and accuracy of the results
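Several of the tasks above can be sketched in a few lines of plain Python. The example below filters first, forwards only the needed columns, redistributes rows by key so that every row for a given key lands in the same partition, and converts the amount to a numeric type before aggregating. All function names and the sample column names are hypothetical; a real pipeline would express the same steps in its processing engine.

```python
import zlib
from collections import defaultdict


def filter_rows(rows, predicate):
    """Drop rows as early as possible so later steps move less data."""
    for row in rows:
        if predicate(row):
            yield row


def project(rows, columns):
    """Forward only the columns that later steps actually need."""
    for row in rows:
        yield {name: row[name] for name in columns}


def partition_by_key(rows, key, num_partitions):
    """Redistribute rows so that equal keys always share a partition.

    crc32 is used because it is stable across processes, unlike the
    built-in hash() for strings.
    """
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        index = zlib.crc32(str(row[key]).encode("utf-8")) % num_partitions
        partitions[index].append(row)
    return partitions


def total_by_key(rows, key, value):
    """Aggregate within one partition, converting the value to float first."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += float(row[value])
    return dict(totals)
```

Because each key lives in exactly one partition, the per-partition totals can be merged without double counting, which is the accuracy concern behind redistribution.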
For incremental loads:
In addition to the optimizations developed for the backfills, a separate set of optimizations needs to be built for incremental updates to the dataset. A few examples are:
Storing and using column statistics to reduce the amount of data scanned by transformations that are rerun on every incremental update to the dataset.
Partitioned joins that restrict the data going into a join transformation to the updated data only.
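Both examples can be illustrated with a small sketch, assuming each partition is a list of dict rows and that an "updated" style column acts as the watermark. The function names and data layout are hypothetical and chosen only for this example.

```python
def collect_stats(partitions, column):
    """Store the min and max of a column for each non-empty partition."""
    return {
        name: (min(row[column] for row in rows), max(row[column] for row in rows))
        for name, rows in partitions.items()
        if rows
    }


def partitions_to_scan(stats, watermark):
    """Use stored statistics to skip partitions with nothing newer than the watermark."""
    return sorted(name for name, (_, high) in stats.items() if high > watermark)


def join_changed(changed, facts, dimension, key):
    """Join only the rows of the changed partitions against a dimension lookup."""
    lookup = {row[key]: row for row in dimension}
    joined = []
    for name in changed:
        for row in facts[name]:
            match = lookup.get(row[key])
            if match is not None:
                joined.append({**row, **match})
    return joined
```

The statistics only rule partitions out. A partition that passes the check can still hold older rows, so a row-level filter on the watermark may still be applied inside it.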
Embracing repeatable and consistent processes to optimize transformations contributes the most to data engineering efficiency, because this optimization work is time-intensive and, sadly, most of it tends to be highly contextual and specific to each pipeline.
For Independent Pipeline Deployments:
We’re going to assume your choice has been made about on-prem, cloud, data processing platform, and other physical considerations for compute. When deploying a pipeline for the first time, you decide the number of compute instances, memory, storage, cache, data sharding, and other resource-related settings based on the volume of data and the types of transformations. Usually this is done through trial and error before arriving at a configuration that allows the pipeline to complete in a reasonable amount of time and at a reasonable cost. This becomes an ongoing activity for pipelines that are frequently updated, both in terms of logic and through the addition of new datasets.
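One way to make that trial-and-error loop repeatable is to record each trial and pick the configuration programmatically. The sketch below assumes you have already measured runtime and cost for a handful of candidate configurations; the Trial fields and the selection rule (cheapest configuration that finishes within a time limit) are illustrative assumptions, and the right rule depends on your own SLAs and budget.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    """One measured run of the pipeline under a candidate configuration."""
    workers: int
    memory_gb: int
    runtime_minutes: float
    cost_usd: float


def pick_configuration(trials, max_runtime_minutes):
    """Return the cheapest trial that meets the runtime limit, or None."""
    eligible = [t for t in trials if t.runtime_minutes <= max_runtime_minutes]
    if not eligible:
        return None
    # Break cost ties in favor of the faster run.
    return min(eligible, key=lambda t: (t.cost_usd, t.runtime_minutes))
```

Keeping the trial records around also helps when the pipeline logic or data volume changes, since the same selection can be rerun against fresh measurements.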
For Multi-tenant Deployments:
Once a strategy has been developed to optimize independent pipelines for resources, your organization’s data platform will usually have several teams deploying these pipelines. If not well planned, the ensuing resource contention can cause disruption and delivery issues across these teams. In this case, a separate strategy for optimizing resources will need to be developed factoring in each pipeline’s criticality to the business, data freshness needs for the business teams, service level requirements, and overall compute budgets assigned to individual teams or business units as part of an overall IT infrastructure strategy.
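As a simplified picture of such a strategy, the sketch below grants compute to pipelines in order of business criticality until a shared budget is used up and defers whatever does not fit. The request format, the priority numbering and the greedy rule are assumptions for illustration only; production schedulers typically also account for freshness deadlines and per-team quotas.

```python
def schedule(requests, budget_units):
    """Grant compute by priority until the shared budget runs out.

    requests: list of (pipeline_name, priority, units_needed) tuples,
    where a lower priority number means more critical to the business.
    Returns (granted_names, deferred_names).
    """
    granted, deferred = [], []
    remaining = budget_units
    for name, _priority, units in sorted(requests, key=lambda r: (r[1], r[0])):
        if units <= remaining:
            granted.append(name)
            remaining -= units
        else:
            deferred.append(name)
    return granted, deferred
```

Note that this greedy rule lets a small, low-priority pipeline run when a larger, higher-priority one does not fit in what remains. Whether that is acceptable is exactly the kind of policy decision the strategy needs to settle.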
A mature DataOps strategy should have processes defined to build the aforementioned optimizations with repeatability, speed, consistency, and quality to improve the time-to-market for data products needed by the business. Have you seen any interesting pipeline optimizations as part of a DataOps strategy that we missed? We’d love to hear from you.