
By Carol Jang & Jove Kuang


Just like how DevOps changed the way we develop software, DataOps is changing the way we create data products. While both are rooted in agile frameworks, DataOps can be more challenging to tackle because it involves everyone who works with data, and because the data itself keeps changing and therefore requires much more iteration.

By leveraging DevOps methodologies, companies and teams have achieved speed, quality, and flexibility in creating and maintaining software products. DataOps has those same goals in mind for teams building data products and employs a similar Delivery Pipeline. But the pipeline does differ in how build, test, and release are implemented, which we will dive into below.

 
Based on: https://aws.amazon.com/devops/what-is-devops/


Differences in the DataOps Delivery Pipeline

1. Build (Speed)

To build a new piece of software, you need a software developer (or a team of them) who is proficient in the relevant programming languages and who understands the purpose of the software in terms of features and functions. For example, let’s say you need a button that uploads documents. The engineer building this just needs to know the desired function. They don’t need to understand the contents of the document being uploaded or the business context of why it’s being uploaded in the first place. The developer’s responsibility is just to put the right code in place to make sure the button gets the document uploaded.

When building a new data product, it’s no longer about features and functions but about metrics and goals, which require not only proficiency in the relevant coding languages but also a deep understanding of the underlying data and business context. The contents of every data file, and how they’re used, become massively important! For this reason, data engineers can’t just build data products on their own; they must collaborate with data scientists and analysts to get the business context.

 
build.png
 

Let’s take a social media company that wants to compute its user lifetime value (LTV). Analysts and data scientists bring the business expertise to determine what data is necessary and request it from data engineering. The data engineers then prepare that data in a clean format, the data scientists and analysts apply their domain expertise to build the right models and analyses, and the data engineers orchestrate the infrastructure and configurations needed to run the jobs the data scientists created. The resulting user LTV output combines skills and knowledge from the data engineers, data scientists, and analysts to determine (illustrated by the rough code sketch after this list)

  • which data is relevant;

  • what code and tools are needed to connect, transform, and retrieve the data; and

  • how to extract valuable insights from the data to help the business.
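To make the hand-offs concrete, here is a minimal Python sketch of such a pipeline. The file name, column names, and the simple revenue-based LTV formula are hypothetical stand-ins for whatever the analysts and data engineers actually agree on, not a real model.

```python
import pandas as pd

# --- Data engineering: load and clean raw purchase events (hypothetical schema) ---
def prepare_purchase_events(path: str) -> pd.DataFrame:
    events = pd.read_csv(path, parse_dates=["event_time"])
    events = events.dropna(subset=["user_id", "revenue"])   # drop malformed rows
    events["revenue"] = events["revenue"].astype(float)
    return events

# --- Data science / analytics: compute a simple per-user lifetime value ---
def compute_user_ltv(events: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical definition: average revenue per active month, projected over an
    # assumed 24-month lifetime. The real definition comes from the analysts.
    monthly = (
        events.assign(month=events["event_time"].dt.to_period("M"))
              .groupby(["user_id", "month"])["revenue"]
              .sum()
              .reset_index()
    )
    avg_monthly = monthly.groupby("user_id")["revenue"].mean()
    ltv = (avg_monthly * 24).rename("estimated_ltv").reset_index()
    return ltv

if __name__ == "__main__":
    events = prepare_purchase_events("purchase_events.csv")  # hypothetical input file
    print(compute_user_ltv(events).head())
```

The point is less the arithmetic than the hand-offs: prepare_purchase_events is data engineering’s concern, compute_user_ltv encodes the analysts’ domain knowledge, and someone still has to schedule and run both.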

These constant pass-offs during build can hinder the speed of development. DataOps methodologies need to be designed with this required cross-team collaboration in mind to meet the speed demands for data across the business.

2. Test (Quality)

Developers test their software by observing whether the correct output is produced based on the given input. There is a clear cause and effect relationship to help determine the quality of the software. Based on the previous software example, if clicking the button uploads the selected document to the correct location, the test is a success.
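In code, that kind of test can be as simple as calling the function with a known input and asserting on the output. The upload handler and the in-memory storage below are hypothetical, just enough to show the clear input/output relationship:

```python
# Minimal sketch of testing a hypothetical upload handler: given a document,
# it should end up at the expected location with the expected contents.
def upload_document(storage: dict, name: str, contents: bytes) -> str:
    """Store the document and return the location it was written to."""
    location = f"/uploads/{name}"
    storage[location] = contents
    return location

def test_upload_document():
    storage = {}                                   # stand-in for a real object store
    location = upload_document(storage, "report.pdf", b"%PDF-1.7 ...")
    assert location == "/uploads/report.pdf"       # correct location returned
    assert storage[location] == b"%PDF-1.7 ..."    # contents actually stored
```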

 
test software.png
 

Now, if we were testing the user LTV calculation in the previous data product example, how would we determine success if the company has never computed its user LTV before? The test isn’t passed simply because the data product spits out a number. That number must be validated. This can be done by comparing the results to those of another data product that calculates the same metric using a different methodology, or via other cross-validation tests.
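One way to express that validation in code is to compute the metric twice, with independent methods, and check that the results agree within a tolerance the team decides is acceptable. The numbers and the 10% threshold below are hypothetical:

```python
# Sketch: validate a new LTV estimate against an independently computed reference.
def validate_ltv(ltv_new: float, ltv_reference: float, tolerance: float = 0.10) -> bool:
    """Return True if the two estimates agree within the given relative tolerance."""
    if ltv_reference == 0:
        return abs(ltv_new) <= tolerance
    relative_diff = abs(ltv_new - ltv_reference) / abs(ltv_reference)
    return relative_diff <= tolerance

ltv_from_model = 118.40    # hypothetical output of the new data product
ltv_from_history = 112.75  # hypothetical cross-check using a different methodology
assert validate_ltv(ltv_from_model, ltv_from_history), "LTV estimates disagree; investigate"
```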

 
test data product.png
 

The quality of the data product is unclear until there is other evidence to support its results. Oftentimes, more iterations are required to fine-tune the results as data changes and more evidence is uncovered. The ability to iterate and maintain quality is critical to the value a data product delivers.

3. Release (Flexibility)

release.png

When developers release new software, the deployment is deemed a success once all the features are live and in use. The software has reached the end of the Delivery Pipeline and enters the Feedback Loop, during which developers monitor and plan for the next update or version. The end users, on the other hand, simply enjoy the features and may or may not choose to provide feedback; their involvement in the Feedback Loop is indirect and voluntary.

With data products, the inputs and outputs are constantly changing; just because a data product worked during test and release doesn’t mean it will continue to work. Formats change, new columns get added, labeling shifts over time. As the data inevitably changes, so should the data product. And as the business becomes smarter, so should the data product -- possibly by adding other data points to refine its analyses. The individuals who best understand all these changes also happen to be the end users: data scientists and analysts. Thus, in DataOps, end users’ involvement in the Feedback Loop is direct and mandatory; they must work closely with data engineers to ensure the data products continue to improve and run smoothly.
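A common, lightweight way to catch those changes is a schema check that runs before the data product does, so the team learns about a new or missing column from an alert rather than from a broken metric. The expected column list below is hypothetical:

```python
import pandas as pd

# Sketch: fail fast when incoming data no longer matches the schema the data
# product was built against. The expected columns are hypothetical placeholders.
EXPECTED_COLUMNS = {"user_id", "event_time", "revenue"}

def check_schema(events: pd.DataFrame) -> None:
    actual = set(events.columns)
    missing = EXPECTED_COLUMNS - actual
    unexpected = actual - EXPECTED_COLUMNS
    if missing:
        raise ValueError(f"Input data is missing expected columns: {sorted(missing)}")
    if unexpected:
        # New columns are not necessarily errors -- they may be the next feature
        # the analysts want -- but someone should know they appeared.
        print(f"Warning: input data has new columns: {sorted(unexpected)}")
```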

In the previous data product example, if the data analytics team at the social media company learned that a user’s number of active followers has a significant impact on their LTV, they will want to add this variable to the user LTV calculation. In this case, they will work closely with the data engineering team to make sure the right data needed to calculate the number of active followers per user is added to the data product.
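Continuing the hypothetical LTV sketch from the Build section, that kind of change might be as small as one more prepared input and one more term in the calculation. The follower table, column names, and adjustment factor here are made up purely for illustration:

```python
import pandas as pd

# Sketch: blend a hypothetical active-follower signal into an existing per-user LTV table.
def add_follower_adjustment(ltv: pd.DataFrame, followers: pd.DataFrame) -> pd.DataFrame:
    # ltv:       columns ["user_id", "estimated_ltv"]      (output of the earlier sketch)
    # followers: columns ["user_id", "active_followers"]   (newly prepared by data engineering)
    merged = ltv.merge(followers, on="user_id", how="left").fillna({"active_followers": 0})
    # Hypothetical adjustment agreed with the analysts: +1% LTV per 100 active followers.
    merged["estimated_ltv"] = merged["estimated_ltv"] * (1 + 0.01 * merged["active_followers"] / 100)
    return merged
```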

From a data engineering perspective, without DataOps, changes like this can involve lengthy re-designs, re-models, major reprocessing, and backfilling of the data product. With DataOps, there’s an understanding that things change; a data product is never truly “complete” and flexibility is a must-have for it to continue being useful to the business.

To summarize, DataOps and DevOps are different because the former optimizes for data products and the latter optimizes for software. Building a successful data product requires different areas of expertise and, therefore, cross-team collaboration. Testing data products requires finding evidence to validate the results and fine-tuning to achieve the right results. And, lastly, releasing data products is an ongoing, collaborative effort because the data will change and the data product must adapt to accommodate those changes.

But that’s just our two cents. What differences have you experienced when implementing DataOps vs. DevOps? Care to share?