As a continuation of our Data Literacy Series, this time we tackle Data Lineage.
So what is Data Lineage? Why does it matter? And how do you get started with Data Traceability & Lineage and pick the right tool?
As data moves to the heart of modern decision-making, companies are striving to collect more and more of it in an effort to drive innovation and improve processes and operations. As a result, many invest in several solutions, each with multiple entry points and transformation rules applied as the data flows into the system. But at what point does it become too much of a good thing? How do you make sure the data is helping your teams instead of overwhelming them? How do you ensure proper lineage and traceability of your data?
Data Lineage refers to the process of mapping the journey of your data from its origin, through the different transformations it undergoes and the multiple processes it flows through, to where it ultimately goes and the areas it feeds into.
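One way to picture this journey is as a directed graph: datasets are the nodes, and each transformation is an edge from a source to a target. The snippet below is a minimal sketch of that idea, not the way any particular lineage tool works. All dataset and transformation names (crm.orders_raw, deduplicate and so on) are made up for illustration.

```python
from collections import defaultdict, deque

# Hypothetical lineage edges: (source dataset, transformation, target dataset).
EDGES = [
    ("crm.orders_raw", "deduplicate", "staging.orders"),
    ("web.events_raw", "sessionize", "staging.sessions"),
    ("staging.orders", "join_sessions", "marts.order_funnel"),
    ("staging.sessions", "join_sessions", "marts.order_funnel"),
    ("marts.order_funnel", "aggregate_daily", "dashboards.sales"),
]


def build_downstream(edges):
    """Index the edges so each dataset maps to the datasets it feeds."""
    graph = defaultdict(list)
    for source, transformation, target in edges:
        graph[source].append((transformation, target))
    return graph


def downstream(graph, start):
    """Return every dataset fed by `start`, in breadth-first order."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for _transformation, target in graph.get(node, []):
            if target not in seen:
                seen.add(target)
                order.append(target)
                queue.append(target)
    return order


GRAPH = build_downstream(EDGES)
print(downstream(GRAPH, "crm.orders_raw"))
```

Walking the graph from an origin answers "where does this data ultimately go?". Walking the same edges in reverse answers "where did this data come from?".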
Typically, Data Lineage provides answers to the following questions:
Let’s take a pretty timely illustrative example.
To trace COVID-19 variants of concern (VOC), scientists monitor virus mutations as they are passed down through a lineage, which in this case is a branch of the viral family tree. Take the B.1.1.7 lineage of SARS-CoV-2, also known as the UK variant: studies have identified over 7 mutations within the lineage that are responsible for making it 30 to 50 percent more infectious than other variants in circulation today.
Understanding the possible mutations in the journey of the variant is important to public health and to the effectiveness of vaccination programs. In the same way, understanding the origin of the data and the different transformations it undergoes is instrumental in a data-driven business approach.
Let’s take another example just to be sure.
Remember Brian from our Data Quality post? Sophie, a data engineer in the same company, gets alerted by Brian about the duplication issue and starts investigating. While trying to trace back the order fulfillment process to identify the root cause, she realises how complex the dependencies between data sets are, and feels like she's going to spend hours figuring it out. Sophie was under immense pressure from multiple stakeholders at that time (operations, finance, leadership, the whole crew!), which didn't help. If Sophie had had a visual representation of the overall flow of data, with all the vital pieces and the dependencies shown clearly, she would have spent significantly less time getting to the bottom of the issue.
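To make Sophie's situation concrete, here is a rough sketch of what tracing a report back to its sources could look like once the dependencies are recorded somewhere. The dependency map is entirely hypothetical (the original story does not name any data sets), and a real lineage tool would build it automatically rather than by hand.

```python
# Hypothetical map: each dataset -> the datasets it is built from.
UPSTREAM = {
    "reports.fulfillment": ["marts.orders_enriched"],
    "marts.orders_enriched": ["staging.orders", "staging.customers"],
    "staging.orders": ["erp.orders_raw", "shop.orders_raw"],
    "staging.customers": ["crm.customers_raw"],
}


def trace_upstream(dataset, upstream, depth=0, seen=None):
    """Return an indented, depth-first listing of everything `dataset` depends on."""
    seen = set() if seen is None else seen
    lines = ["  " * depth + dataset]
    if dataset in seen:
        # Already expanded elsewhere in the tree; list it but do not recurse again.
        return lines
    seen.add(dataset)
    for parent in upstream.get(dataset, []):
        lines.extend(trace_upstream(parent, upstream, depth + 1, seen))
    return lines


def root_sources(dataset, upstream):
    """Return the datasets with no recorded parents that ultimately feed `dataset`."""
    stack, seen, roots = [dataset], set(), set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        parents = upstream.get(node, [])
        if not parents:
            roots.add(node)
        stack.extend(parents)
    return sorted(roots)


print("\n".join(trace_upstream("reports.fulfillment", UPSTREAM)))
print(root_sources("reports.fulfillment", UPSTREAM))
```

In this made-up map, the listing shows at a glance that staging.orders is fed by two raw order sources, which is the kind of place where one would start looking for duplicates.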
So how exactly does having automated Data Lineage help your company?
We can now agree that Data Lineage is paramount to the success of a data-driven business. But where do you start, you ask?
Here are our 3 tips to help you decide which tool to pick.
And finally, I would strongly suggest you invest in data quality and monitoring. Lineage won't work if you have bad data in the first place.
Get in touch if you want to learn more: firstname.lastname@example.org or email@example.com.