The Data Literacy Series - Data Lineage

Post by
Salma Bakouk

As a continuation of our Data Literacy Series, this time we tackle Data Lineage.

So what is Data Lineage? Why does it matter? And how do you get started with Data Traceability & Lineage and pick the right tool?

As data moves to the heart of modern decision-making, companies are striving to collect more and more of it to drive innovation and improve processes and operations. As a result, many invest in several solutions, with multiple entry points and transformation rules applied as the data flows into the system. But at what point does it become too much of a good thing? How do you make sure the data is helping your teams instead of overwhelming them? How do you ensure proper lineage and traceability of your data?

Data Lineage refers to the process of mapping the journey of your data from its origin, through the different transformations it undergoes and the multiple processes it flows through, to where it ultimately goes and the areas it feeds into. 

Typically, Data Lineage answers the following questions:

  • Where does the data come from? Where is it located? And what does it feed into?
  • Who created it? At what point? And what was the purpose?
  • What information does the data contain? 
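
These questions can be captured as metadata attached to each data set. Here is a minimal sketch in Python, assuming a hypothetical `LineageRecord` structure. The field names and sample values are purely illustrative and do not come from any particular lineage tool:

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class LineageRecord:
    """Hypothetical metadata record answering the lineage questions for one data set."""
    name: str
    location: str            # where is it located?
    sources: List[str]       # where does it come from?
    feeds_into: List[str]    # what does it feed into?
    created_by: str          # who created it?
    created_on: date         # at what point?
    purpose: str             # what was the purpose?
    columns: List[str] = field(default_factory=list)  # what information does it contain?


# Illustrative values only.
orders_clean = LineageRecord(
    name="orders_clean",
    location="warehouse.sales.orders_clean",
    sources=["orders_raw"],
    feeds_into=["fulfilment_report"],
    created_by="data-engineering",
    created_on=date(2021, 1, 15),
    purpose="Deduplicated orders for reporting",
    columns=["order_id", "customer_id", "amount"],
)
```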

Let’s take a pretty timely illustrative example.

To trace COVID-19 variants of concern (VOC), scientists monitor virus mutations as they are passed down through a lineage, which in this case is a branch of the viral family tree. Take B.1.1.7, the SARS-CoV-2 lineage also known as the UK variant: studies have identified over 7 mutations within the lineage that are responsible for making it 30 to 50 percent more infectious than other variants in circulation today.

Understanding the possible mutations in the journey of the variant is important to public health and to the effectiveness of vaccination programs. In the same way, understanding the origin of the data and the different transformations it undergoes is instrumental in a data-driven business approach.

Let’s take another example just to be sure.

Remember Brian from our Data Quality post? Sophie, a data engineer in the same company, gets alerted by Brian about the duplication issue and starts investigating. While tracing back the order fulfillment process to identify the root cause, she realises how complex the dependencies between data sets are, and feels like she's going to spend hours trying to figure it out. Sophie was under immense pressure from multiple stakeholders at the time (operations, finance, leadership, the whole crew!), which didn't help. Now imagine Sophie had a visual representation of the overall flow of data, with all the vital pieces and the dependencies shown clearly: she would have spent significantly less time getting to the bottom of the issue.
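
The kind of map Sophie needed can be sketched as a dependency graph that is walked backwards from the affected data set. Below is a minimal illustration in Python. The data set names and the `UPSTREAM` mapping are invented for this example and are not taken from Sophie's actual systems:

```python
from collections import deque

# Hypothetical map: each data set -> the data sets it is built from.
UPSTREAM = {
    "fulfilment_report": ["orders_clean", "shipments"],
    "orders_clean": ["orders_raw"],
    "shipments": ["carrier_feed"],
    "orders_raw": [],
    "carrier_feed": [],
}


def trace_upstream(dataset, upstream=UPSTREAM):
    """Return every data set that `dataset` depends on, nearest first (breadth-first)."""
    seen = []
    queue = deque(upstream.get(dataset, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already visited via another path
        seen.append(current)
        queue.extend(upstream.get(current, []))
    return seen


print(trace_upstream("fulfilment_report"))
# ['orders_clean', 'shipments', 'orders_raw', 'carrier_feed']
```

Walking the same map in the opposite direction gives impact analysis: which reports are affected when a given source changes.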

So how exactly does automated Data Lineage help your company?

  • Business Impact: Having a visual mapping of your data allows your teams to work together efficiently and can help drive better business decisions, by allowing everyone to be part of the data journey and understand its implications for each department.

  • Data Privacy Regulations & Data Governance: Effective data lineage allows your organisation to conduct efficient audits and be better positioned to comply with recent regulations and risk management requirements, such as GDPR and US HIPAA, and with industry requirements such as PCI DSS, BCBS 239 and MiFID II for financial institutions.

  • Improved Processes & Operations: Data lineage streamlines error detection and resolution, assists with impact analysis and helps improve software systems overall.

So we can now agree that Data Lineage is paramount to the success of a data-driven business. But where do you start, you ask?

Here are our 3 tips to help you pick the right tool.

  1. It goes without saying: your tool needs to integrate easily within your data stack, without requiring additional heavy lifting from your engineers. It needs to be your ally and not an additional burden.


  2. Your lineage tool needs to be user friendly, and not just for your data engineers. The whole point of lineage is to make sure everyone in your organisation gets the full picture around data, so it is as important for it to be comprehensible to Data Scientists and Data Analysts as it is to Data Engineers.

  3. Think of your lineage in 3 pillars: {Sources, Transformations, Targets}. It needs to be computed end-to-end, from data sources through data pipelines to your data storage.
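
One way to picture the three pillars is to classify each node of a lineage graph by its position in the flow. Here is a rough sketch in Python, assuming lineage is stored as a list of (upstream, downstream) edges and treating intermediate data sets as stand-ins for the transformations that produce them. The edge list and the function name `classify_pillars` are made up for illustration:

```python
# Hypothetical lineage edges: (upstream, downstream).
EDGES = [
    ("crm_export", "clean_customers"),
    ("orders_raw", "clean_orders"),
    ("clean_customers", "customer_orders"),
    ("clean_orders", "customer_orders"),
    ("customer_orders", "sales_dashboard"),
]


def classify_pillars(edges):
    """Split nodes into sources (no inputs), targets (no outputs) and transformations (both)."""
    has_input = {dst for _, dst in edges}
    has_output = {src for src, _ in edges}
    return {
        "sources": sorted(has_output - has_input),
        "transformations": sorted(has_input & has_output),
        "targets": sorted(has_input - has_output),
    }


pillars = classify_pillars(EDGES)
print(pillars["sources"])  # ['crm_export', 'orders_raw']
print(pillars["targets"])  # ['sales_dashboard']
```

If a tool only sees part of the stack, some of these nodes would be missing from the edge list, and the picture would no longer be end-to-end.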

And finally, I would strongly suggest you invest in data quality and monitoring. Lineage won't work if you have bad data in the first place.

Get in touch if you want to learn more.