The Data Literacy Series - Data Quality

Post by
Salma Bakouk

Because it’s 2021 and everyone cares about data. This series is aimed at helping everyone (yes, everyone) get accustomed to some of the key concepts of a modern data stack and why it is so important to cover the bases before you start investing in fancy data analytics tools.

We start the series with what we think is the most important challenge for a data-driven company: Data Quality. If you’re struggling to rely on and trust your data, the dashboards and all the insights generated from your analytics are void.

So what is Data Quality? Why is it important? How do you measure it? And what are 5 best practices?

Data Quality - as the name suggests - refers to a measure of the health state of your data. Much like in manufacturing or in the service industries, bad data quality = bad business. Let me give you an example to help you understand.

Brian, a data engineer, works at a grocery delivery service. During the first pandemic lockdown, the business was thriving. One day, despite his best efforts at “manually” making sure the data pipelines were error-free, the unforeseeable happened. Data was accidentally duplicated, which resulted in duplicated order items, which in turn led to customers receiving double the quantity they ordered. Brian only found out when customers started reporting the issue, and by then he was a couple of dozen orders too late… The company ultimately took a loss of several thousand dollars from this incident, which was pretty significant given the size of their business.

Unreliable data can quickly become a massive source of pain and can be detrimental to the business. Think missed opportunities, financial costs, customer dissatisfaction, failure to achieve regulatory compliance and inaccurate decision making, to cite just a few.

Maybe you can relate to Brian (I know I can), or maybe you’ve been getting by with the help of multiple fixes and testing tools kindly developed by your team of data engineers to detect “recurrent” issues, emphasis on recurrent (OK, but definitely not scalable). One thing we can both agree on: if you can’t trust your data to make sound business decisions, you’ve got a serious problem.

So how exactly do you know if your data isn't trustworthy? Much like in relationships, you start having problems and asking a lot of questions that go unanswered: Is my data fresh? Is my data complete? Are my field values as expected? Why do I have “null” here? Where can I find the data? How was the data computed? And so on. These questions map onto five dimensions you can use to measure data quality:

  • Accuracy: This goes without saying; the data needs to be correct, duplicate-free and in the expected format.
  • Completeness: No missing values, no missing data records, no incomplete pipelines.
  • Freshness / Timeliness: The data should be up to date.
  • Relevance: The data should be relevant and intended for a business use.
  • Consistency: The data should be in the expected format and homogeneous across the organization.
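To make a few of these dimensions concrete, here is a minimal Python sketch that turns three of them into numbers: a duplicate rate (one slice of accuracy), a completeness ratio and a freshness flag. Everything in it is an assumption made for illustration: the `quality_report` helper, the field names (`order_item_id`, `sku`, `qty`, `loaded_at`) and the one-hour freshness threshold are hypothetical, and the sample rows are made up to mimic the kind of duplication Brian ran into.

```python
from datetime import datetime, timedelta, timezone


def quality_report(rows, key, required, ts_field, max_age, now):
    """Compute simple quality metrics for a list of dict rows (illustrative only)."""
    total = len(rows)

    # Accuracy (duplicates only): share of rows whose key was already seen.
    seen, duplicates = set(), 0
    for row in rows:
        k = row.get(key)
        if k in seen:
            duplicates += 1
        seen.add(k)

    # Completeness: share of required cells that are present and not None.
    filled = sum(1 for row in rows for f in required if row.get(f) is not None)
    cells = total * len(required)

    # Freshness: is the most recent record younger than max_age?
    timestamps = [row[ts_field] for row in rows if row.get(ts_field) is not None]
    newest = max(timestamps) if timestamps else None

    return {
        "duplicate_rate": duplicates / total if total else 0.0,
        "completeness": filled / cells if cells else 1.0,
        "is_fresh": newest is not None and now - newest <= max_age,
    }


# Hypothetical sample data: the first order item was loaded twice.
now = datetime(2021, 6, 1, 12, 0, tzinfo=timezone.utc)
orders = [
    {"order_item_id": 1, "sku": "milk", "qty": 2, "loaded_at": now - timedelta(minutes=5)},
    {"order_item_id": 1, "sku": "milk", "qty": 2, "loaded_at": now - timedelta(minutes=5)},
    {"order_item_id": 2, "sku": "eggs", "qty": None, "loaded_at": now - timedelta(minutes=3)},
    {"order_item_id": 3, "sku": "bread", "qty": 1, "loaded_at": now - timedelta(minutes=1)},
]

report = quality_report(
    orders,
    key="order_item_id",
    required=["sku", "qty"],
    ts_field="loaded_at",
    max_age=timedelta(hours=1),
    now=now,
)
print(report)
```

A check like this could run at the end of a pipeline and raise an alert when the duplicate rate is above zero, rather than waiting for customers to report the problem. Relevance and consistency are harder to reduce to a single number and are left out of this sketch.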

Now you ask, how can you prevent bad data quality from undermining your business?

Here are 5 best practices:

  1. Establish rigorous control of incoming data and introduce data profiling: In most cases, data quality issues start even before the data is used, as the data is often received from another organization or even from third-party software.
  2. Have a clear definition of your business needs and KPIs to make sure the data produced meets the business requirements. An important aspect of data quality is its relevance. Make sure you monitor the data that matters.
  3. Evangelise data quality within your teams: Data quality needs to be cultural and respected “religiously”. Define quality testing metrics and assign ownership over them for each of your projects.
  4. Invest in the right tools to monitor quality at scale. Short-term fixes will only tackle a certain scope of issues. Data quality issues are as important as software bugs: you can’t just invest in infrastructure, you need to invest in maintenance as well.
  5. And finally, think of integrating Data Lineage and pipeline traceability to save time and effort when troubleshooting.
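As an illustration of the first practice, the sketch below profiles an incoming batch and splits it into accepted and rejected rows before anything reaches your pipelines. It is only one possible shape for such a control: the `profile` and `split_batch` helpers, the rule names and the `sku` / `qty` fields are all hypothetical, and in practice you would derive the rules from your own business requirements.

```python
def profile(rows):
    """Per-field profile (null count, distinct non-null values) for a list of dict rows."""
    fields = sorted({f for row in rows for f in row})
    summary = {}
    for f in fields:
        values = [row.get(f) for row in rows]
        non_null = [v for v in values if v is not None]
        summary[f] = {
            "nulls": len(values) - len(non_null),
            "distinct": len(set(non_null)),
        }
    return summary


def split_batch(rows, rules):
    """Separate rows that pass every rule from rows that fail at least one."""
    accepted, rejected = [], []
    for row in rows:
        failed = [name for name, check in rules.items() if not check(row)]
        if failed:
            rejected.append((row, failed))  # keep the reasons for troubleshooting
        else:
            accepted.append(row)
    return accepted, rejected


# Hypothetical rules for an order-items feed from a third party.
RULES = {
    "sku_present": lambda r: bool(r.get("sku")),
    "qty_positive_int": lambda r: isinstance(r.get("qty"), int) and r["qty"] > 0,
}

batch = [
    {"sku": "milk", "qty": 2},
    {"sku": "eggs", "qty": None},
    {"sku": None, "qty": -1},
]

batch_profile = profile(batch)
accepted, rejected = split_batch(batch, RULES)

print(batch_profile)
for row, reasons in rejected:
    print("rejected", row, "because", reasons)
```

Keeping the failed rule names next to each rejected row is a small step towards the traceability mentioned in the last practice: when something breaks, you can see which check caught it and on which record.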

Get in touch if you want to learn more, or