By Andy Petrella, Founder & CPO, Kensu
Most of us have experienced data quality issues at some point. They manifest as broken dashboards, automated reports that don’t add up, or AI models that make nonsensical predictions. Sometimes a reader with a keen eye for detail, scrutinizing a complex data plot or report, spots something that doesn’t make sense. Unfortunately, though, most data issues are never noticed, and the corrupted information continues to be published, consumed, and acted upon.
Data quality issues have existed for as long as data has been collected, and as the volume of collected data grows, the problem only gets worse.
The road to better data
Knowing that there is a data issue is just the first step. Next comes tracing it to its root cause and then addressing it. Locating and addressing data quality issues takes time because many stakeholders are involved: those who collect the data, those who process it, and those who consume it.
Systematically going through the lines of code of all the applications that make up the data pipeline takes time, effort, and patience. Sometimes the root cause is found and the issue addressed; other times quick fixes are put in place, only to cause more problems later.
A solution to these issues is to include ‘observability’ in the data pipeline, which gives users a complete view of the health of their data pipelines.
This is achieved by automating the tasks of:
- Monitoring the quality of data within data pipelines.
- Recording where the data comes from.
- Logging how the applications in the pipeline process it.
- Recording where the data is used.
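As a rough illustration, the four automated tasks above can be sketched as a thin wrapper around a pipeline step. This is a minimal sketch, not any particular vendor’s API; the metric names and record structure are assumptions:

```python
import json
import time

def observe_step(step_name, inputs, transform):
    """Run one pipeline step while recording basic observability metadata."""
    started = time.time()
    output = transform(inputs)                      # the actual processing
    record = {
        "step": step_name,                          # which application ran
        "sources": sorted(inputs.keys()),           # where the data comes from
        "row_count": len(output),                   # a simple quality metric
        "null_count": sum(v is None for v in output),
        "duration_s": round(time.time() - started, 3),
    }
    print(json.dumps(record))                       # ship to a log/metric store
    return output

# Usage: a toy step that reads one source and drops empty values
raw = {"orders": [1, 2, None, 4], "refunds": [1]}
clean = observe_step("clean_orders", raw,
                     lambda d: [x for x in d["orders"] if x is not None])
```

In a real deployment the emitted records would go to a metric store rather than stdout, so the history of each metric can be compared across runs.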
Observability solutions also allow rules to be set that will trigger alerts when something unexpected happens along the data pipeline. This allows data teams to address issues in a timely manner before they propagate through to reports, dashboards, or models.
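A rule of this kind can be as simple as a threshold check on a recorded metric. The sketch below is hypothetical; production observability tools typically express such rules declaratively and learn the expected ranges from historical runs:

```python
def check_rules(metrics, rules):
    """Return an alert for every metric that falls outside its expected range."""
    alerts = []
    for name, (low, high) in rules.items():
        value = metrics.get(name)
        if value is None or not (low <= value <= high):
            alerts.append(f"ALERT: {name}={value!r} outside [{low}, {high}]")
    return alerts

# Expected ranges derived from the pipeline's historical behaviour (assumed values)
rules = {"row_count": (900, 1100), "null_ratio": (0.0, 0.01)}
metrics = {"row_count": 421, "null_ratio": 0.0}

for alert in check_rules(metrics, rules):
    print(alert)   # surfaces the drop in row count before a dashboard breaks
```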
All this ‘observability’ information and infrastructure empowers those responsible for data pipelines, typically data engineers, to deliver quality data on time. If there is a data issue, the data engineers are the first to know about it, not the consumers of reports, dashboards, or models.
Data observability helps address, in no particular order, the top four causes of data issues:
- Inconsistent data
- Old data
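Both of the causes above lend themselves to simple automated checks: compare a dataset’s last refresh time against an expected window, and verify that values agree across sources. A generic sketch, with invented field names:

```python
from datetime import datetime, timedelta, timezone

def is_stale(last_updated, max_age_hours=24):
    """Old data: flag a dataset not refreshed within the expected window."""
    return datetime.now(timezone.utc) - last_updated > timedelta(hours=max_age_hours)

def inconsistent_keys(source_a, source_b):
    """Inconsistent data: keys present in one source but missing from the other."""
    return set(source_a) ^ set(source_b)

fresh = datetime.now(timezone.utc) - timedelta(hours=2)
print(is_stale(fresh))                                      # recently refreshed
print(inconsistent_keys({"c1": 10, "c2": 20}, {"c1": 10}))  # mismatched customers
```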
Most organizations have to deal with multiple, simultaneous data quality issues: many different data sources exhibiting a combination of the top four causes. Resolving all this takes time and effort, both typically in short supply. Including observability in the process greatly reduces the occurrence of data quality issues.
Now that we’ve seen how to resolve the challenge of data quality, let’s look at what happens when data quality is not taken seriously.
The dangers of bad data
Bad data means false reports, error-filled dashboards, and nonsensical model predictions. Decisions based on wrong information result in less-than-optimal strategies at the highest levels, for example:
- Investing in the wrong sectors.
- Reducing production as opposed to increasing production.
- Reducing capacity instead of increasing capacity.
- The list goes on…
This inevitably erodes confidence in executing the recommendations of models or acting on the conclusions of reports. Decision makers become overly cautious and delay acting on the intelligence reports offer, if they act on it at all.
The power of data observability
We’ve seen how bad data can propagate through a system with serious consequences. Incorporating observability into the data infrastructure helps minimize the number of data issues and speeds their resolution.
Along with some form of observability implementation, here are best practices that the most successful data-driven teams have also implemented:
- A process and data-driven culture.
- Automation of as many tasks as possible to reduce errors.
- Automated deployments that execute only when automated tests pass.
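As a toy illustration of the last practice, a deployment gate can be as simple as refusing to run the deploy step when the test suite fails. This is a generic sketch; in practice teams use a CI/CD system, and the commands below are stand-ins:

```python
import subprocess
import sys

def deploy_if_tests_pass(test_cmd, deploy_cmd):
    """Run the test suite; trigger the deploy command only if the tests pass."""
    if subprocess.run(test_cmd).returncode != 0:
        print("Tests failed; deployment aborted.")
        return False
    subprocess.run(deploy_cmd, check=True)
    return True

# Usage: stand-in commands; in practice these would be e.g. a test runner
# and a deployment script
ok = deploy_if_tests_pass(
    [sys.executable, "-c", "assert 1 + 1 == 2"],       # the automated tests
    [sys.executable, "-c", "print('deploying...')"],   # the deployment step
)
```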
This article only scratches the surface of data observability and quality; for a more detailed review, please check out Fundamentals of Data Observability, published by O’Reilly.
Andy Petrella is the founder and CPO of Kensu, a data observability solution that helps data teams trust what they deliver and create more value from data. He is also the author of “Fundamentals of Data Observability.”