09 Feb 2017 Data Quality Series – Introduction
This article is the first in a series of blog posts about the topic “Data Quality”. Over the next couple of weeks we will go through the following subjects:
Data quality with EDQ
Part 1: Data Profiling
Part 2: Data Standardization
Part 3: Data Deduplication
Data quality with Informatica
Part 1: Data Profiling
Part 2: Data Standardization
Part 3: Data Deduplication
The focus of this first article is to introduce “Data Quality”.
1. Introduction to Data Quality
Data quality is the perception or assessment of the fitness of data to serve its purpose in a given context [1].
Good quality data is crucial for analysts to create reports that contain accurate and consistent information, and in some businesses, bad quality or out-of-date data may increase costs. For example, a mailing campaign that sends letters to the wrong customer addresses is a waste of money. Moreover, reports containing poor quality data may be regarded by customers as erroneous and thus reduce their confidence in the delivered dashboards. In fact, according to Gartner, a loss of confidence by users in their reports/dashboards is the number 1 cause of Data Warehouse / Data Mart / Data Governance project failures.
Many data quality tasks sit close to the database level and can be performed by a DBA, for instance by adding column constraints that allow only valid values, or by setting default date formats. In some scenarios, however, these validations are not in place, and dedicated data quality tools can help us analyse and fix the issues. The term “Data Quality” covers many different aspects:
Validity: The data conforms to the syntax (format, type, range) of its definition, i.e. the database, metadata or documentation rules on allowable types (string, integer, floating point, etc.), format (length, number of digits, etc.) and range (minimum, maximum, or containment within a set of allowable values).
Accuracy: The data correctly describes the “real world” object or event being described. Does it agree with an identified reference of correct information?
Reasonableness: Does the data align with the operational context? For example, a birthdate of 01/01/01 is valid, but is it reasonable in context?
Completeness: The proportion of non-blank values against blank values. Business rules define what “100% complete” represents: optional data might be missing, but if the data meets the expectations it can be considered complete.
Consistency: Values across all systems in an organization reflect the same information and are in sync with each other.
Currency: The data is current and “fresh”; data lifecycle is a key factor.
Precision: The level of detail of the data element, e.g. the number of significant digits in a number. Rounding, for example, can introduce errors.
Privacy: The need for access control and usage monitoring.
Referential Integrity: Constraints against duplication are in place (e.g. foreign keys in an RDBMS).
Timeliness: The data is available for use when expected and needed.
Uniqueness: No value occurs more than once in the data set.
Data quality is affected by the way data is entered, stored and managed, and data quality assurance (DQA) is the process of verifying the reliability and effectiveness of data. In the section below we will see the basic approach to a successful DQA project.
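Before looking at how such a project is structured, here is a minimal sketch of how a few of these dimensions (validity, completeness and uniqueness) can be checked programmatically. The customer table, column names and rules below are illustrative assumptions, not examples from the article; the sketch uses Python and pandas:

```python
# A minimal sketch of a few data quality checks (validity, completeness,
# uniqueness) on a hypothetical customer table, using pandas.
# Column names and rules are illustrative assumptions, not a real schema.
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "email":       ["a@example.com", "not-an-email", None, "d@example.com"],
    "birth_date":  ["1985-03-12", "2001-01-01", "01/01/01", None],
})

# Validity: does the value conform to the expected syntax?
valid_email = customers["email"].str.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", na=False)
valid_birth = pd.to_datetime(customers["birth_date"],
                             format="%Y-%m-%d", errors="coerce").notna()

# Completeness: proportion of non-blank values per column.
completeness = customers.notna().mean()

# Uniqueness: no customer_id should occur more than once.
duplicated_ids = customers[customers["customer_id"].duplicated(keep=False)]

print("Invalid emails:\n", customers.loc[~valid_email, "email"])
print("Invalid birth dates:\n", customers.loc[~valid_birth, "birth_date"])
print("Completeness per column:\n", completeness)
print("Duplicate customer_id rows:\n", duplicated_ids)
```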
2. Data Quality Projects
Every data quality project needs to start with an understanding of the benefits that we are going to obtain from it, that is, an assessment and understanding of the value of the data as a key for better decision-making and improved business performance, or, in other words, its utility. It is possible to analyse how data is being used to achieve business objectives, and how the achievement of these goals is impeded when flawed data is introduced into the environment. To do this, we must consider the following points:
• What the business expectations for data quality are
• How the business can be impacted by poor data quality
• How to correlate such impacts with specific data quality issues
Once it has been determined that poor quality data is negatively impacting the way the business operates, the next step is to plan how to assemble a data quality management programme and institute the practices that will lead to improvement. This plan must consider:
• The processes that need to be instituted
• The participants that will execute those processes, and their roles and responsibilities
• The tools that will be used to support the processes
Normally, a data quality project lifecycle involves at least three different steps:
1. Data Profiling
2. Data Standardization or Cleansing
3. Data Matching and Deduplication
3. Data Profiling
Data profiling is the analysis of data in order to clarify its structure, content, relationships and derivation rules. It mainly involves gathering different aggregate statistics or informative summaries about the data, and ensuring that the values match up to expectations. Profiling helps not only to understand anomalies and to assess data quality, but also to discover, register and assess enterprise metadata; thus the purpose of data profiling is both to validate metadata when it is available and to discover metadata when it is not. The typical outputs of the data profiling process are listed below, followed by a short code sketch:
Column Profiling:
• Record count, unique count, null count, blank count, pattern count
• Minimum, maximum, mean, mode, median, standard deviation
• Completeness & number of non-null records
• Data types
• Primary key candidates
Frequency Profiling:
• Count/ratio of distinct values
Primary/Foreign Key Analysis:
• Referential integrity checks (can be achieved by creating a join profile)
• Candidate primary and foreign keys
Duplicate Analysis:
• Identify potential duplicate records
Business Rules Conformance:
• Whether the data conforms to an initial set of business rules
Outlier Analysis:
• Identify possible bad records
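As a rough illustration of the column and frequency profiling outputs above, here is a minimal sketch in Python with pandas. The sample table and its columns are assumptions made for this example; dedicated tools such as EDQ or Informatica produce these profiles out of the box:

```python
# A minimal sketch of column and frequency profiling with pandas,
# approximating some of the outputs listed above (counts, nulls, distinct
# values, basic statistics). The sample data is an illustrative assumption.
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 4],
    "country":     ["UK", "uk", "USA", None, "USA"],
    "amount":      [100.0, 250.5, None, 75.0, 75.0],
})

profile = {}
for col in df.columns:
    s = df[col]
    numeric = pd.api.types.is_numeric_dtype(s)
    profile[col] = {
        "records":       len(s),
        "nulls":         int(s.isna().sum()),
        "distinct":      int(s.nunique(dropna=True)),
        "completeness":  round(float(s.notna().mean()), 3),
        "inferred_type": str(s.dtype),
        # Numeric statistics only make sense for numeric columns.
        "min":  s.min() if numeric else None,
        "max":  s.max() if numeric else None,
        "mean": round(float(s.mean()), 3) if numeric else None,
    }

# Column profiling summary, one row per column.
print(pd.DataFrame(profile).T)

# Frequency profiling: count of each distinct value for a given column.
print(df["country"].value_counts())
```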
4. Data Standardization
Once the data profiling process is finished, the next step is to fix the issues that have been identified, by applying rules, replacing incorrect values and resolving data inconsistencies. Data standardization, or data cleansing, is the process of developing and applying technical standards to the data, to maximize its compatibility, interoperability, safety and quality.
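As an illustration, here is a minimal sketch of two standardization rules in plain Python: mapping free-text country values to a canonical form and normalizing phone numbers. The field names, synonym table and rules are assumptions made for this example, not rules taken from EDQ or Informatica:

```python
# A minimal sketch of data standardization rules: trimming whitespace,
# normalizing case, and mapping known variants to a canonical value.
# The synonym table and rules below are illustrative assumptions.
import re

COUNTRY_SYNONYMS = {
    "uk": "United Kingdom",
    "u.k.": "United Kingdom",
    "united kingdom": "United Kingdom",
    "usa": "United States",
    "u.s.a.": "United States",
    "united states": "United States",
}

def standardize_country(raw: str) -> str:
    """Map a free-text country value to a canonical form."""
    key = raw.strip().lower()
    return COUNTRY_SYNONYMS.get(key, raw.strip().title())

def standardize_phone(raw: str) -> str:
    """Keep digits only, preserving a leading '+' if present."""
    digits = re.sub(r"\D", "", raw)
    return ("+" + digits) if raw.strip().startswith("+") else digits

print(standardize_country("  u.k. "))            # -> United Kingdom
print(standardize_phone("+44 (0)20 7946 0958"))  # -> +442079460958
```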
5. Data Matching and Deduplication
Record matching is the process of finding duplicate records in the data, that is, records which may relate to a single real world entity. This may seem like a trivial task, but in reality it is far from it.
The first challenge is to identify the criteria under which two records represent the same entity, considering that the data may come in different formats and use different conventions (free text fields), may be incomplete and/or incorrect (typos, etc.), and that the meaning of a value may depend on the context of several fields (for example, four different fields may together represent a single address). We want to identify duplicates regardless of these differences in the data, so we need to define measures of distance between the records and then apply rules to decide whether they classify as duplicates. Moreover, efficient computation strategies are needed to find duplicate pairs in large volumes of data.
The number of pairwise comparisons needed with naive methods grows quadratically with the size of the data set, quickly leading to extremely long execution times, so techniques such as clustering (blocking) the data beforehand become necessary.
Data deduplication, also known as consolidation, is the process of merging record pairs that represent the same real-world entity. It relies on record matching to find pairs of possible duplicate records.
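To make the idea concrete, here is a minimal sketch of record matching with a simple blocking step, using only the Python standard library. The sample records, blocking key and similarity threshold are assumptions chosen for illustration; real tools use far richer comparison functions and survivorship rules:

```python
# A minimal sketch of record matching with blocking: compare only records
# that share a blocking key, then flag pairs whose name similarity exceeds
# an assumed threshold. Records, key and threshold are illustrative.
from difflib import SequenceMatcher
from itertools import combinations
from collections import defaultdict

records = [
    {"id": 1, "name": "John Smith",   "city": "London"},
    {"id": 2, "name": "Jon Smith",    "city": "London"},
    {"id": 3, "name": "Maria Garcia", "city": "Madrid"},
]

def blocking_key(rec):
    # Compare only records sharing the same city and name initial,
    # to avoid a quadratic number of comparisons on large data sets.
    return (rec["city"].lower(), rec["name"][:1].lower())

def similarity(a, b):
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()

blocks = defaultdict(list)
for rec in records:
    blocks[blocking_key(rec)].append(rec)

matches = []
for block in blocks.values():
    for a, b in combinations(block, 2):
        score = similarity(a, b)
        if score >= 0.85:  # assumed match threshold
            matches.append((a["id"], b["id"], round(score, 2)))

print(matches)  # -> [(1, 2, 0.95)]
```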
6. Data Quality Tools
All the tasks involved in a data quality assurance project can be done manually on the database, or in the ETL processes if the data is being loaded from another source. However, there are many vendor applications on the market that make these tasks much easier. Most of these applications can be integrated with other ETL tools, and offer batch processing as well as real-time processing to ensure we always have the best quality data. In the next article we will introduce Oracle Enterprise Data Quality (EDQ), a leading data quality tool.
References:
[1] Margaret Rouse, “data quality”. TechTarget, November 2005. Web. Accessed January 2017. http://searchdatamanagement.techtarget.com/definition/data-quality
Authors: Javier Giaretta, Nicolas Ribas and Axel Bernaus