DATA PROFILING
Data profiling (also known as data assessment, data discovery or data quality analysis) is a process of examining the data available in an existing data source (such as database) and collecting statistics and information about it. It is also defined as a systematic up front analysis of the content of the data source. It is a first step of improving data quality.
Before starting to look for data anomalies first we will learn about the data quality.
- We say that the data has a high quality if its:
- Correct - consistent with reality.
- Unambiguous - has only one meaning.
- Consistent - use only one convention to convey its meaning.
- Complete - it has no missing values or null values.
The main task of the data profiling process is identifying the data anomalies and problems that may require cleanup or conforming process before delivering it.
The most common anomalies are: missing data , inconsistent data , duplicate records.
Data cleansing has also additional rules like:
- Checking the relationships between tables (relationship between foreign key and primary key)
- Checking complex business rules.
- Making you more familiar with the data source
It is a very important process because bad quality of the data may cause bad business decisions and increase costs of missed chances for example in sale. It will analyze your data, helps you with better understanding of the data source and prevent problems.
Data profiling process
The data profiling process may consist of many steps, such as:
- Create the Project Initiation Document. It contains the deliverables , boundaries and time of the project. The team should be familiar with this document to avoid spending precious time on performing unnecessary tasks. This document shows also what are the expectations and requirements.
- Choose appropriate tools- analytical and statistical tools which allow you to outline the quality of the data structure. Those tools show which elements of data are the most important(have a high range) or what is the frequency of elements.
- Analyze data sources
- Determine the scope of data.
- Identify the variation of data patterns and formats in the database.
- Identify multiple coding ,redundant values , duplicates , null values , missing values and other anomalies that appear in the data source.
- Check the relationship between primary and foreign key and discover how this relationship influence the data extraction.
- Analyze business rules
- Column profiling-analyze values in each column , discover problems with metadata or content quality.
- Single-table structural profiling-looks for relationships between columns , discovers problems with primary keys and data structure quality.
- Cross-table structural profiling-compares data between tables , looks for overlapping values , duplicate values , analyze foreign keys.
As we can see the data profiling process is necessary. Data in/from every source should be analyzed because It will help avoid later problems during performing business decisions. There is also one thing that you should remember - without the data profiling report you cannot build effective cleaning and conforming system.