data validation

What is Clean Data?


Before we dig into data validation, it's important to understand what clean data is. Clean data is good data, valid data, and data that actually belongs in a database or table. There are a number of ways that dirty data can get into a database, such as:

1. Users entering invalid characters,
2. It could come from front-end processes that aren't accurately completed,
3. It could be a result of back-end processes (i.e. stored procedures) not accurately carrying data over from one db to another, or
4. System integration where many different systems are in use; especially where legacy systems are working with newer systems that weren't built to operate together

Regardless of how it gets there, an analysis is only as good as the data... and dirty data will render any analysis worthless.

 

Cleaning data requires front-end users that are familiar with the processes and technical experts (subject matter experts - SMEs) that are familiar with back-end processes, databases, and tables. For larger organizations there could be several front-end users required to filter through the data, in some cases line by line, to validate transactions. If an organization is having issues with dirty data, a thorough review of the processes that are in place should be completed. The processes should then be mapped and end-users should receive training to ensure they are accurately following the processes consistent with how the system was intended to operate. In many cases bad data is the result of users not properly executing the processes that are in place and the back-end processes that are in place aren't equipped to handle the countless ways a process can be completed.

 

For example, in warehouse operations an end-user may incorrectly pull stock from the shelf and rather than making a correction, they continue with processing and pulling products. Or, maybe they take corrective measures to fix the erroneous transaction, but they enter the wrong quantity when fixing the error. Then the forecasting team begins to pull the data and observes an increase in demand for the product that was erroneously pulled, making forecast projections that are inaccurate. This is why both the end-user and forecaster must work together to evaluate anomalies and ensure bad data is purged from the system. This could cause an inflated forecast for products that aren't even needed... costing a lot of money to a company; and this is one simple scenario that could occur daily.

 

What is Data Validation?


Data validation is the process of checking the accuracy and quality of data to ensure data integrity; that the data is clean and valid. This is usually completed through business rules that are set up on the back-end. This is especially important when completing ETL (Extract Transfer Load) jobs and moving data from one database to another - particularly when legacy systems are integrated with newer applications/programs. A process may check to make sure that data in Db A is the same type as what is in Db B for a particular field. For example, if you are creating a new table that joins data from a table in Db A and Db B, you will want to make sure that the columns are formatted the same way (character, decimal, date values, etc.), otherwise the output could produce inconsistencies.

 

Data Validation Techniques


There are a number of ways to implement, and enforce, valid data. Usually it is left to the programmer to ensure script is put in place to remove the likelihood of user entry errors. A couple of popular (and easy) techniques are range checking and type checking. With range checking the script is built to only accept specific ranges of data, whereas in type checking the type of data is enforced to ensure random characters aren't accepted into the system. Additionally, a system source loop back verification could be implemented. This is a procedure where data from the source system would be checked against data in a data warehouse; if the values equal data is considered "good." Visit Six Data Validation Techniques for additional techniques that can be applied.