
How to validate the scenario?

Scenario 1:

The source is a flat file that contains millions of records. All of the data from the source file is loaded into a target table in the database.

Now the question is: how do we validate that all of the data has been loaded into the target table correctly?

Note: we can't use Excel (xls) to validate, because the file contains millions of records.

davejal
  • You know the count of lines in the file, and each line should become a new record in the database (in most cases). You know the number of records in the table before and after executing the ETL. Compare the numbers and you will know whether all rows were loaded. This is a quantitative check, not a quality check. As general advice: load the data into a separate table dedicated to the ETL, without transforming the records too much, and use that table as the source in subsequent ETLs. – Pred Jan 28 '16 at 14:09
  • As an additional note: using the tools the DBMS provides, you can be sure that certain errors cannot happen. Use constraints (NOT NULL, CHECK, etc.) to make the ETL fail if the data is insufficient, erroneous, etc. – Pred Jan 28 '16 at 14:31
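A minimal sketch of the quantitative count check Pred describes in the comments above, assuming one record per line in the flat file and using Python with sqlite3 purely as a stand-in for the real DBMS (the file path, database name, and table name are placeholders, not from the question):

    import sqlite3

    SOURCE_FILE = "source_data.txt"   # placeholder path to the flat file
    TARGET_TABLE = "target_table"     # placeholder target table name

    # Count records in the flat file (one record per line assumed;
    # subtract 1 if the file has a header row).
    with open(SOURCE_FILE, "r", encoding="utf-8") as f:
        file_rows = sum(1 for _ in f)

    # Count rows in the target table after the ETL run. If the table was not
    # empty beforehand, compare the before/after difference instead.
    conn = sqlite3.connect("etl_target.db")  # stand-in for the real connection
    table_rows = conn.execute(f"SELECT COUNT(*) FROM {TARGET_TABLE}").fetchone()[0]
    conn.close()

    if file_rows == table_rows:
        print(f"OK: {table_rows} rows loaded, matching the source file.")
    else:
        print(f"MISMATCH: file has {file_rows} rows, table has {table_rows}.")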

3 Answers


There are lots of ways one can validate data. Much of it depends on three things:

  1. How much time do you have for validation?

  2. What are your processing capabilities?

  3. Is the data on a QA or Production SQL server?

If you are in QA and have lots of processing power, you can do basic checks (a sketch of a few of them follows this list):

  • Were there any warnings or errors during the data load?
  • Count the total number of items in the database vs. the raw file
  • Count the total number of null records in the database
  • Check the total number of columns vs. the raw file
  • Check the length of the variables. Are they as expected?
  • Are any character columns unexpectedly truncated?
  • Are numeric columns out to the correct number of significant digits?
  • Are dates reasonable? For example, if you expected dates from 2004, do they say 1970?
  • How many duplicates are there?
  • Check if the data in the columns make sense. A few questions you can ask: are any rows "shifted?" Are numeric variables in numeric columns? Is the key column actually a key? Do the column names make sense? Your check of null records should help detect these things.
  • Can you manually calculate any columns and compare your calculation to the one in the file?
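A rough sketch of a few of the checks above (row count, null counts, duplicate keys, and possible truncation), again in Python with sqlite3 only as a stand-in; the table name, column names, and expected length are assumptions for illustration:

    import sqlite3

    TARGET_TABLE = "target_table"   # placeholder names for illustration
    KEY_COLUMN = "id"
    CHAR_COLUMN = "customer_name"
    EXPECTED_MAX_LEN = 100          # expected column width from the file layout

    conn = sqlite3.connect("etl_target.db")

    # Total rows loaded (compare against the raw file's line count).
    total = conn.execute(f"SELECT COUNT(*) FROM {TARGET_TABLE}").fetchone()[0]
    print("total rows:", total)

    # Null count for a column that should always be populated.
    nulls = conn.execute(
        f"SELECT COUNT(*) FROM {TARGET_TABLE} WHERE {CHAR_COLUMN} IS NULL"
    ).fetchone()[0]
    print(f"null {CHAR_COLUMN}:", nulls)

    # Duplicate keys: should be zero if the key column is really a key.
    dupes = conn.execute(
        f"SELECT COUNT(*) FROM (SELECT {KEY_COLUMN} FROM {TARGET_TABLE} "
        f"GROUP BY {KEY_COLUMN} HAVING COUNT(*) > 1)"
    ).fetchone()[0]
    print("duplicate keys:", dupes)

    # Possible truncation: values sitting at or above the expected maximum width.
    truncated = conn.execute(
        f"SELECT COUNT(*) FROM {TARGET_TABLE} WHERE LENGTH({CHAR_COLUMN}) >= ?",
        (EXPECTED_MAX_LEN,),
    ).fetchone()[0]
    print("values at or above the expected max length:", truncated)

    conn.close()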

If you are low on processing power, or are on a production server and do not want to risk degrading performance for other users, you can do many of the above checks with a simple random sample. Take, say, 100,000 rows at a time, or stratify the sample if needed.
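One way to sketch that sampling idea: draw a random sample of lines from the flat file (reservoir sampling, so the whole file never has to sit in memory) and confirm that each sampled key exists in the target table. The delimiter, key position, and table/column names below are assumptions for illustration:

    import random
    import sqlite3

    SOURCE_FILE = "source_data.txt"   # placeholder: pipe-delimited, key in field 0
    SAMPLE_SIZE = 1000

    # Reservoir sampling (Algorithm R): keep a uniform random sample of lines
    # while streaming through the file once.
    sample = []
    with open(SOURCE_FILE, "r", encoding="utf-8") as f:
        for i, line in enumerate(f):
            if i < SAMPLE_SIZE:
                sample.append(line)
            else:
                j = random.randint(0, i)
                if j < SAMPLE_SIZE:
                    sample[j] = line

    conn = sqlite3.connect("etl_target.db")  # stand-in for the real connection
    missing = []
    for line in sample:
        key = line.split("|")[0].strip()
        if conn.execute("SELECT 1 FROM target_table WHERE id = ?", (key,)).fetchone() is None:
            missing.append(key)
    conn.close()

    print(f"{len(missing)} of {len(sample)} sampled keys are missing from the table")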

These are just a few checks you can do. The more comparisons and sanity checks, the better off you are.

Most importantly, communicate these findings, and anything that seems strange, to the file owner. They should be able to give you additional insight into whether the data load is correct, or whether they even gave you the right file in the first place.

You're loading the data and providing as many reasonable checks as possible. If they're satisfied with the outcome, and you're satisfied with the outcome, you should consider the data valid.

Stu Sztukowski

I think the most complete solution would be to export the table back to a second flat file that should be identical to the first, and then write a script that does a line-by-line diff. You will be able to see if even a single row is different.
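A minimal sketch of that line-by-line comparison, assuming the export reproduces the original file's row order and formatting exactly (the file names are placeholders):

    from itertools import zip_longest

    ORIGINAL_FILE = "source_data.txt"    # the original flat file (placeholder)
    EXPORTED_FILE = "exported_data.txt"  # the table exported back to a flat file

    mismatches = 0
    with open(ORIGINAL_FILE, encoding="utf-8") as a, open(EXPORTED_FILE, encoding="utf-8") as b:
        # zip_longest pads the shorter file with None, so extra or missing
        # rows show up as mismatches too.
        for line_no, (left, right) in enumerate(zip_longest(a, b), start=1):
            if left != right:
                mismatches += 1
                if mismatches <= 10:  # print only the first few differences
                    print(f"line {line_no} differs: {left!r} != {right!r}")

    print(f"{mismatches} differing lines found")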

Given that you are migrating millions of rows of data, I'm assuming that running a script overnight is not a big deal compared with the value of data integrity.

For quick validation you can just check that the row counts are the same and that there is no obviously bad data, for example a column mapped incorrectly or an entire column being null.

Gilgamesh
  • I have a requirement where I have to validate every value in each Source [flat file] column against the corresponding Target [database] column. The number of records in the source file is 10 million. – Ranjan Kumar Das Feb 02 '16 at 16:05

I'm not an expert on loading from files, but if I had to solve this issue I would follow something like this (a rough sketch appears after the steps):

  1. Load the file into a plain TableA with no restrictions, so the import process runs OK.
  2. Create another TableB with all the validations: types, string lengths, FKs.
  3. Create a stored procedure to move the data from TableA to TableB.
  4. Include error handling, and insert into another table, Errors, where you record row_id and err_description.
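The answer proposes a stored procedure for step 3; as a rough illustration of the same staging-plus-errors pattern, here is a Python/sqlite3 sketch (TableA, TableB, the Errors table, and the validation rule are all invented for the example):

    import sqlite3

    conn = sqlite3.connect(":memory:")  # stand-in for the real database
    conn.executescript("""
        CREATE TABLE TableA (row_id INTEGER, amount TEXT);   -- staging, no constraints
        CREATE TABLE TableB (row_id INTEGER PRIMARY KEY,
                             amount REAL NOT NULL CHECK (amount >= 0));
        CREATE TABLE Errors (row_id INTEGER, err_description TEXT);
        INSERT INTO TableA VALUES (1, '10.5'), (2, 'abc'), (3, '-4');
    """)

    # Move rows from the staging table to the validated table one at a time,
    # capturing any conversion or constraint failure in the Errors table.
    staging_rows = conn.execute("SELECT row_id, amount FROM TableA").fetchall()
    for row_id, amount in staging_rows:
        try:
            conn.execute("INSERT INTO TableB VALUES (?, ?)", (row_id, float(amount)))
        except (ValueError, sqlite3.Error) as exc:
            conn.execute("INSERT INTO Errors VALUES (?, ?)", (row_id, str(exc)))
    conn.commit()

    print("loaded:", conn.execute("SELECT COUNT(*) FROM TableB").fetchone()[0])
    print("errors:", conn.execute("SELECT COUNT(*) FROM Errors").fetchone()[0])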
Juan Carlos Oropeza