Because of a data migration from an RDBMS (Oracle/Teradata) to HDFS (Hive), the requirement is to compare the full data set in the RDBMS against the corresponding data set in Hive. I understand that pulling huge volumes of data from the RDBMS and Hive carries a large network overhead, but that is the requirement. I have developed a basic Java framework in Eclipse that takes source and target queries (with limited rows) and does a side-by-side comparison by fetching the RDBMS and Hive result sets. To make the validation more comprehensive, I also have to compare the keys of both systems and check for duplicates in each system. Here is what I have tried so far:
- Initialised two HashMaps, one for the RDBMS and one for Hive, using the PK as the key and the non-key attributes in an ArrayList as the value, then tried to compare the keys/values between the two maps. However, holding two result sets and two HashMaps in RAM would degrade performance.
- Tried Redis as an in-memory store for the key/value pairs. However, since I am accessing Redis from a Java program, I am not sure how to use Redis hashes/sets the way we use HashMaps/HashSets in Java.
- Wrote the result sets to two separate text files, but writing the files and then reading/processing them is time-consuming.
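For reference, the two-HashMap attempt looks roughly like the sketch below. It is a simplified illustration, not my actual framework: the class name `KeyCompare` and the methods `addRow` and `compare` are made up for this post, and the rows are hard-coded instead of being read from JDBC result sets.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper: compares two in-memory maps of PK -> non-key attributes.
class KeyCompare {

    /** Keys only in the source, keys only in the target, and keys whose values differ. */
    static class Result {
        final Set<String> missingInTarget = new TreeSet<>();
        final Set<String> missingInSource = new TreeSet<>();
        final Set<String> valueMismatch = new TreeSet<>();
    }

    /** Stores one row; returns false if the PK was already present (a duplicate key). */
    static boolean addRow(Map<String, List<String>> rows, String pk, List<String> values) {
        return rows.putIfAbsent(pk, values) == null;
    }

    /** Compares keys and values of the two maps in both directions. */
    static Result compare(Map<String, List<String>> source, Map<String, List<String>> target) {
        Result r = new Result();
        for (Map.Entry<String, List<String>> e : source.entrySet()) {
            List<String> other = target.get(e.getKey());
            if (other == null) {
                r.missingInTarget.add(e.getKey());
            } else if (!e.getValue().equals(other)) {
                r.valueMismatch.add(e.getKey());
            }
        }
        for (String key : target.keySet()) {
            if (!source.containsKey(key)) {
                r.missingInSource.add(key);
            }
        }
        return r;
    }

    public static void main(String[] args) {
        // Hard-coded stand-ins for the RDBMS and Hive result sets.
        Map<String, List<String>> rdbms = new HashMap<>();
        Map<String, List<String>> hive = new HashMap<>();

        addRow(rdbms, "1", Arrays.asList("a", "b"));
        addRow(rdbms, "2", Arrays.asList("c", "d"));
        addRow(rdbms, "3", Arrays.asList("e", "f"));
        boolean inserted = addRow(rdbms, "2", Arrays.asList("c", "d")); // same PK again
        System.out.println("Duplicate PK 2 in source: " + !inserted);

        addRow(hive, "1", Arrays.asList("a", "b"));
        addRow(hive, "2", Arrays.asList("c", "X"));
        addRow(hive, "4", Arrays.asList("g", "h"));

        Result r = compare(rdbms, hive);
        System.out.println("Missing in target: " + r.missingInTarget);
        System.out.println("Missing in source: " + r.missingInSource);
        System.out.println("Value mismatch:    " + r.valueMismatch);
    }
}
```

This works for small inputs, but both maps hold every row of both result sets, which is exactly the memory problem described in the first bullet.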
For fetching the data from the RDBMS, I have done the things mentioned here and here. I guess there may be some tool for this job, but I am trying to develop something open source.