Due to a data migration from an RDBMS (Oracle/Teradata) to HDFS (Hive), the requirement is to compare the full data set in the RDBMS against the data set in Hive. I understand that pulling huge volumes of data out of the RDBMS/Hive is a big network overhead, but that is the requirement. I have developed a basic Java framework in Eclipse which takes source and target queries (with limited rows) and does a side-by-side comparison by fetching the RDBMS and Hive result sets. However, to make the validation more comprehensive I have to compare the keys of both systems and check for duplicates on both sides. Here is what I have tried so far:

  1. Initialised two HashMaps, one for the RDBMS and one for Hive, with the PK as the key and the non-key attributes in an ArrayList as the value, then compared the keys/values between the two maps (see the first sketch after this list). But loading two result sets and two HashMaps into RAM degrades performance.

  2. Tried to use the Redis in-memory database for storing the key/value pairs, but as I am accessing Redis from a Java program I am not sure how to use Redis hashes/sets the way we use HashMap/HashSet in Java (see the Jedis sketch after this list).

  3. Wrote the result sets into two different text files, but writing the files and then reading/processing them is time consuming.
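
For reference, here is a minimal sketch of the in-memory comparison from point 1. It assumes the JDBC/Hive connections are already open, that the primary key is a single column, and the column positions are placeholders:

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class KeyCompare {

    // Load a result set into key -> non-key columns; duplicate keys are counted as they are seen.
    static Map<String, List<String>> load(Connection conn, String sql, int keyCol,
                                          Map<String, Integer> duplicates) throws Exception {
        Map<String, List<String>> rows = new HashMap<>();
        try (Statement st = conn.createStatement(); ResultSet rs = st.executeQuery(sql)) {
            int cols = rs.getMetaData().getColumnCount();
            while (rs.next()) {
                String key = rs.getString(keyCol);
                List<String> values = new ArrayList<>();
                for (int i = 1; i <= cols; i++) {
                    if (i != keyCol) values.add(rs.getString(i));
                }
                // put() returns the previous value if the PK was already present, i.e. a duplicate
                if (rows.put(key, values) != null) {
                    duplicates.merge(key, 1, Integer::sum);
                }
            }
        }
        return rows;
    }

    static void compare(Map<String, List<String>> source, Map<String, List<String>> target) {
        for (Map.Entry<String, List<String>> e : source.entrySet()) {
            List<String> targetValues = target.get(e.getKey());
            if (targetValues == null) {
                System.out.println("Key missing in target: " + e.getKey());
            } else if (!targetValues.equals(e.getValue())) {
                System.out.println("Value mismatch for key: " + e.getKey());
            }
        }
        for (String key : target.keySet()) {
            if (!source.containsKey(key)) {
                System.out.println("Key missing in source: " + key);
            }
        }
    }
}
```

Because both maps are built fully in memory, this only scales to whatever fits in the JVM heap, which is exactly the limitation that points 2 and 3 try to work around.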

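For point 2, one common way to get HashMap/HashSet-like behaviour out of Redis from Java is a client such as Jedis: a Redis HASH per system plays the role of the Java HashMap, with the primary key as the field. The sketch below rests on that assumption; the key names ("rdbms:rows", "hive:rows"), the delimiter, and the connection details are made up, and the Jedis client must be on the classpath:

```java
import java.util.Set;

import redis.clients.jedis.Jedis;

public class RedisKeyCompare {
    public static void main(String[] args) {
        // Connection details are placeholders; adjust host/port for your environment.
        try (Jedis jedis = new Jedis("localhost", 6379)) {

            // A Redis HASH per system acts like the Java HashMap: the primary key is the
            // field, the concatenated non-key columns are the value.
            // HSET returns 0 when the field already existed, i.e. a duplicate primary key.
            long isNew = jedis.hset("rdbms:rows", "PK_1001", "col2|col3|col4");
            if (isNew == 0) {
                System.out.println("Duplicate key in RDBMS extract: PK_1001");
            }
            jedis.hset("hive:rows", "PK_1001", "col2|col3|col4");

            // Compare the two hashes key by key, the same way the two HashMaps were compared.
            // (hkeys pulls only the keys to the client; for very large sets HSCAN can be
            // used to iterate incrementally instead.)
            Set<String> rdbmsKeys = jedis.hkeys("rdbms:rows");
            for (String key : rdbmsKeys) {
                String hiveValue = jedis.hget("hive:rows", key);
                if (hiveValue == null) {
                    System.out.println("Key missing in HIVE: " + key);
                } else if (!hiveValue.equals(jedis.hget("rdbms:rows", key))) {
                    System.out.println("Value mismatch for key: " + key);
                }
            }
            for (String key : jedis.hkeys("hive:rows")) {
                if (!rdbmsKeys.contains(key)) {
                    System.out.println("Key missing in RDBMS: " + key);
                }
            }
        }
    }
}
```
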
For the fetching part of the data from the RDBMS I have done the things mentioned here and here. I guess there may be some tool for this job, but I am trying to develop something open source.


1 Answer

Does your data have a timestamp or any increasing value that can be used to order it, or can a duplicate element from one data source be anywhere in the other source? If there is anything to order the data by (like a timestamp), you could use any kind of streaming system and simply perform a distinct selection. However, more information is required about the type of data you are working with.
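
If the data can be ordered as suggested (by the primary key or a timestamp), one way to avoid holding either data set in memory is to walk the two ordered JDBC result sets in lockstep, merge-join style. This is only a sketch: it assumes both queries select the key as the first column, are ordered by it the same way on both systems, and that connection setup and fetch-size tuning are handled elsewhere.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

public class StreamingCompare {

    // Walk two result sets that are both ORDER BY the same key, merge-join style,
    // so neither side has to be fully loaded into memory. For brevity the key is
    // assumed to be column 1 and a single non-key column sits in column 2.
    static void compareOrdered(Connection rdbms, Connection hive,
                               String rdbmsSql, String hiveSql) throws Exception {
        try (Statement s1 = rdbms.createStatement(); Statement s2 = hive.createStatement();
             ResultSet r1 = s1.executeQuery(rdbmsSql); ResultSet r2 = s2.executeQuery(hiveSql)) {

            boolean has1 = r1.next(), has2 = r2.next();
            while (has1 && has2) {
                String k1 = r1.getString(1), k2 = r2.getString(1);
                int cmp = k1.compareTo(k2);   // keys compared as strings; both sides must sort the same way
                if (cmp == 0) {
                    if (!String.valueOf(r1.getString(2)).equals(String.valueOf(r2.getString(2)))) {
                        System.out.println("Value mismatch for key " + k1);
                    }
                    has1 = r1.next();
                    has2 = r2.next();
                } else if (cmp < 0) {
                    System.out.println("Key only in RDBMS: " + k1);
                    has1 = r1.next();
                } else {
                    System.out.println("Key only in HIVE: " + k2);
                    has2 = r2.next();
                }
            }
            // Drain whichever side still has rows; those keys exist on one side only.
            while (has1) { System.out.println("Key only in RDBMS: " + r1.getString(1)); has1 = r1.next(); }
            while (has2) { System.out.println("Key only in HIVE: " + r2.getString(1)); has2 = r2.next(); }
        }
    }
}
```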