I have the use case below :
My application has a table having multiyear data in RDBMS DB. We have used sqoop to get data into HDFS and have loaded into hive table partitioned by year, month.
Now, the application updates, and inserts new records into RDBMS Table table daily as well. These updated records can span across history months. Updated records and new insert records can be determined by updated timestamp field (it will have current day timestamp).
Now the problem here is : how to do delta/incremental load hive table daily using these updated records.
-> I know there is a sqoop functionality which allows incremental imports. But, only new incremental import is not enough for us.
Because -
-> I can not directly insert these records (using insert into) in hive table because it will result in duplicate records (updated records).
-> Same way I can not use insert overwrite statement as these are just update and insert records spanning across multiple month. Insert overwrite will delete earlier records.
Of course, easiest option is to get full data using sqoop daily but we don't want to do it as data volume is large.
So , basically we want to fully load only those partitions for which we have received update/insert records.
We are open to explore option at hive or sqoop end. Can you please let us know?
Thanks in advance.