1

I have a requirement wherein I need to import a table from mysql to hive incrementally and am facing issues in doing that. This is what I have tried so far:

  1. I have create a job to import the table with the below mentioned query.

/sqoop job

--create test2 -- import 
--connect jdbc:mysql://URL
--username username 
--password password 
--table mysqlTablename 
--hive-import
--hive-overwrite
--direct 
--incremental lastmodified 
--check-column last_modified_time 
--last-value 0

First execution: Imports everything as expected with min boundary as '0' and the max boundary as current time. Second execution: All the changes from the last run are picked up, but the old rows are getting overwritten leaving only the rows that have been changed from the last run.

  1. I removed --hive-overwrite and --hive-import options and used the '--target-dir ' option. First execution: Gets everything as expected with min boundary as '0' and the max boundary as current time, but does not show up in Hive as the metastore is not updated. Second execution : It throws an error stating the directoty mentioned as parameter to --target-dir option already exist. This executes after removing the directoty from HDFS but does not solve the purpose.

I've found mention of this as a problem, and have not found a solution other than to put the new values into a side directory and run sqoop merge on the data to flatten it. I'd like to automate this in a shell script, and was wondering if there is some better way to handle this incremental update.

Wanted to check what would be the best option for the requirement wherein I need to import a mysql table and update the changes incrementally based on the column with the last modified time stamp into hive. i.e. create, update or delete the rows in Hive table based on the changes in mysql to keep them in sync.

Any help on this is greatly appreciated.

Regards Rohit

Rohit P
  • 11
  • 2

2 Answers2

1

Its very difficult for the hive-based system to handle the incremental loads involving updates to the records. This link would give much viable solution.

Community
  • 1
  • 1
sun_dare
  • 1,146
  • 2
  • 13
  • 33
1

use append command instead of lastmodified to get the updates continuously.

Eg :

--create test2 -- import 
--connect jdbc:mysql://URL
--username username 
--password password 
--table mysqlTablename 
--hive-import
--direct 
--incremental append 
--check-column last_modified_time 
--last-value "0"
Suresh Ram
  • 1,034
  • 3
  • 16
  • 40