
Whenever I try to import a huge volume of data from Teradata to Hive, it gets stuck at the last two or three mappers for more than 2 hours. I am using 8 mappers with --split-by. Is there any way to increase performance? Since I am in prod, I am using a small number of mappers. Kindly help.

user8587005
    If it is getting stuck at the last two/three mappers for more than 2 hours, then your split column is not evenly distributed. What happens is explained in this answer: https://stackoverflow.com/a/37389134/2700344 – leftjoin Jul 14 '18 at 19:21
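To check whether the split column is skewed before running the import, you can inspect its value distribution with `sqoop eval`. This is a minimal sketch; the connection string, credentials, table, and column names are placeholders you would replace with your own:

```shell
# Count rows per split-column value. A heavily skewed distribution
# means a few mappers receive far more rows than the rest, which is
# what makes the last mappers run for hours.
# All connection details and object names below are hypothetical.
sqoop eval \
  --connect jdbc:teradata://td-host/DATABASE=mydb \
  --username myuser -P \
  --query "SELECT my_split_col, COUNT(*) AS cnt
           FROM mytable
           GROUP BY my_split_col
           ORDER BY cnt DESC"
```

If the counts vary by orders of magnitude, consider splitting on a more uniformly distributed column (an evenly spread numeric key works best).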

1 Answer


Along with increasing the number of mappers, you can improve performance by increasing the fetch size. Use the following syntax in the Sqoop command:

--fetch-size=<n>

where <n> represents the number of entries that Sqoop fetches at a time. The default is 1000; you can set it to 10000 or more.

Note: Increase the value of the fetch-size argument based on the volume of data that needs to be read, and set it according to the available memory and bandwidth.

Also increase the heap size in the Sqoop command to avoid memory issues such as heap space exceptions or out-of-memory errors. Increase memory using the properties below in the Sqoop command:

-Dmapreduce.map.memory.mb=8192 -Dmapreduce.map.java.opts=-Xmx7200m
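Putting these suggestions together, a full import command might look like the following sketch. The host, credentials, table, split column, and Hive table names are all placeholders, and the memory and fetch-size values should be tuned to your cluster:

```shell
# Sketch of a Teradata-to-Hive import with the tuning options above.
# Generic -D Hadoop properties must come right after the tool name,
# before the tool-specific arguments.
# All connection details and object names are hypothetical.
sqoop import \
  -Dmapreduce.map.memory.mb=8192 \
  -Dmapreduce.map.java.opts=-Xmx7200m \
  --connect jdbc:teradata://td-host/DATABASE=mydb \
  --username myuser -P \
  --table mytable \
  --split-by my_split_col \
  --num-mappers 8 \
  --fetch-size 10000 \
  --hive-import \
  --hive-table mydb.mytable
```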

Sandeep Singh