Trying to fetch data from DB2 to HBase using Sqoop is very slow

Question

Thanks in advance.

I have been trying to import data from DB2 into an HBase table using Sqoop, but it takes a very long time even to start the map and reduce phases. I only ever see Map 0 and Reduce 0.

When I run the same query directly in DB2, the results come back faster than I expected. But the import into HBase takes a very long time (10 hours). I also created a small sample data set in DB2 (150 records) and tried importing that into HBase, and it still takes the same amount of time.

sqoop import --connect jdbc:db2://{hostname}:50001/databasename --username user --password pass --hbase-create-table --hbase-table new_tbl --column-family abc --hbase-row-key=same  --query "select a,b,c,d,e concat(a,e) from table_name where \$CONDITIONS AND a>='2018-08-01 00:00:01' and b<='2018-08-01 00:00:02'"  -m 1

I tried adjusting the following configurations:

yarn.nodemanager.resource.memory-mb=116800
yarn.scheduler.minimum-allocation-mb=4096
mapreduce.map.memory.mb=4096
mapreduce.reduce.memory.mb=8192
mapreduce.map.java.opts=-Xmx3072m
mapreduce.reduce.java.opts=-Xmx6144m
yarn.nodemanager.vmem-pmem-ratio=2.1

On the Sqoop side I have also tried tweaking the query and a few settings: -m 4 creates some inconsistency in the records, and removing the timestamp filter (on a and b) made no difference. It still takes a long time (10 hours).

HBase performance test results are pretty good:

        HBase Performance Evaluation
                Elapsed time in milliseconds=705914
                Row count=1048550
        File Input Format Counters
                Bytes Read=778810
        File Output Format Counters
                Bytes Written=618

real    1m29.968s
user    0m10.523s
sys     0m1.140s

Read this about split-by column https://stackoverflow.com/a/37389134/2700344 — leftjoin, Dec 01 '18 at 17:48

score 1 · Answer 1 · answered Dec 01 '18 at 09:12

It is hard to suggest anything unless you show sample data and the column data types. Extra mappers work correctly and efficiently only when the records are distributed fairly among them. If the table has a primary key, you can pass it as the split column (--split-by) so that the mappers share the workload evenly and fetch their slices in a balanced way. While the job runs, you can also see the split key ranges and record counts in the log.
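A minimal sketch of an import that splits on a primary key, reusing the placeholder names from the question. The column name id is hypothetical (the question does not show a primary key), and using it as the HBase row key is also an assumption made only to keep the example self-consistent. The script prints the command by default rather than running it.

```shell
#!/bin/sh
# Dry run by default: the command is printed, not executed.
# Set SQOOP=sqoop in the environment to run the import for real.
SQOOP="${SQOOP:-echo sqoop}"

# Assumption: "id" is an evenly distributed primary key of table_name.
SPLIT_COL="id"

import_with_split() {
  # \$CONDITIONS stays literal; Sqoop replaces it with each mapper's range.
  $SQOOP import \
    --connect "jdbc:db2://{hostname}:50001/databasename" \
    --username user --password pass \
    --hbase-create-table --hbase-table new_tbl \
    --column-family abc --hbase-row-key "$SPLIT_COL" \
    --query "select $SPLIT_COL,a,b,c,d,e from table_name where \$CONDITIONS" \
    --split-by "$SPLIT_COL" \
    -m 4
}

import_with_split
```

With a free-form --query, Sqoop needs either --split-by or -m 1, so the split column has to be given explicitly as soon as more than one mapper is requested.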

If your cluster does not have enough memory, the job may take longer, and it can sit in the submitted state for a long time because YARN cannot allocate the memory to run it.
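As a rough check of the memory point, the settings posted in the question can be turned into a per-node container count. This is only a sketch: it ignores the ApplicationMaster container, other running applications, and vcore limits, and it assumes the posted values are the ones actually in effect on the NodeManagers.

```shell
#!/bin/sh
# Values copied from the configuration posted in the question (MB).
NODE_MB=116800      # yarn.nodemanager.resource.memory-mb
MAP_MB=4096         # mapreduce.map.memory.mb
REDUCE_MB=8192      # mapreduce.reduce.memory.mb

# Integer division: whole containers that fit on one NodeManager.
MAP_SLOTS=$((NODE_MB / MAP_MB))
REDUCE_SLOTS=$((NODE_MB / REDUCE_MB))

echo "map containers per node:    $MAP_SLOTS"
echo "reduce containers per node: $REDUCE_SLOTS"
```

While the job is pending, yarn application -list shows whether it is still in the ACCEPTED state, which is where an application waits when YARN has not yet allocated a container for it.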

Instead of importing into HBase, first try importing into HDFS as the storage location, check the performance, and look at the job details to understand the MapReduce behavior.
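The HDFS test could look roughly like this. The target directory /tmp/sqoop_db2_test is a hypothetical scratch path, and the query is the one from the question with the HBase options and the timestamp filter left out. As written, the script only prints the command.

```shell
#!/bin/sh
# Dry run by default: the command is printed, not executed.
# Set SQOOP=sqoop in the environment to run the import for real.
SQOOP="${SQOOP:-echo sqoop}"

# Assumption: a scratch HDFS directory that is safe to overwrite.
TARGET_DIR="/tmp/sqoop_db2_test"

import_to_hdfs() {
  # Same source query, but written to HDFS files instead of HBase.
  $SQOOP import \
    --connect "jdbc:db2://{hostname}:50001/databasename" \
    --username user --password pass \
    --query "select a,b,c,d,e from table_name where \$CONDITIONS" \
    --target-dir "$TARGET_DIR" \
    --delete-target-dir \
    -m 1
}

import_to_hdfs
```

If this run finishes quickly, the time is probably being spent on the HBase side of the import; if it is just as slow, the DB2 read or the job startup is the more likely place to look.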
