
I am trying to import data from an Oracle database based on a query. The query includes joins on multiple tables, and the import is taking a very long time. Is there anything I can do to optimize this?

sqoop import \
--connect jdbc:oracle:thin://@localhost/DB \
--username hybb \
--password **** \
--query "SELECT l4.id , TRUNC(timestamp) dt, TO_CHAR(timestamp,'HH24') hour, flag , type , l2.on_off_det , CASE WHEN flag = 1 THEN fro WHEN flag = 0 THEN to END AS CRP, SUM(CASE WHEN flag = 0 THEN dur WHEN flag = 1 THEN ROUND(CNT,0) ELSE 0 END) csum, COUNT(l4.id) ccnt FROM admin.t1 l2, admin.t2 l3 , admin.t3 l4, admin.t4 l5 WHERE l4.rtk=l5.rtk AND l3.ct NOT IN ('test','test1','test2') AND l4.on_off_key=l2.on_off_key AND l4.id_KEY=l3.id_KEY AND l4.startdate between 20161111 AND 20161230 AND l2.on_off_det IN ('Off','OffInt','On') and type IN ('s','v','m') AND \$CONDITIONS GROUP BY l4.id, timestamp,flag,type,l2.on_off_det, CASE WHEN flag = 1 THEN fro WHEN flag = 0 THEN to END " \
-m 1 \
--target-dir /tmp/output

Nash Carp
sandy kay
  • Without the code it is hard to say what is going on. Provide a snippet. See this: https://stackoverflow.com/help/mcve – coelhudo Jan 16 '18 at 07:39

1 Answer


You can definitely improve the performance, but you will have to test a few options to find out which one works best for your case. I have two suggestions.

First One >>> Increase the number of mappers by adding a split-by key.

What is the split-by clause? From Stack Overflow:

It is used to specify the column of the table used to generate splits for imports. This means it specifies which column will be used to create the splits while importing the data into your cluster. It can be used to enhance the import performance by achieving greater parallelism. Sqoop creates splits based on the values in the column the user specifies with --split-by in the import command. If it is not provided, the primary key of the input table is used to create the splits.

Which type of column should be used for splitting? Sometimes the primary key doesn't have an even distribution of values between its min and max (which are used to create the splits when --split-by is not given). In such a situation, specify another column that has a more uniform distribution of data, so the splits are balanced and the import is efficient.
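As a rough illustration (a sketch, not Sqoop's actual code), split boundaries for a numeric --split-by column come from its MIN and MAX divided evenly across the mappers. Using the startdate bounds from the question (20161111 to 20161230, treated here as plain integers) and 4 mappers as an assumed example:

```shell
# Sketch of how Sqoop derives split ranges for a numeric --split-by column.
# In reality Sqoop runs SELECT MIN(col), MAX(col) against the table first.
min=20161111
max=20161230
mappers=4

# Ceiling division so the final range is never dropped
step=$(( (max - min + mappers - 1) / mappers ))

lo=$min
for i in $(seq 1 $mappers); do
  hi=$(( lo + step ))
  [ "$hi" -gt "$max" ] && hi=$max
  echo "mapper $i: WHERE col >= $lo AND col < $hi"
  lo=$hi
done
```

Each mapper then runs the query with its own range substituted for \$CONDITIONS. This is why a skewed column hurts: some mappers get far more rows than others and the slowest one dominates the runtime.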

sqoop import \
--connect jdbc:oracle:thin://@localhost/DB \
--username hybb \
--password **** \
--split-by <column name> \
--query "SELECT l4.id , TRUNC(timestamp) dt, TO_CHAR(timestamp,'HH24') hour, flag , type , l2.on_off_det , CASE WHEN flag = 1 THEN fro WHEN flag = 0 THEN to END AS CRP, SUM(CASE WHEN flag = 0 THEN dur WHEN flag = 1 THEN ROUND(CNT,0) ELSE 0 END) csum, COUNT(l4.id) ccnt FROM admin.t1 l2, admin.t2 l3 , admin.t3 l4, admin.t4 l5 WHERE l4.rtk=l5.rtk AND l3.ct NOT IN ('test','test1','test2') AND l4.on_off_key=l2.on_off_key AND l4.id_KEY=l3.id_KEY AND l4.startdate between 20161111 AND 20161230 AND l2.on_off_det IN ('Off','OffInt','On') and type IN ('s','v','m') AND \$CONDITIONS GROUP BY l4.id, timestamp,flag,type,l2.on_off_det, CASE WHEN flag = 1 THEN fro WHEN flag = 0 THEN to END " \
-m <increase number based on your cluster> \
--target-dir /tmp/output

Refer to this Hortonworks Community page for effective usage of mappers.

Second One >>>

Sqoop has a direct option (--direct); Oracle also supports it now via a direct connector, but I'm not sure whether it works with a complex free-form query like yours. Refer to this link to learn more.
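For example (a sketch only: the table name, mapper count, and target directory are assumptions, and direct mode generally applies to whole-table imports via --table rather than free-form --query, so test whether it fits your case):

```shell
sqoop import \
--connect jdbc:oracle:thin://@localhost/DB \
--username hybb \
--password **** \
--direct \
--table ADMIN.T3 \
--split-by <column name> \
-m 4 \
--target-dir /tmp/output_direct
```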

roh