What are the following commands in Sqoop?

Question

Can anyone tell me what --split-by and the boundary query are used for in Sqoop?

sqoop import --connect jdbc:mysql://localhost/my --username user --password 1234 --query 'select * from table where id=5 AND $CONDITIONS' --split-by table.id --target-dir /dir

score 44 · Accepted Answer · answered Jul 30 '13 at 08:35

--split-by : Specifies the column Sqoop uses to generate splits for an import, that is, the column whose values decide how the data is divided among the parallel map tasks while it is imported into your cluster. Choosing a good column can improve import performance by allowing greater parallelism. If --split-by is not specified, the primary key of the input table is used to create the splits.

Reason to use : Sometimes the primary key's values are not evenly distributed between its min and max (the two values used to create the splits when --split-by is not specified). In that situation you can specify another column whose values are more evenly distributed, so that the splits are balanced and the import is efficient.
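To see why an uneven key hurts, here is a minimal shell sketch (not Sqoop's actual code) that cuts a key range into equal-width splits and counts how many rows land in each one. The function name count_rows_per_split and the sample key values are made up for illustration; the only assumption is that splits are equal-width ranges between min and max, as described above.

```shell
# count_rows_per_split MIN MAX NUM_SPLITS
# Reads one key value per line on stdin and prints the row count per split.
# Illustration only: assumes equal-width splits and MAX > MIN.
count_rows_per_split() {
  awk -v lo="$1" -v hi="$2" -v n="$3" '
    BEGIN { width = (hi - lo) / n }
    {
      i = int(($1 - lo) / width)
      if (i >= n) i = n - 1        # the max value belongs to the last split
      count[i]++
    }
    END { for (i = 0; i < n; i++) printf "split %d: %d rows\n", i, count[i] + 0 }
  '
}

# Skewed keys: 1..99 plus a single outlier at 1000, split 4 ways.
awk 'BEGIN { for (i = 1; i <= 99; i++) print i; print 1000 }' |
  count_rows_per_split 1 1000 4
# split 0: 99 rows
# split 1: 0 rows
# split 2: 0 rows
# split 3: 1 rows
```

One mapper would get almost all of the rows here, so the import is barely parallel. A column with evenly spread values gives each split a similar share.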

--boundary-query : By default Sqoop uses the query select min(<split-by>), max(<split-by>) from <table name> to find the boundaries for creating splits. In some cases this query is not optimal, so you can specify any arbitrary query that returns two numeric columns using the --boundary-query argument.

Reason to use : If --split-by alone is not giving you optimal performance, you can use this to improve it further.
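As a rough sketch of how the argument is passed, the snippet below only prints an import command so it can be reviewed before it is run. The table name orders, the column order_id, the boundary query itself and the target directory are hypothetical; the connection details are copied from the question.

```shell
# boundary_import_cmd: print (do not run) an import that supplies its own
# boundary query. Hypothetical names: table "orders", column "order_id".
# The boundary query assumes ids start at 1, so only MAX() has to be computed.
boundary_import_cmd() {
  cat <<'EOF'
sqoop import \
  --connect jdbc:mysql://localhost/my \
  --username user --password 1234 \
  --table orders \
  --split-by order_id \
  --boundary-query 'SELECT 1, MAX(order_id) FROM orders' \
  --num-mappers 4 \
  --target-dir /dir/orders
EOF
}

boundary_import_cmd               # print the command for review
# eval "$(boundary_import_cmd)"   # uncomment to run it against a real cluster
```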

score 23 · Answer 2 · answered Jul 30 '13 at 13:31

--split-by is used to distribute the values from the table across the mappers uniformly. For example, if you have 100 unique records (primary key values) and 4 mappers, --split-by (primary key column) helps distribute your data set evenly among the mappers.

$CONDITIONS is a placeholder used by Sqoop: internally, Sqoop replaces it with a unique condition expression for each task, so that each task fetches its own part of the data set. If you run a parallel import, the map tasks execute your query with different values substituted for $CONDITIONS. For example, one mapper may execute "select bla from foo WHERE (id >= 0 AND id < 10000)", the next mapper may execute "select bla from foo WHERE (id >= 10000 AND id < 20000)", and so on.
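The substitution can be imitated in a few lines of shell. This is only a sketch of the idea, not what Sqoop runs internally; fill_conditions is a made-up helper name, and the query and ranges are the ones from the example above.

```shell
# fill_conditions QUERY CONDITION
# Replace the literal $CONDITIONS token in QUERY with CONDITION.
# awk index/substr does a plain-text replacement, so no regex escaping is needed.
fill_conditions() {
  awk -v q="$1" -v c="$2" 'BEGIN {
    token = "$CONDITIONS"
    i = index(q, token)
    if (i == 0) { print q; exit }          # no placeholder: leave the query as is
    print substr(q, 1, i - 1) c substr(q, i + length(token))
  }'
}

query='select bla from foo WHERE $CONDITIONS'
fill_conditions "$query" '(id >= 0 AND id < 10000)'
# select bla from foo WHERE (id >= 0 AND id < 10000)
fill_conditions "$query" '(id >= 10000 AND id < 20000)'
# select bla from foo WHERE (id >= 10000 AND id < 20000)
```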

score 14 · Answer 3 · answered Jul 29 '13 at 13:43

Sqoop allows you to import data in parallel, and --split-by and --boundary-query give you more control over how that is done. If you're just importing a table, it will split on the PRIMARY KEY; if you're importing with a more advanced query, you need to specify the column to split on yourself.

For example:

  sqoop import \
    --connect 'jdbc:mysql://.../...' \
    --direct \
    --username uname --password pword \
    --hive-import \
    --hive-table query_import \
    --boundary-query 'SELECT 0, MAX(id) FROM a' \
    --query 'SELECT a.id, a.name, b.id, b.name FROM a, b WHERE a.id = b.id AND $CONDITIONS' \
    --num-mappers 3 \
    --split-by a.id \
    --target-dir /data/import \
    --verbose

--boundary-query lets you specify an optimized query that returns the min and max. Otherwise, Sqoop will attempt to compute MIN(a.id), MAX(a.id) over your --query statement.

With min=0 and max=30, the result is three queries along these lines, run in parallel:

SELECT a.id, a.name, b.id, b.name FROM a, b WHERE a.id = b.id AND a.id BETWEEN 0 AND 10;
SELECT a.id, a.name, b.id, b.name FROM a, b WHERE a.id = b.id AND a.id BETWEEN 11 AND 20;
SELECT a.id, a.name, b.id, b.name FROM a, b WHERE a.id = b.id AND a.id BETWEEN 21 AND 30;

Sorry. I still don't get it. What is --split-by? is it like, something to do with the processing of the command? — NJ_315, Jul 30 '13 at 06:19

score 5 · Answer 4 · answered Nov 06 '18 at 07:09

--split-by :

Why is it used? -> To speed up fetching the data from the RDBMS into Hadoop.
How does it work? -> By default Sqoop uses 4 mappers, so the import runs in parallel. Sqoop takes the primary key column, finds its minimum and maximum values, and divides that range into 4 equal ranges, one per mapper. E.g. with 1000 records in the primary key column, min value = 0 and max value = 1000, Sqoop creates 4 ranges: (0-250), (250-500), (500-750), (750-1000). Each mapper imports the rows whose key falls in its range and stores them on HDFS. If the primary key values are not evenly distributed, you can use --split-by to name a different column that partitions the data more evenly.

In short: Used for partitioning of data to support parallelism and improve performance
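The range arithmetic can be sketched in plain shell. This is an illustration under simple assumptions (integer keys, equal-width ranges, a split column called id), not Sqoop's own splitter, and split_ranges is a hypothetical helper name.

```shell
# split_ranges MIN MAX NUM_MAPPERS
# Print one WHERE condition per mapper. Integer arithmetic only.
# The last range is closed so that MAX itself is included.
split_ranges() {
  min=$1
  max=$2
  n=$3
  width=$(( (max - min) / n ))
  i=0
  lo=$min
  while [ "$i" -lt "$n" ]; do
    if [ "$i" -eq $(( n - 1 )) ]; then
      echo "id >= $lo AND id <= $max"
    else
      hi=$(( lo + width ))
      echo "id >= $lo AND id < $hi"
      lo=$hi
    fi
    i=$(( i + 1 ))
  done
}

split_ranges 0 1000 4
# id >= 0 AND id < 250
# id >= 250 AND id < 500
# id >= 500 AND id < 750
# id >= 750 AND id <= 1000
```

Each printed condition is the kind of expression that would stand in for $CONDITIONS in one mapper's query.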

score 2 · Answer 5 · edited Aug 17 '15 at 08:03


Also, if we specify the --query value within double quotes (" "), we need to precede $CONDITIONS with a backslash (\) so that the shell does not expand it:

--query "select * from table where id=5 AND \$CONDITIONS"

or else use single quotes:

--query 'select * from table where id=5 AND $CONDITIONS'
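The difference is easy to check in the shell itself, without Sqoop. The variable names below are only for this demonstration.

```shell
CONDITIONS=''   # an unset variable expands the same way: to nothing

single='select * from table where id=5 AND $CONDITIONS'
escaped="select * from table where id=5 AND \$CONDITIONS"
unescaped="select * from table where id=5 AND $CONDITIONS"

printf '%s\n' "$single"      # the literal $CONDITIONS survives
printf '%s\n' "$escaped"     # the literal $CONDITIONS survives
printf '%s\n' "$unescaped"   # the shell has already replaced it with an empty string
```

In the third case the placeholder is gone before Sqoop ever sees the query.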

answered Aug 17 '15 at 07:40 by jatin bhola · edited Aug 17 '15 at 08:03 by Arulkumar