2

I would like to import all tables with parallelism from a schema in MySQL. Some tables in the schema have a primary key and some do not. If we set the number of mappers greater than one (-m > 1), the tables with no primary key fail.

How can we import all tables with parallelism?
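For reference, a minimal sketch of the kind of command being described (the connection string, schema, and credentials are placeholders):

    sqoop import-all-tables \
        --connect jdbc:mysql://db_host/my_schema \
        --username sqoop_user -P \
        --warehouse-dir /user/hive/warehouse/my_schema \
        -m 4
    # tables that have a primary key import in parallel;
    # a table without one makes the job fail at that table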

Lavanya varma
  • 75
  • 1
  • 9

2 Answers

1

If you have an integer column that is not a PK but is evenly distributed, you can use it as a split-by column; in that case skip everything I have written here and read this: Sqoop import : composite primary key and textual primary key

If you have some column combination that can deterministically define the order of rows (a composite PK candidate), you can use row_number; see https://stackoverflow.com/a/67910296/2700344

Can a non-deterministic row_number be used? Sqoop will run a query to determine the min and max and derive ranges, and then the same query will be executed in different mappers with filters to select each range. If row_number is not deterministic, the same row may appear in different ranges, so the answer is no: it should be deterministic.
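A minimal sketch of that row_number approach, assuming MySQL 8+ for ROW_NUMBER(); the table, columns, and connection details are placeholders:

    sqoop import \
        --connect jdbc:mysql://db_host/my_schema \
        --username sqoop_user -P \
        --query 'SELECT t.*, ROW_NUMBER() OVER (ORDER BY order_date, order_no) AS rn
                 FROM orders t WHERE $CONDITIONS' \
        --split-by rn \
        --target-dir /user/sqoop/orders \
        -m 4
    # the ORDER BY must be on a deterministic column combination,
    # otherwise the same row can land in more than one mapper's range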

If you do not have such a composite key, you can probably use some other means, like ROWID in Oracle, for the same purpose (only if the table is static within the scope of your import job, because ROWID can change, for example if some rows are deleted and re-inserted or the table is reloaded). If it is MySQL, then unfortunately you have no such column. The only solution is to add an auto-increment surrogate primary key to your table and populate it.
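A sketch of that last option, assuming a hypothetical orders table (MySQL backfills the new AUTO_INCREMENT values for existing rows):

    mysql -h db_host -u admin -p my_schema -e \
        "ALTER TABLE orders ADD COLUMN surrogate_id BIGINT AUTO_INCREMENT PRIMARY KEY;"
    # the new column can then be used with --split-by surrogate_id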

leftjoin
  • 36,950
  • 8
  • 57
  • 116
1

I completely agree with @leftjoin's answer on the PK.
There is another way to achieve parallelism that you can use with an RDBMS: --split-by <column> -m <no_of_mappers>.
This tells Sqoop which column to use to create the splits while importing the data into HDFS. The column need not be unique and can be any column on which the data is somewhat evenly distributed, like org_id or department_id (INT). Sqoop will fire MAX(col) and MIN(col) queries and then split the data into ranges based on these values, with each mapper importing one range of the source data. As you can imagine, calculating the max and min can take a few extra minutes if the table is huge.
Example: for a 1000-row table, --split-by dept_id -m 4.
Let's say we have 6 departments; Sqoop will split the data into 4 parts - dept 1, dept 2 and 3, dept 4, dept 5 and 6. Please note: this is just an example; the actual distribution can be different.
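A sketch of a single-table import along those lines (the table, column, and connection details are placeholders):

    sqoop import \
        --connect jdbc:mysql://db_host/my_schema \
        --username sqoop_user -P \
        --table employees \
        --split-by dept_id \
        --target-dir /user/sqoop/employees \
        -m 4
    # Sqoop runs SELECT MIN(dept_id), MAX(dept_id) on the table and
    # divides that range into 4 splits, one per mapper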

Koushik Roy
  • 6,868
  • 2
  • 12
  • 33
  • Yes, I agree with your answer. Of course, if you have an evenly distributed column and do not need to generate a row_number because it is an int, your recommendation is fine. What will happen if it is not evenly distributed is skew between mappers, and parallelism will not work; details here: https://stackoverflow.com/a/37389134/2700344 – leftjoin Jun 11 '21 at 08:51
  • 1
    Awesome answer in the link you shared, thank you. Yes, finding an INT column on which the data is evenly distributed can be challenging. Sometimes INT PK columns can have gaps too; the sequence can be like 10, 11, 12, 256, 257, 258 because of some process issue. – Koushik Roy Jun 11 '21 at 09:32