I would like to run import-all-tables with parallelism on a schema in MySQL. Some tables in the schema have a primary key and some do not. If we set the number of mappers above one (-m > 1), the tables without a primary key fail.
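For example, something like this fails for the tables without a primary key (the connection details here are placeholders):

    sqoop import-all-tables \
      --connect jdbc:mysql://dbhost/mydb \
      --username sqoop_user -P \
      -m 4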
How can we run import-all-tables with parallelism?
If you have an integer column that is not a PK but is evenly distributed, you can use it as the split-by column; skip everything I have written here and read this: Sqoop import : composite primary key and textual primary key
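For example, a minimal sketch (the connection details, my_table, and some_int_col are placeholders):

    sqoop import \
      --connect jdbc:mysql://dbhost/mydb \
      --username sqoop_user -P \
      --table my_table \
      --split-by some_int_col \
      -m 4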
If you have some column combination that can deterministically define the order of rows (a composite PK candidate), you can use row_number; see https://stackoverflow.com/a/67910296/2700344 and the sketch below.
Can a non-deterministic row_number be used? Sqoop will run a query to determine the min and max of the split column to get the ranges, then the same query will be executed in different mappers with filters to select each range. If row_number is not deterministic, the same row may appear in different ranges, so the answer is no: it must be deterministic.
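A minimal sketch of the row_number approach, assuming MySQL 8+ (for window functions) and a hypothetical events table whose rows are deterministically ordered by (event_date, event_id); the subquery is needed so that sqoop's $CONDITIONS filter can reference the computed rn column:

    sqoop import \
      --connect jdbc:mysql://dbhost/mydb \
      --username sqoop_user -P \
      --query 'SELECT * FROM (SELECT e.*, ROW_NUMBER() OVER (ORDER BY e.event_date, e.event_id) AS rn FROM events e) t WHERE $CONDITIONS' \
      --split-by rn \
      --target-dir /data/events \
      -m 4

Sqoop wraps the query to compute MIN(rn) and MAX(rn), then each mapper re-runs it with a range filter on rn, which is why determinism matters.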
If you do not have such a composite key then you can probably use some other means, like ROWID in Oracle for the same purpose (only if the table is static in the scope of your import job, because ROWID can change, for example if some rows are deleted and re-inserted, or the table is reloaded). If it is MySQL, then unfortunately you have no such column. The only solution is to add an auto-increment surrogate primary key to your table and populate it.
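In MySQL that could look like this (my_table and surrogate_id are placeholders):

    -- Works only if the table has no PK yet; existing rows are numbered automatically
    ALTER TABLE my_table
      ADD COLUMN surrogate_id BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY;

After that, the table can be imported in parallel like any table with an integer PK.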
I agree completely with @leftjoin's answer on the PK.
There is another way to achieve parallelism that you can use with any RDBMS.
You can use --split-by <<column>> -m <no_of_mappers>.
This tells sqoop which column to use to create the splits while importing the data into HDFS. The column need not be unique; it can be any column over which the data is somewhat evenly distributed, such as org_id or department_id (INT). Sqoop will fire a MIN(col) and MAX(col) query, split the value range based on these results, and give each mapper one split of the source data to import. Bear in mind that this can add a few minutes for a huge table, since the max and min must be calculated first.
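A sketch of such an import (the connection details and the employees table are placeholders); before launching the mappers, sqoop fires something like SELECT MIN(department_id), MAX(department_id) FROM employees to work out the split boundaries:

    sqoop import \
      --connect jdbc:mysql://dbhost/mydb \
      --username sqoop_user -P \
      --table employees \
      --split-by department_id \
      -m 4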
For example, for a 1000-row table: --split-by dept_id -m 4.

Let's say we have 6 departments; sqoop will split the data into 4 parts: dept 1; depts 2 and 3; dept 4; depts 5 and 6. Please note: this is just an example; the actual distribution can be different.
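Concretely, each mapper imports one range of the split column with its own filter; for the example above, the four mapper queries would look roughly like this (the emp table name and the exact boundaries are illustrative, sqoop derives the ranges from MIN/MAX):

    SELECT * FROM emp WHERE dept_id >= 1 AND dept_id < 2   -- mapper 1: dept 1
    SELECT * FROM emp WHERE dept_id >= 2 AND dept_id < 4   -- mapper 2: depts 2 and 3
    SELECT * FROM emp WHERE dept_id >= 4 AND dept_id < 5   -- mapper 3: dept 4
    SELECT * FROM emp WHERE dept_id >= 5 AND dept_id <= 6  -- mapper 4: depts 5 and 6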