
One of my Oracle tables contains 265 million records. I need to push that table from the Oracle database to HDFS, but it doesn't have any primary key or unique column, so I can't use multiple mappers: with multiple mappers I would have to specify a split-by column. What's the best way to sqoop this table? Any leads are appreciated.
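For reference, a single-mapper import needs no split column at all, at the cost of zero parallelism. A minimal sketch, where the JDBC URL, credentials, paths, and the table name BIG_TABLE are all placeholders:

```
# Single-mapper import: no --split-by needed, but the whole table
# is read by one task. Connection details below are placeholders.
sqoop import \
  --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
  --username scott \
  --password-file /user/scott/.sqoop-password \
  --table BIG_TABLE \
  --target-dir /data/big_table \
  -m 1
```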

  • Check this out: https://stackoverflow.com/questions/17923420/what-are-the-following-commands-in-sqoop/17942067 Also refer to the Sqoop documentation for further info. You can try using a column with evenly distributed data in the split-by clause. – yammanuruarun Jan 16 '20 at 05:43

1 Answer


In order to use multiple mappers, you will need a --split-by parameter. The best column to choose is one that is non-null in all 265 million rows and evenly distributed. A primary key meets those criteria because it is sequential and present in every row.
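As a sketch, assuming a hypothetical evenly distributed numeric column CUSTOMER_ID (connection details are placeholders as before):

```
# Parallel import: Sqoop finds MIN/MAX of the split column and divides
# that range evenly across the mappers.
sqoop import \
  --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
  --username scott \
  --password-file /user/scott/.sqoop-password \
  --table BIG_TABLE \
  --split-by CUSTOMER_ID \
  --target-dir /data/big_table \
  -m 4
```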

Any column that is evenly distributed across the data set could be a good --split-by choice. The link @yammanuruarun posted includes the --boundary-query argument, which limits the work the RDBMS has to do to compute the split boundaries. I suggest stepping the mapper count through a Fibonacci sequence (-m 1, 2, 3, 5, 8) and measuring which performs best; see the sketch below.
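A sketch of pairing --boundary-query with an explicit mapper count; the MIN/MAX statement and the column name are illustrative:

```
# --boundary-query overrides Sqoop's default "SELECT MIN(col), MAX(col)"
# boundary scan, which can be expensive on a 265M-row table.
sqoop import \
  --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
  --username scott \
  --password-file /user/scott/.sqoop-password \
  --table BIG_TABLE \
  --split-by CUSTOMER_ID \
  --boundary-query "SELECT MIN(CUSTOMER_ID), MAX(CUSTOMER_ID) FROM BIG_TABLE" \
  --target-dir /data/big_table \
  -m 8
```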

Also, check out: How to find optimal number of mappers when running Sqoop import and export?

Chris Marotta