
I need to move a large amount of data from an Oracle Database to Hadoop without connecting the two systems. Is it possible to export data from Oracle via Sqoop directly to the local filesystem, without importing into HDFS? I'd like to export to ORC and then just move the files to the Hadoop cluster via external disks.

Shawn

1 Answer


You cannot use Sqoop in your case. Sqoop ("SQL to Hadoop") runs inside the Hadoop cluster and, by default, uses JDBC to connect to the DB (as I explain in this answer, you can change that with the --direct option). If the Hadoop nodes cannot connect to the DB server, then you cannot use it.

ORC is a format tied very closely to Hive; you would need to find out how to use the Hive libraries to create ORC files outside a Hadoop cluster, if that is possible at all.

Given your constraints, I would suggest exporting the DB with its own dump capabilities into a CSV file, compressing the file, and then copying it into HDFS.
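As a minimal sketch of the export step using SQLcl (the Oracle client mentioned in the comments below); the table name, output path and session settings here are placeholders, not something from the original setup:

```sql
-- Hypothetical SQLcl session; table and path are placeholders.
-- "set sqlformat csv" makes SQLcl emit quoted/escaped CSV rows.
SET SQLFORMAT csv
SET FEEDBACK OFF
SPOOL /staging/orders_export.csv
SELECT * FROM orders;
SPOOL OFF
```

After compressing the file (e.g. with gzip) you can carry it to the cluster and push it into HDFS with `hdfs dfs -put`.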

If you are planning to use Hive, then you can LOAD the text file into a table configured to store the data using ORC.
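A rough HiveQL sketch of that last step, assuming the exported file has already been copied into HDFS under /staging/ and that the table and column names are made up for illustration:

```sql
-- Staging table that matches the delimited text file sitting in HDFS.
CREATE TABLE orders_staging (id BIGINT, customer STRING, amount DOUBLE)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  STORED AS TEXTFILE;

-- Move the file under the staging table (Hive also reads gzip-compressed text transparently).
LOAD DATA INPATH '/staging/orders_export.csv' INTO TABLE orders_staging;

-- Rewrite the rows into an ORC-backed table.
CREATE TABLE orders_orc STORED AS ORC AS
SELECT * FROM orders_staging;
```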

RojoSam
  • I agree with the suggestion to avoid ORC -- CSV, TSV or another common neutral format that your Oracle database can produce for the intermediate files will simplify the eventual load into Hadoop, at which point you can use any suitable Hadoop format. – Tom Harrison Jun 24 '16 at 02:00
  • The only problem with CSV formats is there may be line feeds in the data, and we're trying to preserve the data exactly as it exists. I may be stuck with using XML, but I really don't want to due to its size. – Shawn Jun 24 '16 at 12:39
  • You can use a different record delimiter in your CSV, like "|", "@@", "^A" or whatever, and the same for the column delimiter. The problem is that not all the components in Hadoop and DB dump tools let you specify these parameters, and you may have to implement your own writer/reader. In Hive you can specify both parameters when you create the table (see the sketch after these comments). – RojoSam Jun 24 '16 at 14:00
  • Found out that SQLcl, a new client from Oracle, will let me export to CSV and handles all the escaping. I didn't realize it, but it looks like you can have line feeds and such as long as everything is double quoted. – Shawn Jun 24 '16 at 16:42
  • The export is the easy part. The problem will be when you try to use that data in Hadoop. Hadoop doesn't really know about CSV conventions; to Hadoop it is just a text file and it will read it line by line, and the line ends at the first line feed. I don't think Hadoop will handle double-quoted text. – RojoSam Jun 24 '16 at 16:58
  • The CSV SerDe included with recent versions of Hive looks like it handles the RFC CSV format, including line feeds, so I should be good. I'm setting up some test cases to see how it works. – Shawn Jun 27 '16 at 12:20
  • Well, that's a bust; it looks like the patches to support that weren't added when the project was merged into Hive. I may have to look into writing my own input format, because one of my key requirements is that we can't modify the input data in any way. – Shawn Jun 27 '16 at 12:28
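For reference, a minimal HiveQL sketch of the two options discussed in these comments: a custom field delimiter, and the OpenCSVSerde shipped with Hive 0.14+. Table and column names are placeholders, and, as noted above, the SerDe still reads one record per physical line, so embedded line feeds remain a problem:

```sql
-- Option 1: plain delimited text with a custom field separator ('|').
CREATE TABLE export_pipe (id BIGINT, name STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
  STORED AS TEXTFILE;

-- Option 2: OpenCSVSerde handles quoting and escaping, but treats every
-- column as STRING and still splits records on newlines.
CREATE TABLE export_csv (id STRING, name STRING)
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
  WITH SERDEPROPERTIES ("separatorChar" = ",", "quoteChar" = "\"")
  STORED AS TEXTFILE;
```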