6

I am trying to migrate some Oracle DB tables to the cloud (Snowflake), and I am wondering what the best way is to create .csv files from the tables.

I have around 200 tables, and some tables have over 30M records. I want to bulk-load the data.

Jorge Flores
    How do you want to "bulk" data using CSV? Are you sure that there is no better way than using a raw file? – Nico Haase Jan 23 '19 at 21:10
  • 1
    See if [this answer](https://stackoverflow.com/questions/643137/how-do-i-spool-to-a-csv-formatted-file-using-sqlplus) helps – mustaccio Jan 23 '19 at 21:41
  • In my opinion the best way would be to create a stored procedure. You can use the UTL_FILE package for this task. Please review this topic https://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:88212348059 – fuNcti0n Jan 23 '19 at 20:32
  • I did create a stored procedure with this method, but the problem is the performance. I have tables with millions of records, so it is not really efficient for that. I cannot create a C procedure either. – Jorge Flores Jan 23 '19 at 20:51
  • The thing is that I am migrating my data from an on-premises Oracle DB to a cloud DWH (Snowflake). I find CSV useful since Snowflake can take these files and parse them into a table, but the problem is the amount of data; it is not that huge (a couple of tables with 30M records and others with 20M), yet it is taking forever to generate the CSVs with UTL_FILE @NicoHaase – Jorge Flores Jan 24 '19 at 15:32
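
For reference, the Snowflake side mentioned in the comments (having Snowflake take the CSV files and parse them into a table) can look roughly like the following. This is a minimal sketch assuming the snowflake-connector-python package; the account, credentials, table name and file path are placeholders, not from the original post.

import snowflake.connector

# Sketch: stage a generated CSV file and load it into an existing Snowflake table.
# Account/credentials, table name and file path are placeholders (assumptions).
con = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="my_password",
    warehouse="my_wh",
    database="my_db",
    schema="public",
)
cur = con.cursor()

# PUT uploads (and by default gzip-compresses) the local file to the table's stage
cur.execute("PUT file:///tmp/output.csv @%MY_TABLE AUTO_COMPRESS=TRUE")

# COPY INTO parses the staged file into the table
cur.execute("""
    COPY INTO MY_TABLE
    FROM @%MY_TABLE
    FILE_FORMAT = (TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY = '"' SKIP_HEADER = 1)
""")
con.close()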

2 Answers

4

So I was in a scenario where I needed to quickly get CSV exports from a 300 GB Oracle DB and store them in S3 for Spark/Hive analysis. Spool is super slow, SQL Developer is super slow. OK, what next?

https://github.com/hyee/OpenCSV

Super fast. Here is an example of how to use it; you need the Oracle JDBC driver jar on your classpath:

package com.company;

import com.opencsv.CSVWriter;
import com.opencsv.ResultSetHelperService;

import java.sql.*;

public class Main {

    public static void main(String[] args) throws Exception {

        // Step 1: load the Oracle JDBC driver class
        Class.forName("oracle.jdbc.driver.OracleDriver");

        // Step 2: create the connection object
        Connection con = DriverManager.getConnection(
                "jdbc:oracle:thin:@host:port:service_name",
                "ora_user", "password");

        // Step 3: create the statement object
        Statement stmt = con.createStatement();

        // Step 4: execute the query
        ResultSet rs = stmt.executeQuery("select c1, c2, c3 from my_table");

        // Step 5: write the result set to CSV, then close the connection
        String fileName = "C:\\Temp\\output.csv";
        boolean async = true;

        try (CSVWriter writer = new CSVWriter(fileName)) {
            // Fetch size (default 30000 rows); higher is faster but uses more memory
            ResultSetHelperService.RESULT_FETCH_SIZE = 50000;
            // Maximum rows to extract; -1 means unlimited
            ResultSetHelperService.MAX_FETCH_ROWS = -1;
            writer.setAsyncMode(async);
            int result = writer.writeAll(rs, true);
            System.out.println("Result: " + (result - 1));
        }
        con.close();
    }

    // Extract a ResultSet to a CSV file; auto-compresses if the file name ends in ".zip" or ".gz".
    // Returns the number of records extracted.
    public static int ResultSet2CSV(final ResultSet rs, final String fileName, final String header, final boolean async) throws Exception {
        try (CSVWriter writer = new CSVWriter(fileName)) {
            // Fetch size (default 30000 rows); higher is faster but uses more memory
            ResultSetHelperService.RESULT_FETCH_SIZE = 10000;
            // Maximum rows to extract; -1 means unlimited
            ResultSetHelperService.MAX_FETCH_ROWS = 20000;
            writer.setAsyncMode(async);
            int result = writer.writeAll(rs, true);
            return result - 1;
        }
    }
}

Another fast solution, though I think it is slower than the above, is to use Spark directly:

query = "(select empno,ename,dname from emp, dept where emp.deptno = dept.deptno) emp"
empDF = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:oracle:thin:username/password@//hostname:portnumber/SID") \
    .option("dbtable", query) \
    .option("user", "db_user_name") \
    .option("password", "password") \
    .option("driver", "oracle.jdbc.driver.OracleDriver") \
    .load()
empDF.printSchema()
empDF.show()

# Write to S3 (pick a format: orc, parquet, or gzip-compressed csv)
empDF.write.format("csv").option("compression", "gzip").save("s3://bucketname/key/")

Sure you can repartition, and do some other optimizations.
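
For example, a rough sketch of those optimizations: reading the table in parallel with Spark's JDBC partitioning options and repartitioning before the write. The partition column, bounds and partition counts below are illustrative assumptions, not values from the original query.

# Sketch: parallel JDBC read + repartition before writing out.
# partitionColumn/lowerBound/upperBound/numPartitions are assumptions for illustration.
empDF = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:oracle:thin:@//hostname:portnumber/SID") \
    .option("dbtable", "emp") \
    .option("user", "db_user_name") \
    .option("password", "password") \
    .option("driver", "oracle.jdbc.driver.OracleDriver") \
    .option("partitionColumn", "EMPNO") \
    .option("lowerBound", "1") \
    .option("upperBound", "30000000") \
    .option("numPartitions", "16") \
    .option("fetchsize", "10000") \
    .load()

# Control the number/size of output files before writing
empDF.repartition(16).write.format("csv") \
    .option("compression", "gzip") \
    .option("header", "true") \
    .save("s3://bucketname/key/")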

kensai
  • Thanks, right now I'm using PL/SQL (UTL_FILE) + parallel execution. It solved some of my problems but it is still not really fast. I'll check out your solution, looking forward to it. – Jorge Flores Jun 10 '19 at 13:17
  • And actually my staging destination is an S3 bucket, but I'm writing my files first to the file system (Oracle host) and then using AWS Storage Gateway to put them in S3. – Jorge Flores Jun 10 '19 at 13:18
  • I am just about to test the PySpark solution; will report results against OpenCSV – kensai Jun 11 '19 at 08:07
  • @JorgeFlores - OpenCSV is deadly fast, but it would require integrating S3 drivers to be able to store the CSV directly in S3 for the fastest possible pipeline (a small upload sketch follows these comments) - https://github.com/hyee/OpenCSV/issues/4 – kensai Jun 11 '19 at 08:08
  • Do you have any example of using OpenCSV? – Jorge Flores Jan 10 '20 at 18:17
  • Aha, you mean to store in S3? Sure, you need to add the S3 drivers..., but that is not a big deal, right? – kensai Jan 15 '20 at 14:18
  • This was a genius answer, thank you. – 2ps Dec 30 '21 at 09:31
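
Regarding storing the CSVs directly in S3, a minimal upload sketch with boto3 instead of a storage gateway; the bucket, key prefix and local directory are placeholders, and AWS credentials are assumed to come from the standard environment/profile configuration.

import os
import boto3

# Sketch: push locally generated CSVs to S3; names below are assumptions.
s3 = boto3.client("s3")
local_dir = "/tmp/csv_exports"
bucket = "bucketname"
prefix = "oracle-export"

for name in os.listdir(local_dir):
    if name.endswith(".csv") or name.endswith(".csv.gz"):
        s3.upload_file(os.path.join(local_dir, name), bucket, f"{prefix}/{name}")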
2

1) Depending on the raw file size, the better approach is to use the raw DB files (the database storage files themselves).

2) If that is not an option, use a shell script (relatively faster than UTL_FILE). Sample script below; a sketch for generating the per-table spool commands follows this list.

export ORACLE_HOME=<Oracle home path>
export PATH=$PATH:$ORACLE_HOME/bin
sqlplus -s user1/user1 <<!
set heading off
set feedback off
set wrap off
set colsep ,
set linesize 200
set trims on
spool /tmp/<Yourfilename>.csv
select 'col1','col2' from dual;
select col1,col2 from tab1;
...
spool off
exit
!

3) Use UTL_FILE for the file creation, as suggested by one of the earlier commenters: https://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:88212348059
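
Since the question involves roughly 200 tables, the per-table spool commands above can be generated rather than written by hand. A rough sketch, assuming cx_Oracle is available and that every table owned by the user should be exported; the connection string and output paths are placeholders.

import cx_Oracle

# Sketch: generate the per-table spool commands for the SQL*Plus script above.
# Connection string, output paths and the "export everything" assumption are illustrative.
con = cx_Oracle.connect("user1/user1@host:1521/service_name")
cur = con.cursor()
cur.execute("select table_name from user_tables order by table_name")

with open("/tmp/export_all.sql", "w") as f:
    for (table_name,) in cur:
        f.write(f"spool /tmp/{table_name}.csv\n")
        f.write(f"select * from {table_name};\n")
        f.write("spool off\n")
con.close()

The generated /tmp/export_all.sql can then be run from the same sqlplus session in place of the "..." above.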

Regards Faisal

M Faisal
  • What do you mean by raw? The purpose of generating CSV is that I will load those files into Snowflake (cloud DWH) – Jorge Flores Jan 24 '19 at 18:25
  • Apologies, I did not see your comment earlier. By RAW I meant taking the database storage files. However, as you mentioned you are loading data into cloud storage, that would not be an option. – M Faisal Feb 12 '19 at 15:12
  • 1
    Spool is slow even with terminal output disabled... not a solution for large data – kensai Jun 09 '19 at 10:09