I am trying to migrate some Oracle DB tables to the cloud (Snowflake), and I am wondering what the best way is to create .csv files from the tables.
I have around 200 tables, and some of them have over 30M records. I want to bulk-export the data.
I was in a similar scenario: I needed to quickly get CSV exports from a 300 GB Oracle DB and store them in S3 for Spark/Hive analysis. Spool is super slow, and SQL Developer is super slow. OK, what next?
https://github.com/hyee/OpenCSV
It is super fast. Here is an example of how to use it; you need the Oracle JDBC driver jar on your classpath:
package com.company;

import com.opencsv.CSVWriter;
import com.opencsv.ResultSetHelperService;

import java.sql.*;

public class Main {

    public static void main(String[] args) throws Exception {
        // Step 1: load the Oracle JDBC driver class
        Class.forName("oracle.jdbc.driver.OracleDriver");

        // Step 2: create the connection object
        Connection con = DriverManager.getConnection(
                "jdbc:oracle:thin:@host:port:service_name",
                "ora_user", "password");

        // Step 3: create the statement object
        Statement stmt = con.createStatement();

        // Step 4: execute the query
        ResultSet rs = stmt.executeQuery("select c1, c2, c3 from my_table");

        // Step 5: write the ResultSet to a CSV file
        String fileName = "C:\\Temp\\output.csv";
        boolean async = true;
        try (CSVWriter writer = new CSVWriter(fileName)) {
            // Fetch size (default 30000 rows); higher is faster but uses more memory
            ResultSetHelperService.RESULT_FETCH_SIZE = 50000;
            // Maximum rows to extract; -1 means unlimited
            ResultSetHelperService.MAX_FETCH_ROWS = -1;
            writer.setAsyncMode(async);
            int result = writer.writeAll(rs, true);
            System.out.println("Result: " + (result - 1));
        }

        // Step 6: close the connection
        con.close();
    }

    // Extract a ResultSet to a CSV file; auto-compresses if the fileName ends in ".zip" or ".gz".
    // Returns the number of records extracted.
    public static int ResultSet2CSV(final ResultSet rs, final String fileName,
                                    final String header, final boolean async) throws Exception {
        try (CSVWriter writer = new CSVWriter(fileName)) {
            // Fetch size (default 30000 rows); higher is faster but uses more memory
            ResultSetHelperService.RESULT_FETCH_SIZE = 10000;
            // Maximum rows to extract; -1 means unlimited
            ResultSetHelperService.MAX_FETCH_ROWS = 20000;
            writer.setAsyncMode(async);
            int result = writer.writeAll(rs, true);
            return result - 1;
        }
    }
}
Another fast solution, though I think it is slower than the one above, is to use Spark directly:
# The query to run on the Oracle side, wrapped as an inline view
query = "(select empno, ename, dname from emp, dept where emp.deptno = dept.deptno) emp"

empDF = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:oracle:thin:@//hostname:portnumber/SID") \
    .option("dbtable", query) \
    .option("user", "db_user_name") \
    .option("password", "password") \
    .option("driver", "oracle.jdbc.driver.OracleDriver") \
    .load()

empDF.printSchema()
empDF.show()

# Write to S3 (pick orc, parquet, or csv as the output format)
empDF.write.format("csv").option("compression", "gzip").save("s3://bucketname/key/")
Of course, you can repartition and apply other optimizations.
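For example, here is a minimal sketch of a parallelized JDBC read followed by a repartition before the write; the partition column, bounds, and partition counts below are illustrative assumptions, not values from the original post:

# Split the JDBC read across partitions on a numeric column (here a hypothetical empno),
# then repartition so S3 receives a manageable number of output files.
empDF = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:oracle:thin:@//hostname:portnumber/SID") \
    .option("dbtable", "emp") \
    .option("user", "db_user_name") \
    .option("password", "password") \
    .option("driver", "oracle.jdbc.driver.OracleDriver") \
    .option("partitionColumn", "empno") \
    .option("lowerBound", "1") \
    .option("upperBound", "1000000") \
    .option("numPartitions", "16") \
    .load()

empDF.repartition(16).write.format("csv").option("compression", "gzip").save("s3://bucketname/key/")

Tune numPartitions against what the source Oracle instance can tolerate; each partition opens its own connection.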
1) Based on the raw file size, the better approach is to use the raw DB file.
2) If that is not an option, use a shell script (relatively faster than utl_file). Sample script below; a sketch for generating one such script per table follows after this list.
export ORACLE_HOME=<Oracle home path>
export PATH=$PATH:$ORACLE_HOME/bin
sqlplus -s user1/user1 <<!
set heading off
set feedback off
set wrap off
set colsep ,
set linesize 200
set trimspool on
spool /tmp/<Yourfilename>.csv
select 'col1','col2' from dual;
select col1,col2 from tab1;
...
spool off
exit
!
3) Use utl_file for the file creation, as suggested by an earlier user: https://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:88212348059
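Since the question mentions around 200 tables, here is a minimal sketch of how the shell-script approach in 2) could be generated per table. Everything in it is an assumption for illustration: the tables.txt list of table names, the spool_<table>.sql output names, and the use of select * (the header-row select from dual in the sample above is omitted because it would need per-table column lists).

# Minimal sketch (all names hypothetical): read a list of table names from
# tables.txt and emit one SQL*Plus spool script per table.
with open("tables.txt") as f:
    tables = [line.strip() for line in f if line.strip()]

settings = (
    "set heading off\n"
    "set feedback off\n"
    "set wrap off\n"
    "set colsep ,\n"
    "set linesize 200\n"
    "set trimspool on\n"
)

for table in tables:
    with open("spool_{}.sql".format(table), "w") as out:
        out.write(settings)
        out.write("spool /tmp/{}.csv\n".format(table))
        # select * is a placeholder; a real export would list the columns explicitly
        out.write("select * from {};\n".format(table))
        out.write("spool off\nexit\n")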
Regards, Faisal