I have a very large table (hundreds of millions of rows, containing numbers and strings) in Oracle, and I need to read the entire table, format it, and write it to a file or some other resource. Generally my solution looks like this:
package my.odp;

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.sql.*;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.TimeUnit;

public class Main {

    public static volatile boolean finished = false;

    public static void main(final String[] args) throws InterruptedException {
        final ArrayBlockingQueue<String> queue = new ArrayBlockingQueue<String>(10000);

        final Thread writeWorker = new Thread("ODP Writer") {
            public void run() {
                try {
                    BufferedWriter writer = new BufferedWriter(new FileWriter(new File(args[0])));
                    try {
                        // Drain until the reader is done AND the queue is empty,
                        // otherwise the tail of the data can be lost.
                        while (!finished || !queue.isEmpty()) {
                            String str = queue.poll(200, TimeUnit.MILLISECONDS);
                            if (str == null) {
                                continue; // poll() already waited, no extra sleep needed
                            }
                            writer.write(str);
                            writer.write('\n');
                        }
                    } finally {
                        writer.close(); // also flushes the buffer on normal exit
                    }
                } catch (Throwable e) {
                    e.printStackTrace();
                }
            }
        };

        final Thread readerThread = new Thread("ODP Reader") {
            public void run() {
                try {
                    Class.forName("oracle.jdbc.OracleDriver");
                    Connection conn = DriverManager.getConnection(
                            "jdbc:oracle:thin:@//xxx.xxx.xxx.xxx:1521/orcl", "user", "pass");
                    Statement stmt = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY,
                            ResultSet.CONCUR_READ_ONLY);
                    stmt.setFetchSize(500000);
                    ResultSet rs = stmt.executeQuery("select * from src_schema.big_table_view");
                    System.out.println("Fetching result");
                    while (rs.next()) {
                        StringBuilder sb = new StringBuilder();
                        sb.append(rs.getString(1)).append('\t');  // OWNER
                        sb.append(rs.getString(2)).append('\t');  // OBJECT_NAME
                        sb.append(rs.getString(3)).append('\t');  // SUBOBJECT_NAME
                        sb.append(rs.getLong(4)).append('\t');    // OBJECT_ID
                        sb.append(rs.getLong(5)).append('\t');    // DATA_OBJECT_ID
                        sb.append(rs.getString(6)).append('\t');  // OBJECT_TYPE
                        sb.append(rs.getString(7)).append('\t');  // CREATED
                        sb.append(rs.getString(8)).append('\t');  // LAST_DDL_TIME
                        sb.append(rs.getString(9)).append('\t');  // TIMESTAMP
                        sb.append(rs.getString(10)).append('\t'); // STATUS
                        sb.append(rs.getString(11)).append('\t'); // TEMPORARY
                        sb.append(rs.getString(12)).append('\t'); // GENERATED
                        sb.append(rs.getString(13)).append('\t'); // SECONDARY
                        sb.append(rs.getString(14)).append('\t'); // NAMESPACE
                        sb.append(rs.getString(15));              // EDITION_NAME
                        queue.put(sb.toString());
                    }
                    rs.close();
                    stmt.close();
                    conn.close();
                } catch (Throwable e) {
                    e.printStackTrace();
                } finally {
                    finished = true; // signal the writer even if reading failed
                }
            }
        };

        long startTime = System.currentTimeMillis();
        writeWorker.start();
        readerThread.start();
        System.out.println("Waiting for join..");
        writeWorker.join();
        System.out.println("Exit: " + (System.currentTimeMillis() - startTime));
    }
}
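As a side note on the thread hand-off: a plain volatile `finished` flag can race with a non-empty queue (the writer may observe the flag and exit while rows are still queued). An alternative is a sentinel ("poison pill") that the producer enqueues last, so the consumer stops exactly when everything before it has been drained. A minimal, Oracle-free sketch of the pattern:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class PoisonPillDemo {
    // Unique sentinel object; compared by identity, so a real "EOF" row
    // in the data cannot be confused with it.
    static final String POISON = new String("EOF");

    static List<String> produceAndConsume(int rows) throws InterruptedException {
        final BlockingQueue<String> queue = new ArrayBlockingQueue<String>(100);
        final List<String> written = new ArrayList<String>();

        Thread consumer = new Thread() {
            public void run() {
                try {
                    while (true) {
                        String s = queue.take();  // blocks; no timeout/sleep loop needed
                        if (s == POISON) return;  // identity check: clean shutdown
                        written.add(s);           // stands in for writer.write(s)
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        };
        consumer.start();

        for (int i = 0; i < rows; i++) {
            queue.put("row-" + i);  // stands in for formatted ResultSet rows
        }
        queue.put(POISON);          // enqueued last: no row can be lost
        consumer.join();            // join() makes 'written' safe to read here
        return written;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(produceAndConsume(5).size()); // prints 5
    }
}
```

With this shape the consumer never exits early, and the timed `poll` plus `sleep` busy-wait in the original writer loop disappears.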
There are two threads: one fetches rows from the result set and one writes the string values to the file. The measured loading speed was about 10 MB/s, and in my case I need to make it 10 times faster. The profiler shows that the most time-consuming methods are
oracle.jdbc.driver.OracleResultSetImpl.getString()
and
oracle.net.ns.Packet.receive()
Do you have any ideas on how to make JDBC load data much faster? Any ideas about query optimisation, string loading optimisation, tweaking the JDBC driver or using another one, using Oracle's JDBC implementations directly, or tweaking Oracle itself are appreciated.
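On the driver-tweaking side, one knob the thin driver exposes is the default row prefetch, which can be set once per connection via connection properties instead of calling setFetchSize on every statement. A sketch of that configuration (the "defaultRowPrefetch" property name is the documented Oracle thin-driver one; the URL, credentials, and the value 50000 are placeholders to tune):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.util.Properties;

public class PrefetchConfig {
    // Builds connection properties so every statement on the connection
    // prefetches large batches without an explicit setFetchSize() call.
    static Properties prefetchProps(String user, String pass, int rows) {
        Properties props = new Properties();
        props.setProperty("user", user);
        props.setProperty("password", pass);
        props.setProperty("defaultRowPrefetch", Integer.toString(rows));
        return props;
    }

    public static void main(String[] args) throws Exception {
        Properties props = prefetchProps("user", "pass", 50000);
        // Placeholder URL; requires the Oracle JDBC driver on the classpath:
        // Connection conn = DriverManager.getConnection(
        //         "jdbc:oracle:thin:@//xxx.xxx.xxx.xxx:1521/orcl", props);
        System.out.println(props.getProperty("defaultRowPrefetch"));
    }
}
```

A per-statement setFetchSize still overrides the connection-level default, so the two approaches can be combined.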
UPDATE: I compiled and listed the discussion results below:
I have no access to the DBMS server other than a connection to the Oracle DB, and the server can't connect to any external resource. Any dump or extraction utilities that use the server's or a remote file system can't be applied, and it's also impossible to install or use any external Java or PL/SQL routines on the server. Only a connection to execute queries - that's all.
I used a profiler and dug into the Oracle JDBC driver. I found that the most expensive operation is reading data, i.e. Socket.read(). All string fields are represented as a single char array and have almost no influence on performance. Generally, I checked the whole app with the profiler, and Socket.read() is definitely the most expensive operation. Extracting fields, building strings, and writing data consume almost nothing. The problem is only in reading data.
Any optimisations in data representation on the server side have no real effect. Concatenating strings and converting timestamps make no difference to performance.
The app was rewritten to have several reader threads that put ready data in the writer queue. Each thread has its own connection; no pools are used because they slow down the extraction (I used the UCP pool, which is recommended by Oracle, and it consumed about 10% of execution time, so I gave up on it). The result set fetchSize was also increased, because switching from the default value (10) to 50000 gives up to 50% performance growth.
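For the several-reader setup, each connection needs a disjoint slice of the table so the readers don't fetch the same rows twice. One way to carve that up (an assumption here: ORA_HASH over ROWID only works if the source is a plain table or a view simple enough to expose ROWID) is to hash each ROWID into N buckets and give each reader one bucket:

```java
public class PartitionedQueries {
    // Builds one SELECT per reader thread, splitting rows into disjoint
    // buckets with ORA_HASH over ROWID. The table/view name is the one
    // from the question; whether ROWID is available depends on the view.
    static String[] partitionedQueries(String table, int readers) {
        String[] queries = new String[readers];
        for (int i = 0; i < readers; i++) {
            queries[i] = "select * from " + table
                    + " where mod(ora_hash(rowid), " + readers + ") = " + i;
        }
        return queries;
    }

    public static void main(String[] args) {
        for (String q : partitionedQueries("src_schema.big_table_view", 4)) {
            System.out.println(q);
        }
    }
}
```

The trade-off is that every reader still triggers a full scan on the server side, so this only helps when the bottleneck is the per-connection network pipe rather than the scan itself.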
I tested how the multithreaded version works with 4 reading threads and found that increasing the reader count only slows the extraction. I tried launching 2 instances, each with two readers, and both finished in the same time as a single instance, i.e. extracting double the data takes the same time as single. I don't know why this happens, but it looks like the Oracle driver has some performance limitations. An application with 4 independent connections works slower than 2 app instances with 2 connections each. (The profiler was used to ensure that the driver's Socket.read() is still the main issue; all other parts work fine in multithreaded mode.)
I tried to fetch all the data with SAS, and it can perform the same extraction 2 times faster than JDBC; both used a single connection to Oracle and couldn't use any dump operations. Oracle claims that the JDBC thin driver is as fast as the native one..
Maybe Oracle has other ways to perform fast extraction to a remote host, through ODBC or something else?