Improve the logic with:
- Create one instance of PreparedStatement and reuse it for every insertion
- Use batching so you send large packages of inserts instead of one statement per row

This would be done with something like:
    private final PreparedStatement pstmt;
    private int batchSize = 0;

    public BatchInsertion(Connection conn, String sql) throws SQLException {
        pstmt = conn.prepareStatement(sql);
    }

    public int insert(String a, String b, String c, String d) throws SQLException {
        pstmt.setString(1, a);
        pstmt.setString(2, b);
        pstmt.setString(3, c);
        pstmt.setString(4, d);
        pstmt.addBatch();
        return ++batchSize; // number of rows waiting in the current batch
    }

    public void sendBatch() throws SQLException {
        pstmt.executeBatch();
        batchSize = 0;
    }
There you just need to manage the insertions with that instance: when you reach the last item, or say 1000 items in the batch, send it. I used this to avoid being forced to insert into a Collection first.

Note: you need to close the statement at the end. I would implement AutoCloseable on a class like this to do it, and use a try-with-resources to be safe, as sketched below.
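A minimal sketch of that usage, assuming BatchInsertion implements AutoCloseable (the close() body, the INSERT_SQL constant and the rows source are my assumptions for illustration):

    // Assumed close() implementation: flush any pending rows, then release the statement.
    @Override
    public void close() throws SQLException {
        if (batchSize > 0) {
            sendBatch();
        }
        pstmt.close();
    }

    // Caller side: try-with-resources guarantees close() runs even on failure.
    try (BatchInsertion batch = new BatchInsertion(conn, INSERT_SQL)) {
        for (String[] row : rows) {
            if (batch.insert(row[0], row[1], row[2], row[3]) >= 1000) {
                batch.sendBatch(); // flush every 1000 items
            }
        }
    } // leftover rows are flushed by close()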
If you need to multithread this insertion, I would suggest the following architecture:

- Create a pool of threads; each will have its own connection and batch to insert data.
- Use one shared queue to push the data read from the file.
- Each thread will take a value from the queue and add it to its batch.

With this architecture, you can easily increase the number of threads.
First, a lightweight BatchInsert class to make this runnable:
    class BatchInsert implements AutoCloseable {

        private int batchSize = 0;
        private final int batchLimit;

        public BatchInsert(int batchLimit) {
            this.batchLimit = batchLimit;
        }

        // Stub: a real implementation would bind a, b, c, d to the
        // PreparedStatement (as in BatchInsertion above) before counting.
        public void insert(String a, String b, String c, String d) {
            if (++batchSize >= batchLimit) {
                sendBatch();
            }
        }

        public void sendBatch() {
            System.out.format("Send batch with %d records%n", batchSize);
            batchSize = 0;
        }

        @Override
        public void close() {
            // Flush whatever is left when the thread finishes.
            if (batchSize != 0) {
                sendBatch();
            }
        }
    }
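In a real run, sendBatch would delegate to the PreparedStatement from the first snippet instead of just printing; a sketch, assuming each worker owns its own pstmt field (wrapping the checked SQLException because Runnable.run cannot throw it):

    public void sendBatch() {
        try {
            pstmt.executeBatch(); // one round trip for the whole batch
            System.out.format("Send batch with %d records%n", batchSize);
            batchSize = 0;
        } catch (SQLException e) {
            throw new RuntimeException("Batch insert failed", e);
        }
    }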
Then, I use some sort of balancer to provide a queue and a number of Thread instances sharing that same queue.
    class BalanceBatch {

        private static final int BATCH_SIZE = 50_000;

        private final List<RunnableBatch> threads = new ArrayList<>();
        private final Queue<String> queue = new ConcurrentLinkedQueue<>();

        public BalanceBatch(int nbThread) {
            IntStream.range(0, nbThread)
                     .mapToObj(i -> new RunnableBatch(BATCH_SIZE, queue))
                     .forEach(threads::add);
        }

        public void send(String value) {
            queue.add(value);
        }

        public void startAll() {
            for (RunnableBatch t : threads) {
                new Thread(t).start();
            }
        }

        public void stopAll() {
            for (RunnableBatch t : threads) {
                t.stop();
            }
        }
    }
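One refinement I would consider (my addition, not in the code above): keep the Thread references so stopAll can join them, which guarantees every worker has flushed its last batch before the caller continues:

    // Hypothetical variant: track the Thread objects so we can wait for them.
    private final List<Thread> workers = new ArrayList<>();

    public void startAll() {
        for (RunnableBatch t : threads) {
            Thread worker = new Thread(t);
            workers.add(worker);
            worker.start();
        }
    }

    public void stopAll() throws InterruptedException {
        for (RunnableBatch t : threads) {
            t.stop();              // signal: drain the queue, then exit
        }
        for (Thread worker : workers) {
            worker.join();         // wait until its final batch is flushed
        }
    }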
Then I implement the logic to read the queue for those Runnable instances. The idea is to poll the queue and feed the batch until the queue is empty and a stop signal has been received.
    class RunnableBatch implements Runnable {

        // volatile: stop() is called from another thread, so the
        // worker must see the updated value.
        private volatile boolean started = true;
        private final Queue<String> queue;
        private final int batchLimit;

        public RunnableBatch(int batchLimit, Queue<String> queue) {
            this.batchLimit = batchLimit;
            this.queue = queue;
        }

        @Override
        public void run() {
            try (BatchInsert batch = new BatchInsert(batchLimit)) {
                // Keep draining until the queue is empty AND we were told to stop.
                while (!queue.isEmpty() || started) {
                    String s = queue.poll();
                    if (s == null) {
                        // Queue momentarily empty: back off instead of spinning.
                        try {
                            Thread.sleep(10);
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        }
                    } else {
                        String[] values = s.split(";");
                        batch.insert(values[0], values[1], values[2], values[3]);
                    }
                }
            }
        }

        public void stop() {
            started = false;
        }
    }
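As a possible improvement (my suggestion, not part of the code above): a java.util.concurrent.BlockingQueue with a poll timeout would replace the hand-rolled sleep loop, blocking until data arrives:

    // Sketch only: swap ConcurrentLinkedQueue for a BlockingQueue so the
    // worker blocks instead of sleep-polling.
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();

    @Override
    public void run() {
        try (BatchInsert batch = new BatchInsert(batchLimit)) {
            while (!queue.isEmpty() || started) {
                try {
                    String s = queue.poll(10, TimeUnit.MILLISECONDS);
                    if (s != null) {
                        String[] values = s.split(";");
                        batch.insert(values[0], values[1], values[2], values[3]);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    started = false; // treat interruption as a stop request
                }
            }
        }
    }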
I tried this with a basic test:
    public static void main(String[] args) throws IOException {
        createDummy("/tmp/data.txt", 25_000_000);
        BalanceBatch balance = new BalanceBatch(10);
        balance.startAll();
        try (Stream<String> stream = Files.lines(Paths.get("/tmp/data.txt"))) {
            stream.forEach(balance::send);
        } catch (Exception e1) {
            e1.printStackTrace();
        }
        balance.stopAll();
    }

    // Generate a dummy file with nbLine rows of "A<i>;B<i>;C<i>;D<i>".
    public static void createDummy(String file, int nbLine) throws IOException {
        Files.write(Paths.get(file),
                (Iterable<String>) IntStream.range(0, nbLine)
                        .mapToObj(i -> String.format("A%d;B%d;C%d;D%d", i, i, i, i))::iterator);
    }
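For completeness, the imports these snippets rely on (standard JDK only):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Queue;
    import java.util.concurrent.ConcurrentLinkedQueue;
    import java.util.stream.IntStream;
    import java.util.stream.Stream;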
This will print each batch sent and show that the last batches are of quite random size, since the load balancing is not perfectly even. Example with 10 threads and 50k records per batch:
Send batch with 50000 records
...
Send batch with 50000 records
Send batch with 15830 records
Send batch with 15844 records
Send batch with 2354 records
Send batch with 14654 records
Send batch with 40181 records
Send batch with 44994 records
Send batch with 38376 records
Send batch with 17187 records
Send batch with 27047 records
Send batch with 33533 records
Notes:

- Warning: the createDummy function will create a file with 25_000_000 lines (I have commented that call out). This is roughly 1 GB of data.
- I will need more time to do some benchmarks; I don't have any DB for massive insertion at the moment. Mixing this multithreaded file reader and the batching should give you good results.
- Please note this is probably not the best implementation of multithreading; I have never had to work on the subject. I am open to suggestions/improvements.