
I am completely new to multi-threaded programming.

This is my requirement: I have a file with 50,000 to 300,000 records.

It's column-based data (4 columns) with spaces as separators. I need to split each line on the spaces and save the records in the DB as 4 columns.

I want to develop a multi-threaded application which inserts the data into an H2 database with 4 columns (using JDBC, or anything else?) within approximately 2 seconds. I also need to change the thread pool size dynamically based on the number of records I receive.

I am developing a desktop application using Java Swing (not a web-based app).

I am not aware of any concurrency classes that would do this task much faster.

If multi-threading is not the answer, is there any other approach, or any other framework?

After adding batch processing, it takes approximately 5 seconds for 250,000 records:

    BufferedReader in = new BufferedReader(new FileReader(file));
    java.util.List<String[]> allLines = new ArrayList<String[]>(); // used for something else

    String sql = "insert into test (a, b, c, d)" +
            " values (?,?,?,?)";

    conn.setAutoCommit(false); // commit manually after each batch
    PreparedStatement pstmt = conn.prepareStatement(sql);
    String line;
    int i = 0;
    while ((line = in.readLine()) != null) {

        // collapse repeated spaces, then split on the single-space separator
        line = line.trim().replaceAll(" +", " ");
        String[] sp = line.split(" ");
        String msg = line.substring(line.indexOf(sp[5]));
        allLines.add(new String[]{sp[0] + " " + sp[1], sp[4], sp[5], msg});

        pstmt.setString(1, sp[0] + " " + sp[1]);
        pstmt.setString(2, sp[4]);
        pstmt.setString(3, sp[5]);
        pstmt.setString(4, msg);

        pstmt.addBatch();

        i++;

        // send and commit a batch every 1,000 rows
        if (i % 1000 == 0) {
            pstmt.executeBatch();
            conn.commit();
        }
    }

    // flush and commit the last partial batch, then release resources
    pstmt.executeBatch();
    conn.commit();
    pstmt.close();
    in.close();
Molay
  • Please show how you insert the data. – lexicore Apr 13 '18 at 07:27
  • I dunno....Not up on the new drives these days...do you think your local drive can go any faster? – DevilsHnd - 退職した Apr 13 '18 at 07:29
  • "I feel we can do it within 2 seconds" - why? What is the fastest time you managed reading the file? Most probably this task isn't CPU-bound but IO-bound (either on the file or the DB-side), so multithreading won't be of much use there. – piet.t Apr 13 '18 at 07:33
  • I would suggest that you write your data as a CSV file and then use database mechanisms to [import CSV](http://www.h2database.com/html/tutorial.html#csv). Also a usual trick on some other databases is to drop indices before huge imports and re-creating them afterwards. – lexicore Apr 13 '18 at 07:33
  • Also how much data are we talking about? In Mbs? – lexicore Apr 13 '18 at 07:34
  • @piet.t Could *also* be CPU on big tables if there are indices. Or if the OP inserts one line at a time, there's much overhead besides IO. – lexicore Apr 13 '18 at 07:35
  • But the index-maintenance is CPU-bound DB-side, so multithreading on the client-side won't do anything about it. – piet.t Apr 13 '18 at 07:41
  • Are you sure multi-threading is the correct way here? Have you verified that the single-threaded performance is worse? – kutschkem Apr 13 '18 at 07:57
  • Yes, single-threaded is too slow; it takes approximately 50 seconds to insert 250,000 records. – Molay Apr 13 '18 at 08:04
  • The 250,000-record file is around 30 MB in size. – Molay Apr 13 '18 at 08:05
  • Check your DB's capabilities, i.e. whether it supports batch/bulk inserts. – HRgiger Apr 13 '18 at 08:08
  • @Kumar You seem to insert data record per record. Switch to batch insertion first of all: https://stackoverflow.com/questions/4355046/java-insert-multiple-rows-into-mysql-with-preparedstatement – lexicore Apr 13 '18 at 08:09
  • @Kumar See also [Fast Database Import](http://h2database.com/html/performance.html#fast_import) documentation of H2. – lexicore Apr 13 '18 at 08:09
  • @piet.t I never said multithreading will do anything about it. The OP seems to insert one record at time, I think multithreading is effective probably because overhead is parallelized. – lexicore Apr 13 '18 at 08:13
  • @piet.t But with 30MB it can't really be IO, 30MB is peanuts. – lexicore Apr 13 '18 at 08:14
  • Check [How to read files in multithreaded mode?](https://stackoverflow.com/questions/17349862/how-to-read-files-in-multithreaded-mode?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa) – AxelH Apr 13 '18 at 08:20
  • how are you parsing the text file? maybe it could be optimized too – Fran Montero Apr 13 '18 at 08:20
  • @AxelH With 30MB I really don't think there's any need to optimize IO. – lexicore Apr 13 '18 at 08:24
  • Agreed @lexicore, I should have added that this was not the part to multithread (or not even the thing to do) – AxelH Apr 13 '18 at 08:25
  • @Kumar Are you using H2 as a local database or remotely in server mode? – lexicore Apr 13 '18 at 08:34
  • @lexicore I am using H2 as an in-memory database on the local system. – Molay Apr 13 '18 at 08:50
  • It's bothering me that you have edited your question to include the batch solution ... now these answers seem off topic. Even more problematic: you were expecting to insert 50k records in 2 sec instead of 4 sec (divide the time by two), and you said that you now insert 250k in 5 sec, previously in 12 sec. So this is what you were expecting. What more are you expecting? Multithreading would only allow you to manage multiple connections and multiple batches, but I expect the file reading to be a bottleneck. Do some benchmarking to see which part takes most of the time and we can focus on that. – AxelH Apr 13 '18 at 10:28
  • Sorry for that; instead of making my original post too lengthy I edited the existing thread. My intent is to insert the records irrespective of how many are in the file; all of them have to be inserted within 2 seconds. After changing my code to batch processing, 50k records are saved within 1 second, which is good, but 250k records take 5 seconds, and that's the problem, it should also be under 2 seconds. In the future, if I get 500k records, all of those should be saved within 2 seconds too. I hope that is clear. Is there any way I can achieve this, with multithreading or by some other means? – Molay Apr 13 '18 at 11:27
  • @Kumar I tried to give you the most complete solution I could think of. This is probably not the best design or implementation, but the architecture is there. This is the only improvement I can think of in your case. Or, like lexicore said, change the language ;) – AxelH Apr 13 '18 at 12:09

4 Answers


Improve the logic by:

  • Creating one instance of PreparedStatement and reusing it for every insertion
  • Using a batch so you only send big packages of inserts

This would be done with something like:

private final PreparedStatement pstmt;
private int batchSize = 0;

public BatchInsertion(Connection conn, String sql) throws SQLException {
    pstmt = conn.prepareStatement(sql);
}

public int insert(String a, String b, String c, String d) throws SQLException {
    pstmt.setString(1, a);
    pstmt.setString(2, b);
    pstmt.setString(3, c);
    pstmt.setString(4, d);

    pstmt.addBatch();
    return ++batchSize; // number of rows queued so far
}

public void sendBatch() throws SQLException {
    pstmt.executeBatch();
    batchSize = 0;
}

There you just need to manage the insertion with that instance: when you reach the last item, or say 1,000 items in the batch, send it.

I used this so I wasn't forced to insert into a Collection first.

Note: you need to close the statement at the end. I would have a class like this implement AutoCloseable to do that, and use a try-with-resources block to be safe.
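As a hedged usage sketch (assuming BatchInsertion above also implements AutoCloseable as suggested, with close() flushing the last partial batch and closing the statement; conn and file come from the question, and the column mapping is simplified):

// Sketch only: try-with-resources closes both the reader and the batch.
try (BufferedReader in = new BufferedReader(new FileReader(file));
     BatchInsertion batch = new BatchInsertion(conn, "insert into test (a, b, c, d) values (?,?,?,?)")) {
    String line;
    while ((line = in.readLine()) != null) {
        String[] sp = line.trim().split(" +");
        // send the batch every 1,000 rows, as suggested above
        if (batch.insert(sp[0], sp[1], sp[2], sp[3]) % 1000 == 0) {
            batch.sendBatch();
        }
    }
}
conn.commit();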


If you need to multi-thread this insertion, I would suggest the following architecture:

Create a pool of threads, each with its own connection and its own batch to insert data. Use a single queue to push the data read from the file. Each thread will take a value from the queue and add it to its batch.

With this architecture, you can easily increase the number of threads.

First, a lightweight BatchInsert class to be able to make this run:

class BatchInsert implements AutoCloseable {

    // Stub used for the test below: it only counts rows and prints batch
    // sizes; a real implementation would wrap a PreparedStatement like the
    // BatchInsertion snippet above.
    private int batchSize = 0;
    private final int batchLimit;

    public BatchInsert(int batchLimit) {
        this.batchLimit = batchLimit;
    }

    public void insert(String a, String b, String c, String d) {
        if (++batchSize >= batchLimit) {
            sendBatch();
        }
    }

    public void sendBatch() {
        System.out.format("Send batch with %d records%n", batchSize);
        batchSize = 0;
    }

    @Override
    public void close() {
        if (batchSize != 0) {
            sendBatch();
        }
    }
}

Then I use some sort of balancer to provide a queue and a number of threads sharing the same queue.

class BalanceBatch {
    private final List<RunnableBatch> threads = new ArrayList<>();

    private Queue<String> queue = new ConcurrentLinkedQueue<>();
    private static final int BATCH_SIZE = 50_000;

    public BalanceBatch(int nbThread) {
        IntStream.range(0, nbThread).mapToObj(i -> new RunnableBatch(BATCH_SIZE, queue)).forEach(threads::add);
    }

    public void send(String value) {
        queue.add(value);
    }

    public void startAll() {
        for (RunnableBatch t : threads) {
            new Thread(t).start();
        }
    }

    public void stopAll() {
        for (RunnableBatch t : threads) {
            t.stop();
        }
    }
}

Then I implement the logic that reads the queue in those Runnable instances. The idea is to poll the queue and feed the batch until the queue is empty and the thread has been told to stop.

class RunnableBatch implements Runnable {

    private volatile boolean started = true; // volatile: stop() is called from another thread
    private Queue<String> queue;
    private int batchLimit;

    public RunnableBatch(int batchLimit, Queue<String> queue) {
        this.batchLimit = batchLimit;
        this.queue = queue;
    }

    @Override
    public void run() {
        try (BatchInsert batch = new BatchInsert(batchLimit)) {
            while (!queue.isEmpty() || started) {
                String s = queue.poll();
                if (s == null) {
                    try {
                        Thread.sleep(10); // queue momentarily empty: back off briefly
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt(); // restore the interrupt flag
                    }
                } else {
                    String[] values = s.split(";");
                    batch.insert(values[0], values[1], values[2], values[3]);
                }
            }
        }
    }

    public void stop() {
        started = false;
    }
}

I tried this with a basic test:

public static void main(String[] args) throws IOException {
    createDummy("/tmp/data.txt", 25_000_000);

    BalanceBatch balance = new BalanceBatch(10);

    balance.startAll();
    try (Stream<String> stream = Files.lines(Paths.get("/tmp/data.txt"))) {
        stream.forEach(balance::send);
    } catch (Exception e1) {
        e1.printStackTrace();
    }
    balance.stopAll();
}

public static void createDummy(String file, int nbLine) throws IOException {
    Files.write(Paths.get(file), (Iterable<String>) IntStream.range(0, nbLine).mapToObj(i -> String.format("A%d;B%d;C%d;D%d", i, i, i, i))::iterator);
}

This will print each batch that is sent and show that the last ones are quite random in size, since the balancing is not "constant". Example with 10 threads and 50k records per batch:

Send batch with 50000 records
...
Send batch with 50000 records
Send batch with 15830 records
Send batch with 15844 records
Send batch with 2354 records
Send batch with 14654 records
Send batch with 40181 records
Send batch with 44994 records
Send batch with 38376 records
Send batch with 17187 records
Send batch with 27047 records
Send batch with 33533 records

Notes:

Warning: the createDummy function will create a file with 25,000,000 lines (I have commented it out). That is roughly a 1 GB file of data.

I would need more time to do some benchmarking; I don't have any DB available for massive insertion at the moment.


Mixing this multithreaded file reading with the batch insertion should give you good results.
Please note this is probably not the best implementation of multithreading; I never had to work on the subject. I am open to suggestions/improvements.

AxelH

For example, I created a CSV file with 300,000 records, and the measured time to read it and add everything to the DB was 2625 ms. Try OpenCSV to read the records from the file and then put them into the DB like this. When writing to the DB, use a PreparedStatement and executeBatch():

// try block, connection...
PreparedStatement preparedStatement = connection.prepareStatement(query);
for (int i = 0; i < recordsCount; i++) {
    // rec1..rec4 stand for the current record's four column values
    preparedStatement.setString(1, rec1);
    preparedStatement.setString(2, rec2);
    preparedStatement.setString(3, rec3);
    preparedStatement.setString(4, rec4);
    preparedStatement.addBatch();
    if ((i + 1) % 500 == 0) preparedStatement.executeBatch(); // send every 500 rows
}
preparedStatement.executeBatch(); // flush the last partial batch
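For the reading side, a rough OpenCSV sketch (hedged: it assumes com.opencsv.CSVReader is on the classpath, a comma-separated file with at least four columns per row, and that the exceptions declared by readNext() vary between OpenCSV versions, so the enclosing method would need a broad throws clause):

// Hedged sketch, not the original answer's code: read rows with OpenCSV
// and feed the same PreparedStatement batch as above.
try (CSVReader reader = new CSVReader(new FileReader("data.csv"))) {
    String[] row;
    while ((row = reader.readNext()) != null) {
        preparedStatement.setString(1, row[0]);
        preparedStatement.setString(2, row[1]);
        preparedStatement.setString(3, row[2]);
        preparedStatement.setString(4, row[3]);
        preparedStatement.addBatch();
    }
}
preparedStatement.executeBatch();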

A PreparedStatement with executeBatch() is faster than executing individual statements, because you don't create a lot of separate queries. Example of performance (see tests).


The problem in your example is that you create a prepared statement for each values entry.

Batch execution is an option, but I would build a single insert statement with multiple values like this:

insert into data (a, b, c, d) 
values (a1, b1, c1, d1), (a2, b2, c2, d2), (a3, b3, c3, d3)...

Then you can execute it once and that's it.
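A rough sketch of that idea with a PreparedStatement and placeholders (hedged: the table/column names come from the question, the insertChunk helper is hypothetical, and very large statements can hit driver or database limits, so chunks should stay reasonably small):

// Hypothetical helper, not from the answer: build one INSERT with one
// (?,?,?,?) tuple per row, bind all values, and execute it once per chunk.
// rows must be non-empty and each row must hold exactly four values.
static void insertChunk(Connection conn, List<String[]> rows) throws SQLException {
    StringBuilder sql = new StringBuilder("insert into test (a, b, c, d) values ");
    for (int i = 0; i < rows.size(); i++) {
        sql.append(i == 0 ? "(?,?,?,?)" : ", (?,?,?,?)");
    }
    try (PreparedStatement ps = conn.prepareStatement(sql.toString())) {
        int p = 1;
        for (String[] row : rows) {
            for (String value : row) {
                ps.setString(p++, value);
            }
        }
        ps.executeUpdate();
    }
}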

serhii
  • How do you expect to manage this query using a PreparedStatement? Use a batch instead. – AxelH Apr 13 '18 at 08:43
  • It is not hard to build a query with a StringBuilder, placing (?, ?, ?, ?) the necessary number of times. Then fill it with the data. – serhii Apr 13 '18 at 08:45
  • But why? What do you have against the batch? It is there to prevent this kind of complexity... It isn't hard, but it is a poor solution to use. – AxelH Apr 13 '18 at 08:48
  • This seems to be invalid in H2 SQL syntax: http://www.h2database.com/html/grammar.html#insert – lexicore Apr 13 '18 at 08:51
  • Do you see the `,...` after the closing bracket of the values? It seems to be correct syntax. – serhii Apr 13 '18 at 09:01
  • Are you sure it's a good solution for huge data? I am getting an error at the execute() statement, and the system hangs for about 3-4 seconds before getting back to a normal state. Tried with 250,000 records. – Molay Apr 13 '18 at 09:24
  • Not 100% sure. Can you show your code? (Paste it somewhere.) – serhii Apr 13 '18 at 09:43

Other answers have already pointed out that you should use batch insertions. I think that for the fastest imports you shouldn't actually go through Java at all.

See Fast Database Import in the H2 documentation:

To speed up large imports, consider using the following options temporarily:

  • SET LOG 0 (disabling the transaction log)
  • SET CACHE_SIZE (a large cache is faster)
  • SET LOCK_MODE 0 (disable locking)
  • SET UNDO_LOG 0 (disable the session undo log)

These options can be set in the database URL: jdbc:h2:~/test;LOG=0;CACHE_SIZE=65536;LOCK_MODE=0;UNDO_LOG=0. Most of those options are not recommended for regular use, that means you need to reset them after use.

If you have to import a lot of rows, use a PreparedStatement or use CSV import. Please note that CREATE TABLE(...) ... AS SELECT ... is faster than CREATE TABLE(...); INSERT INTO ... SELECT ....
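As a sketch, those options go into the JDBC URL used to open the import connection (hedged: the URL and option values below are the ones from the documentation quoted above; the "sa"/empty-password credentials are an assumption, and the question actually uses an in-memory database):

// Hedged sketch: open H2 with the fast-import options for the duration of
// the import, run the batched inserts, then commit. Reset the options
// (or simply use a normal URL) for regular use afterwards.
String url = "jdbc:h2:~/test;LOG=0;CACHE_SIZE=65536;LOCK_MODE=0;UNDO_LOG=0";
try (Connection conn = DriverManager.getConnection(url, "sa", "")) {
    conn.setAutoCommit(false);
    // ... batched inserts here ...
    conn.commit();
}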

Another trick I know from some other databases is to drop indices before inserting and recreate them after inserting. I have no idea whether this has any effect on H2, but I wanted to mention it as a potential thing to try.

You can import CSV simply with:

INSERT INTO MY_TABLE(...) SELECT * FROM CSVREAD('data.csv');

H2 documentation mentions that CREATE TABLE(...) ... AS SELECT ...; is faster, but I assume you want to insert data into an existing table rather than create a new one.

For this approach to work, your database needs access to the data.csv file. That is trivial with a local database, but not so easy if you use a remote one.
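If the file can be read as CSV after all, a hedged sketch of inserting only a subset of its columns into an existing table (the table and column names here are placeholders, the header row of data.csv is assumed to name the columns, and the CSVREAD documentation describes the options for separator and charset):

// Hedged sketch with placeholder names: pick four of the CSV's columns
// and insert them into an existing table in a single statement.
try (Statement stmt = conn.createStatement()) {
    stmt.execute(
        "INSERT INTO TEST (A, B, C, D) " +
        "SELECT COL1, COL2, COL3, COL4 FROM CSVREAD('data.csv')");
}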

lexicore
  • I have updated my insert script using executeBatch; it's taking around 5 seconds for 250,000 records. – Molay Apr 13 '18 at 08:52
  • I cannot use CSVREAD as I have other data as well, which I need to ignore before inserting into the DB, so PreparedStatement is the ideal one for my requirement. – Molay Apr 13 '18 at 08:55
  • What do you mean by "other data"? If it's unnecessary columns, you can configure it in the query. – lexicore Apr 13 '18 at 08:56
  • Other data means other columns, which are not supposed to go into the DB. BTW, I am getting a TXT file as input which contains 10 columns separated by (single or multiple) spaces; out of them I need to pick only 4 columns and insert them into my DB. I have updated my original post with the actual logic. – Molay Apr 13 '18 at 09:04
  • After using LOG=0;CACHE_SIZE=65536;LOCK_MODE=0;UNDO_LOG=0, the data is loaded within 4 seconds (1 sec saved) for 250,000 records. – Molay Apr 13 '18 at 09:07
  • @Kumar You can insert a subset of columns, it's not a problem. In [`CSVREAD`](http://www.h2database.com/html/functions.html#csvread) you can configure delimiters, so maybe you can read your TXT file as a CSV file. Maybe not. – lexicore Apr 13 '18 at 09:15
  • @Kumar If your code is already working, maybe you could post it on Code Review and ask if further optimizations are possible. – lexicore Apr 13 '18 at 09:17
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/168905/discussion-between-kumar-and-lexicore). – Molay Apr 13 '18 at 09:28