
I am trying to improve my Java app's performance, and at this point I'm focusing on one endpoint that has to insert a large amount of data into MySQL.

I'm using plain JDBC with the MariaDB Java client driver:

try (PreparedStatement stmt = connection.prepareStatement(
            "INSERT INTO data (" +
                    "fId, valueDate, value, modifiedDate" +
                    ") VALUES (?,?,?,?)")) {
    for (DataPoint dp : datapoints) {
        stmt.setLong(1, fId);
        stmt.setDate(2, new java.sql.Date(dp.getDate().getTime()));
        stmt.setDouble(3, dp.getValue());
        stmt.setDate(4, new java.sql.Date(modifiedDate.getTime()));
        stmt.addBatch();
    }
    int[] results = stmt.executeBatch();
}

From populating the new DB from dumped files, I know that max_allowed_packet is important, and I've got that set to 536,870,912 bytes (512 MiB).

In https://dev.mysql.com/doc/refman/5.7/en/insert-optimization.html it states that:

If you are inserting many rows from the same client at the same time, use INSERT statements with multiple VALUES lists to insert several rows at a time. This is considerably faster (many times faster in some cases) than using separate single-row INSERT statements. If you are adding data to a nonempty table, you can tune the bulk_insert_buffer_size variable to make data insertion even faster. See Section 5.1.7, “Server System Variables”.

On my DBs, this is set to 8MB.

I've also read about key_buffer_size (currently set to 16MB).

I'm concerned that these last two might not be enough. I can do some rough calculations on the JSON input to this algorithm, because it looks something like this:

[{"actualizationDate":null,"data":[{"date":"1999-12-31","value":0},
{"date":"2000-01-07","value":0},{"date":"2000-01-14","value":3144},
{"date":"2000-01-21","value":358},{"date":"2000-01-28","value":1049},
{"date":"2000-02-04","value":-231},{"date":"2000-02-11","value":-2367},
{"date":"2000-02-18","value":-2651},{"date":"2000-02-25","value":-393},
{"date":"2000-03-03","value":1725},{"date":"2000-03-10","value":-896},
{"date":"2000-03-17","value":2210},{"date":"2000-03-24","value":1782},

and it looks like the 8MB configured for bulk_insert_buffer_size could easily be exceeded, if not key_buffer_size as well.
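As a sanity check on those rough calculations (my own back-of-the-envelope figures, not measured: I'm assuming roughly 30 bytes per row for a long, two DATEs, and a DOUBLE plus some overhead), the 8MB buffer would hold on the order of a few hundred thousand rows:

```java
// Rough estimate of how many rows fit in bulk_insert_buffer_size.
// The ~30 bytes-per-row figure is an assumption for illustration only.
public class BufferEstimate {
    static long rowsThatFit(long bufferBytes, long bytesPerRow) {
        return bufferBytes / bytesPerRow;
    }

    public static void main(String[] args) {
        long buffer = 8L * 1024 * 1024; // 8 MB, as configured
        long perRow = 30;               // assumed average row size
        System.out.println(rowsThatFit(buffer, perRow)); // prints 279620
    }
}
```

So a single unbounded batch over a few hundred thousand data points would plausibly overrun it.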

But the MySQL docs only mention these variables in the context of the MyISAM engine, and I'm currently using InnoDB tables.

I can set up some tests, but it would be good to know how this will break or degrade, if at all.

[EDIT] I have rewriteBatchedStatements=true. In fact, here's my connection string:

jdbc:p6spy:mysql://myhost.com:3306/mydb\
    ?verifyServerCertificate=true\
    &useSSL=true\
    &requireSSL=true\
    &cachePrepStmts=true\
    &cacheResultSetMetadata=true\
    &cacheServerConfiguration=true\
    &elideSetAutoCommits=true\
    &maintainTimeStats=false\
    &prepStmtCacheSize=250\
    &prepStmtCacheSqlLimit=2048\
    &rewriteBatchedStatements=true\
    &useLocalSessionState=true\
    &useLocalTransactionState=true\
    &useServerPrepStmts=true

(from https://github.com/brettwooldridge/HikariCP/wiki/MySQL-Configuration )
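For what it's worth, my understanding is that rewriteBatchedStatements makes the driver collapse the batch into a single multi-VALUES statement before sending it. A rough sketch of the shape of the rewritten SQL (illustrative string-building only; the actual rewriting happens inside the driver):

```java
// Sketch: the kind of multi-row INSERT the driver sends when
// rewriteBatchedStatements=true collapses a JDBC batch into one statement.
public class MultiValuesSketch {
    static String buildMultiRowInsert(int rows) {
        StringBuilder sql = new StringBuilder(
                "INSERT INTO data (fId, valueDate, value, modifiedDate) VALUES ");
        for (int i = 0; i < rows; i++) {
            if (i > 0) sql.append(',');
            sql.append("(?,?,?,?)");
        }
        return sql.toString();
    }

    public static void main(String[] args) {
        System.out.println(buildMultiRowInsert(3));
    }
}
```

Which is presumably why max_allowed_packet matters here: the whole rewritten statement has to fit in one packet.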

Adam
    [This answer](https://stackoverflow.com/a/26313288/2144390) may be of interest. – Gord Thompson May 02 '18 at 12:54
    You can simply execute the batch every #### insertions. Optimizing this further will probably be a pain, when you can simply execute the batch from time to time. PS: Be aware that "_This is considerably faster (many times faster in some cases) than using separate single-row INSERT_" is not about prepared statements; a batch sends one compiled query (depending on the DB) and a batch of parameters. It is more efficient than executing an insert query on each iteration. – AxelH May 02 '18 at 12:59
  • @AxelH yes, great point, something I was too busy to realise... – Adam May 02 '18 at 13:02

1 Answer


An alternative is to execute the batch from time to time. This allows you to reduce the size of the batches and lets you focus on more important problems.

int batchSize = 0;

for (DataPoint dp : datapoints) {
    stmt.setLong(1, fId);
    stmt.setDate(2, new java.sql.Date(dp.getDate().getTime()));
    stmt.setDouble(3, dp.getValue());
    stmt.setDate(4, new java.sql.Date(modifiedDate.getTime()));
    stmt.addBatch();

    // When the limit is reached, execute and reset the counter
    if (++batchSize >= BATCH_LIMIT) {
        stmt.executeBatch();
        batchSize = 0;
    }
}

// Execute the remaining items
if (batchSize > 0) {
    stmt.executeBatch();
}

I generally use a constant, or a parameter based on the DAO implementation to be more dynamic, but a batch of 10_000 rows is a good start.

private static final int BATCH_LIMIT = 10_000;

Note that it is not necessary to clear the batch after an execution. Even though this is not specified in the Statement.executeBatch documentation, it is in the JDBC 4.3 specification:

14 Batch Updates
14.1 Description of Batch Updates
14.1.2 Successful Execution

Calling the method executeBatch closes the calling Statement object’s current result set if one is open.
The statement’s batch is reset to empty once executeBatch returns.

Managing the results is a bit more complicated, but you can still concatenate the int[] update counts from each executeBatch call if you need them. These can be analyzed at any time, since no ResultSet needs to be kept open.
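For example, one way to keep the update counts across the chunked executions (a sketch using a hypothetical BatchResults helper of my own, not part of JDBC) is to collect each returned int[] and flatten them at the end:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: collect the int[] returned by each executeBatch() call and
// flatten them into a single array once all chunks have been executed.
public class BatchResults {
    private final List<int[]> chunks = new ArrayList<>();

    void add(int[] counts) {
        chunks.add(counts);
    }

    int[] flatten() {
        int total = 0;
        for (int[] c : chunks) total += c.length;
        int[] all = new int[total];
        int pos = 0;
        for (int[] c : chunks) {
            System.arraycopy(c, 0, all, pos, c.length);
            pos += c.length;
        }
        return all;
    }

    public static void main(String[] args) {
        BatchResults r = new BatchResults();
        r.add(new int[]{1, 1, 1}); // first executeBatch() result
        r.add(new int[]{1, 1});    // remainder
        System.out.println(r.flatten().length); // prints 5
    }
}
```

In the loop above you would call add(stmt.executeBatch()) at each execution point instead of discarding the result.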

AxelH
    It's a shame there's no method on `Statement` giving the number of bytes already added to the batch. It would be great to generify the handling somehow to set the batch count dependent on bytes added. – Adam May 02 '18 at 13:41
    Calling `clearBatch()` shouldn't be necessary; that should be done automatically on `executeBatch()`. From the JDBC 4.3 specification (section 14.1.2) : _"The statement’s batch is reset to empty once `executeBatch` returns."_ – Mark Rotteveel May 02 '18 at 16:07
    I will check once I reach my laptop @MarkRotteveel. I quickly checked the documentation of the method, but it was not specified, so I preferred to be safe ;) I didn't go to the JDBC specification. Thanks for that, by the way! – AxelH May 02 '18 at 16:35
    Thanks @MarkRotteveel, I have adapted the answer based on your input. I should take a bit more time on that specification, I never really took the time to overview it. – AxelH May 03 '18 at 05:35