2

I am reading a pipe delimited file, and loading to a table in a MySQL database. I eventually need to have this script run off an application server in Oregon, and load to the table in North Carolina. So yes, I realize there is going to be a sizeable network latency involved. Also, to address this to some extent, I made sure I was performing array inserts, and not row-by-row.

However, I am confused as to why I am seeing a dramatic difference in performance when I do this through Java, and MySQL JDBC versus, using Perl/DBI. When I run my inserts in batch sizes of 1000 rows at a time, Perl finishes in 8 seconds, while the java application takes 137 seconds (2 mins and 17 secs)

I didn't expect identical performance, but not a dramatic difference as this. Thoughts?

I am only a Java beginner, so its quite possible, I am leaving out something obvious! hence wanted to reach out.

Here are the relevant environments: Java: 1.6.0_31 MySQL JDBC: 5.1.31 perl: 5.10.1

Also, what else can I do to make inserts over WAN faster? Does MySQL support compressed data ingestion? I have some massive compute capacity on my application cluster.

public class testMySql_Load {

  public static void main(String[] args) throws IOException, ClassNotFoundException, SQLException, ParseException{
    int i = 0;      
    FileInputStream inputStream = new FileInputStream("pipe_delimited_text_file");
    Scanner sc = null;
    Connection connect = null;
    try {

      // this will load the MySQL driver, each DB has its own driver
      Class.forName("com.mysql.jdbc.Driver");       
      String connectionString = "jdbc:mysql://host:port/database?user=user&password=password";
      System.out.println(connectionString);
      // setup the connection with the DB.
      connect = DriverManager.getConnection(connectionString);
      connect.setAutoCommit(false);
      System.out.println("Connected");

      PreparedStatement truncateStatement = connect.prepareStatement("Truncate table target_table");
      truncateStatement.executeUpdate();

      // preparedStatements can use variables and are more efficient
      PreparedStatement preparedStatement = connect.prepareStatement(
        "INSERT INTO target_table( target_table.ds, ...... ) VALUES ( ?, .... )");      

      sc = new Scanner(inputStream, "UTF-8");
      while (sc.hasNextLine()) {
        i++;
        if(i>1000){
          break;
        }
        String[] line = sc.nextLine().split("\\|");
        //System.out.println(line);

        if(line[0].length() == 0) { preparedStatement.setNull(1, java.sql.Types.DATE); } else {
          preparedStatement.setDate(1, new java.sql.Date(new SimpleDateFormat("M/d/y", Locale.ENGLISH).parse(line[0].substring(0,Math.max(0, Math.min(line[0].length()-1, 19)))).getTime()));
        }
        preparedStatement.setInt(2, Integer.valueOf(line[1])); 
        preparedStatement.setString(3, line[2]); 
        preparedStatement.setString(4, line[3]); 
        preparedStatement.setString(5, line[4]); 
        preparedStatement.setDouble(6, Double.valueOf(line[5])); 
        preparedStatement.setDouble(7, Double.valueOf(line[6])); 
        preparedStatement.setString(8, line[7]); 
        preparedStatement.setString(9, line[8]); 
        if(line[9].length() == 0) { preparedStatement.setNull(10, java.sql.Types.DATE); } else {        
          preparedStatement.setTimestamp(10, new java.sql.Timestamp(new SimpleDateFormat("M/d/y H:m:s", Locale.ENGLISH).parse(line[9].substring(0,Math.max(0, Math.min(line[9].length()-1, 19)))).getTime()));
        }
        preparedStatement.setString(11, line[10]); 
        preparedStatement.setString(12, line[11]); 
        preparedStatement.setString(13, line[12]); 
        preparedStatement.setString(14, line[13]); 
        preparedStatement.setString(15, line[14]); 
        preparedStatement.setString(16, line[15]); 
        preparedStatement.setString(17, line[16]); 
        preparedStatement.setString(18, line[17]); 
        if(line[18].length() == 0) { preparedStatement.setNull(19, java.sql.Types.DATE); } else {
          preparedStatement.setTimestamp(19, new java.sql.Timestamp(new SimpleDateFormat("M/d/y H:m:s", Locale.ENGLISH).parse(line[18].substring(0,Math.max(0, Math.min(line[18].length()-1, 19)))).getTime()));
        }
        preparedStatement.setString(20, line[19]); 
        preparedStatement.setString(21, line[20]); 
        preparedStatement.setString(22, line[21]); 
        preparedStatement.setString(23, line[22]); 
        preparedStatement.setString(24, line[23]); 
        if(line[24].length() == 0) { preparedStatement.setNull(25, java.sql.Types.DATE); } else {
          preparedStatement.setTimestamp(25, new java.sql.Timestamp(new SimpleDateFormat("M/d/y H:m:s", Locale.ENGLISH).parse(line[24].substring(0,Math.max(0, Math.min(line[24].length()-1, 19)))).getTime()));
        }
        if(line[25].length() == 0) { preparedStatement.setNull(26, java.sql.Types.DATE); } else {
          preparedStatement.setTimestamp(26, new java.sql.Timestamp(new SimpleDateFormat("M/d/y H:m:s", Locale.ENGLISH).parse(line[25].substring(0,Math.max(0, Math.min(line[25].length()-1, 19)))).getTime()));
        }
        preparedStatement.setString(27, line[26]); 
        preparedStatement.setString(28, line[27]); 
        preparedStatement.setString(29, line[28]); 
        preparedStatement.setString(30, line[29]); 
        preparedStatement.setString(31, line[30]); 
        preparedStatement.setString(32, line[31]); 
        if(line[32].length() == 0) { preparedStatement.setNull(33, java.sql.Types.DATE); } else {
          preparedStatement.setTimestamp(33, new java.sql.Timestamp(new SimpleDateFormat("M/d/y H:m:s", Locale.ENGLISH).parse(line[32].substring(0,Math.max(0, Math.min(line[32].length()-1, 19)))).getTime()));
        }
        preparedStatement.setString(34, line[33]); 
        preparedStatement.setString(35, line[34]); 
        preparedStatement.setString(36, line[35]); 
        if(line[36].length() == 0) { preparedStatement.setNull(37, java.sql.Types.DATE); } else {
          preparedStatement.setTimestamp(37, new java.sql.Timestamp(new SimpleDateFormat("M/d/y H:m:s", Locale.ENGLISH).parse(line[36].substring(0,Math.max(0, Math.min(line[36].length()-1, 19)))).getTime()));
        }
        preparedStatement.setString(38, line[37]); 
        preparedStatement.setString(39, line[38]); 
        if(line[39].length() == 0) { preparedStatement.setNull(40, java.sql.Types.DATE); } else {
          preparedStatement.setTimestamp(40, new java.sql.Timestamp(new SimpleDateFormat("M/d/y H:m:s", Locale.ENGLISH).parse(line[39].substring(0,Math.max(0, Math.min(line[39].length()-1, 19)))).getTime()));
        }
        if(line[40].length() == 0) { preparedStatement.setNull(41, java.sql.Types.DATE); } else {
          preparedStatement.setTimestamp(41, new java.sql.Timestamp(new SimpleDateFormat("M/d/y H:m:s", Locale.ENGLISH).parse(line[40].substring(0,Math.max(0, Math.min(line[40].length()-1, 19)))).getTime()));
        }
        preparedStatement.setString(42, line[41]); 
        preparedStatement.setString(43, line[42]); 
        preparedStatement.setString(44, line[43]); 
        if(line[44].length() == 0) { preparedStatement.setNull(45, java.sql.Types.DATE); } else {
          preparedStatement.setDate(45, new java.sql.Date(new SimpleDateFormat("M/d/y", Locale.ENGLISH).parse(line[44].substring(0,Math.max(0, Math.min(line[44].length()-1, 19)))).getTime()));
        }
        preparedStatement.setDouble(46, Double.valueOf(line[45])); 
        preparedStatement.setString(47, line[46]); 
        preparedStatement.setString(48, line[47]); 
        preparedStatement.setString(49, line[48]); 
        preparedStatement.setString(50, line[49]); 
        preparedStatement.setString(51, line[50]); 
        preparedStatement.setString(52, line[51]); 
        preparedStatement.setString(53, line[52]); 
        preparedStatement.setString(54, line[53]); 
        preparedStatement.setString(55, line[54]); 
        preparedStatement.setString(56, line[55]); 
        preparedStatement.setString(57, line[56]); 
        preparedStatement.setInt(58, Integer.valueOf(line[57])); 
        if(line.length == 59){ if(  line[58].length() == 0) { preparedStatement.setNull(59, java.sql.Types.DATE); } else {
          preparedStatement.setTimestamp(59, new java.sql.Timestamp(new SimpleDateFormat("M/d/y H:m:s", Locale.ENGLISH).parse(line[58].substring(0,Math.max(0, Math.min(line[58].length()-1, 19)))).getTime()));
        }} else {preparedStatement.setNull(59, java.sql.Types.DATE);}
        preparedStatement.addBatch();

      }

      preparedStatement.executeBatch();
      connect.commit();
      preparedStatement.close();

      // note that Scanner suppresses exceptions
      if (sc.ioException() != null) {
        throw sc.ioException();
      }
    } finally {
      if (inputStream != null) {
        inputStream.close();
      }
      if (sc != null) {
        sc.close();
      }
      if (connect != null) {
        connect.close();
      }      
    }
  }

}
Bi Act
  • 334
  • 2
  • 3
  • 14
  • You haven't posted your implementation in the Java side, so we are not sure if you have an optimized or a poor implementation in each side. – Luiggi Mendoza Nov 23 '14 at 20:46
  • Such performance can only be explained by a bad implementation. i.e : In your Java program, are you creating a new connection for each insert ? Posting code might help.. –  Nov 23 '14 at 20:52
  • Thanks for the response, Luiggi! If you were to assume they were both as optimized as they could be, what would be your next set of 'items' to look at? I could be missing something, but if I am batching records in both techniques, there is really not much to optimize in this code either way, is there? – Bi Act Nov 23 '14 at 20:53
  • To use a profiler to determine the bottleneck in the app. And by your description, I still think you have a poor implementation in the Java side. Unless you show some code, I cannot think otherwise. Even if you think you have optimized all you can, the profiler can demonstrate the contrary. – Luiggi Mendoza Nov 23 '14 at 20:55
  • Just added the code - thanks Luiggi - is there an article/tutorial I can refer to, for using profiler? – Bi Act Nov 23 '14 at 21:01
  • Thanks Sujan! No, I am not creating a new connection for every insert. I just updated the post with new code. – Bi Act Nov 23 '14 at 21:02
  • 1
    You might find [this question](http://stackoverflow.com/q/26307760/2144390) interesting. – Gord Thompson Nov 23 '14 at 21:04
  • Apart of what @GordThompson added, you're executing batch statements from 1 to 1. And you can reuse your `PreparedStatement`, there's no need to close it on every iteration of your `while` loop. – Luiggi Mendoza Nov 23 '14 at 21:09
  • Along with it, there are other objects that can be created once and be reused through the entire app like that `SimpleDateFormat` – Luiggi Mendoza Nov 23 '14 at 21:11
  • @GordThompson Thanks thats very interesting, and likely helpful! – Bi Act Nov 23 '14 at 21:24
  • @LuiggiMendoza, I dont see where I am closing the preparedstatement with every while loop iteration. I AM instantiating SimpleDateFormat , and that I will change to being done once. Thanks! – Bi Act Nov 23 '14 at 21:29
  • BOOYAH!!! @GordThompson That was it - I added the rewriteBatchedStatements parm and the java app completed inserting 1000 rows in 5 secs! Thanks - how do I mark this question as being answered by your comment. Do you have add an 'answer' as opposed to a comment? – Bi Act Nov 23 '14 at 21:32
  • I'll write up an answer. – Gord Thompson Nov 23 '14 at 21:33

1 Answers1

1

When performing JDBC batch inserts into MySQL over a network connection we can improve performance by adding rewriteBatchedStatements=true to the JDBC connection URL. Without that directive, .executeBatch() will cause MySQL Connector/J to send individual INSERT statements like

INSERT INTO tablename (field1,field2) VALUES (value1,value2)
INSERT INTO tablename (field1,field2) VALUES (value3,value4)

With rewriteBatchedStatements=true, MySQL Connector/J will send multi-row inserts like

INSERT INTO tablename (field1, field2) VALUES (value1,value2),(value3,value4)

That can save quite a bit of bandwidth is the field list is long.

(For more details, see the related question here.)

Community
  • 1
  • 1
Gord Thompson
  • 116,920
  • 32
  • 215
  • 418
  • Awesome - that gave me a huge performance boost! Now onto working with the diminishing returns! – Bi Act Nov 23 '14 at 22:07