
I'm trying to read over 17,000 files (each containing between 100 and 23,000 rows) and parse the data into a MySQL database. The problem is that this is too slow, and I don't know where the bottleneck is.

private void readFile() { 
    PreparedStatement prepStatement = null;

    String queryInsItem = "INSERT IGNORE INTO item VALUES(?)";

    String queryInsUser = "INSERT IGNORE INTO user VALUES(?)";

    String queryInsRating = "INSERT IGNORE INTO rating VALUES(?,?,?,?)";

    try {
        int x = 1;
        int itemID = 0;
        int userID = 0;
        int rating = 0;
        java.util.Date date = null;
        java.sql.Date sqlDate = null;
        DateFormat dateFormat = new SimpleDateFormat("yyyy-MM-dd", Locale.ENGLISH);
        String line = null;

        conn.setAutoCommit(false);
        System.out.println("Loading...");
        File dir = new File("src/bigdata/training_set/");
        File[] directoryListing = dir.listFiles();
        if (directoryListing != null) {
            for (File itemFile : directoryListing) {
                in = new BufferedReader(new FileReader(itemFile));
                line = in.readLine();
                itemID = Integer.parseInt(line.substring(0, line.length()-1));
                userID = 0;
                rating = 0;
                date = null;
                sqlDate = null;

                // Add to item table
                prepStatement = conn.prepareStatement(queryInsItem);
                prepStatement.setInt(1, itemID);
                prepStatement.executeUpdate();
                conn.commit();
                prepStatement.close();


                while ((line = in.readLine()) != null) {
                    // Split the line into the corresponding variables
                    userID = Integer.parseInt(line.substring(0, line.indexOf(",")));
                    rating = Integer.parseInt(line.substring(line.indexOf(",")+1, line.lastIndexOf(",")));
                    date = dateFormat.parse(line.substring(line.lastIndexOf(",")+1, line.length()));

                    sqlDate = new java.sql.Date(date.getTime());

                    // Add to user table
                    prepStatement = conn.prepareStatement(queryInsUser);
                    prepStatement.setInt(1, userID);
                    prepStatement.executeUpdate();
                    conn.commit();
                    prepStatement.close();

                    // Add to rating table
                    prepStatement = conn.prepareStatement(queryInsRating);
                    prepStatement.setInt(1, userID);
                    prepStatement.setInt(2, itemID);
                    prepStatement.setInt(3, rating);
                    prepStatement.setDate(4, sqlDate);
                    prepStatement.executeUpdate();
                    conn.commit();
                    prepStatement.close();

                }
                in.close();
                System.out.println("File " + x++ +" done.");
            }
        }


    } catch (IOException | ParseException | SQLException e) {e.printStackTrace();}

    System.out.println("Done.");
}

I've tried splitting the lines with str.split first and then switched to indexOf/lastIndexOf, but there is no noticeable improvement, as people mentioned in 19486077. Someone in the same thread suggested using threads, but is that the right approach in my case?

Here is a snippet of the raw data:

5317:
2354291,3,2005-07-05
185150,2,2005-07-05
868399,3,2005-07-05

Above means:

[item_id]:
[user_id],[rating],[date]
[user_id],[rating],[date]
[user_id],[rating],[date]
user2097519
    MySQL has native support for loading text files into a database table, and it's usually much faster than writing code with insert statements: https://dev.mysql.com/doc/refman/5.1/en/load-data.html Perhaps you'd be better off transforming the files into a format that can be loaded into MySQL via LOAD DATA INFILE. It's faster in my experience. – kuujo Mar 24 '15 at 05:28
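A minimal sketch of that suggestion, assuming a rating table whose columns are ordered (userID, itemID, rating, date) and that LOAD DATA LOCAL is allowed on both server (local_infile) and client (allowLoadLocalInfile=true in Connector/J); the toCsv/loadCsv helper names are made up for illustration:

import java.io.*;
import java.nio.file.*;
import java.sql.*;

// Rewrite one training-set file ("5317:" header, then "user,rating,date" rows)
// into a plain CSV of "userID,itemID,rating,date" rows that LOAD DATA can read directly.
static Path toCsv(File itemFile, Path csvDir) throws IOException {
    Path out = csvDir.resolve(itemFile.getName() + ".csv");
    try (BufferedReader in = new BufferedReader(new FileReader(itemFile));
         BufferedWriter w = Files.newBufferedWriter(out)) {
        String header = in.readLine();                            // e.g. "5317:"
        String itemID = header.substring(0, header.length() - 1);
        String line;
        while ((line = in.readLine()) != null) {
            int firstComma = line.indexOf(',');
            // userID,itemID,rating,date
            w.write(line.substring(0, firstComma) + "," + itemID + line.substring(firstComma));
            w.newLine();
        }
    }
    return out;
}

// Bulk-load one CSV; the CSV column order has to match the rating table's column order.
static void loadCsv(Connection conn, Path csv) throws SQLException {
    try (Statement st = conn.createStatement()) {
        st.execute("LOAD DATA LOCAL INFILE '" + csv.toAbsolutePath().toString().replace('\\', '/')
                + "' INTO TABLE rating FIELDS TERMINATED BY ','");
    }
}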

1 Answer

  • If you have an AUTO_INCREMENT PRIMARY KEY on those tables, beware that INSERT IGNORE will burn ids like crazy.
  • "Batch" the inserts. If you collect 100-1000 rows, build a single INSERT with those rows, and then execute that one statement, the INSERTion will run 10 times as fast (see the sketch after this list).
  • Don't try to batch all 23,000 rows at once; you are likely to run into problems (hard to predict which).
  • On the other hand, if you can do LOAD DATA against those files, you can get rid of all your parsing code! And it will run at least as fast as batched inserts.
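A minimal sketch of the batching bullet, keeping the question's parsing but reusing one PreparedStatement per table, flushing with addBatch/executeBatch every ~1000 rows, and committing once per file. It assumes the same conn field and imports as the original method, and a Connector/J connection with rewriteBatchedStatements=true so the driver rewrites each batch into multi-row INSERTs:

private void readFileBatched() {
    final int BATCH_SIZE = 1000;
    DateFormat dateFormat = new SimpleDateFormat("yyyy-MM-dd", Locale.ENGLISH);
    File[] directoryListing = new File("src/bigdata/training_set/").listFiles();
    if (directoryListing == null) return;

    try (PreparedStatement insItem = conn.prepareStatement("INSERT IGNORE INTO item VALUES(?)");
         PreparedStatement insUser = conn.prepareStatement("INSERT IGNORE INTO user VALUES(?)");
         PreparedStatement insRating = conn.prepareStatement("INSERT IGNORE INTO rating VALUES(?,?,?,?)")) {
        conn.setAutoCommit(false);

        int x = 1;
        for (File itemFile : directoryListing) {
            try (BufferedReader in = new BufferedReader(new FileReader(itemFile))) {
                String line = in.readLine();
                int itemID = Integer.parseInt(line.substring(0, line.length() - 1));
                insItem.setInt(1, itemID);
                insItem.executeUpdate();                      // one item row per file

                int pending = 0;
                while ((line = in.readLine()) != null) {
                    int userID = Integer.parseInt(line.substring(0, line.indexOf(",")));
                    int rating = Integer.parseInt(line.substring(line.indexOf(",") + 1, line.lastIndexOf(",")));
                    java.sql.Date sqlDate = new java.sql.Date(
                            dateFormat.parse(line.substring(line.lastIndexOf(",") + 1)).getTime());

                    insUser.setInt(1, userID);
                    insUser.addBatch();

                    insRating.setInt(1, userID);
                    insRating.setInt(2, itemID);
                    insRating.setInt(3, rating);
                    insRating.setDate(4, sqlDate);
                    insRating.addBatch();

                    if (++pending >= BATCH_SIZE) {            // flush every ~1000 rows
                        insUser.executeBatch();
                        insRating.executeBatch();
                        pending = 0;
                    }
                }
                insUser.executeBatch();                       // flush the remainder
                insRating.executeBatch();
                conn.commit();                                // one commit per file, not per row
            }
            System.out.println("File " + x++ + " done.");
        }
    } catch (IOException | ParseException | SQLException e) {
        e.printStackTrace();
    }
    System.out.println("Done.");
}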
Rick James