
I'm using the program below to insert values from very large .csv files (~2.5 million lines) into an SQLite DB. It starts very fast but slows over time, eventually hanging indefinitely at around 900,000 lines. My hunch is that it's eating up memory somehow, though not through an outright leak, since it never throws an `OutOfMemoryException` or similar. To be clear, the program never fails or crashes; it just gets slower until it stops progressing. All other processes on my laptop are affected as well, and eventually it takes ~10 seconds to even register mouse movements.

I'm not very experienced with databases, so it could easily be something stupid I'm doing in how I execute the `INSERT` statement. The most recent change I made was to use `PreparedStatement.addBatch()` and `PreparedStatement.executeBatch()`, and despite reading the documentation, I'm still not clear on whether I'm using them correctly. I'm using `sqlite-jdbc-3.7.2.jar`, if that makes a difference.

public class Database {

    public static void main(String[] args) {
        Connection c = connect("db.db");
//        createTable(c);
        addCSVToDatabase(c, "test-10000.csv");
//        print(c);
        disconnect(c);
    }

    public static void createTable(Connection c) {
        Statement stmt;
        String sql = "CREATE TABLE results("
                + "ID            INTEGER    NOT NULL    PRIMARY KEY AUTOINCREMENT, "
                + "TITLE         TEXT       NOT NULL, "
                + "URL           TEXT       NOT NULL    UNIQUE, "
                + "BEAN          BLOB"
                + ");";
        System.out.println("QUERY: " + sql);
        try {
            stmt = c.createStatement();
            stmt.executeUpdate(sql);
        } catch (SQLException e) { e.printStackTrace(); }
    }

    public static void addCSVToDatabase(Connection c, String csvFile) {

        BufferedReader reader = null;
        int x = 0;
        DBEntryBean b;
        String[] vals;
        ByteArrayOutputStream baos = null;
        ObjectOutputStream oos = null;
        PreparedStatement pstmt = null;
        String sql = "INSERT OR IGNORE INTO results("
                + "TITLE, "
                + "URL, "
                + "BEAN"
                + ") VALUES(?, ?, ?);";
        try {
            pstmt = c.prepareStatement(sql);
            reader = new BufferedReader(new InputStreamReader(new FileInputStream(csvFile), "UTF-8"));
            c.setAutoCommit(false);

            for (String line; (line = reader.readLine()) != null;) {

                vals = line.split("\\|"); // Each line is of the form: "title|URL|...|...|..."
                b = new DBEntryBean();
                b.setTitle(vals[0]);
                b.setURL(vals[1]);

                pstmt.setString(Constants.DB_COL_TITLE, b.getTitle());
                pstmt.setString(Constants.DB_COL_URL, b.getURL());

                // Store the DBEntryBean in the table so I can retrieve it, rather than construct a new one every time I need it.
                baos = new ByteArrayOutputStream();
                oos = new ObjectOutputStream(baos);
                oos.writeObject(b);
                pstmt.setBytes(Constants.DB_COL_BEAN, baos.toByteArray());
                pstmt.addBatch();
                pstmt.executeBatch();
                System.out.println("Line: " + x++);
            }
        } catch (Exception e) { e.printStackTrace();
        } finally {
            try {
                if (pstmt != null) { pstmt.close(); }
                c.setAutoCommit(true);
            } catch (SQLException e) { e.printStackTrace(); }
        }
    }

    private static Connection connect(String path) {

        String url = "jdbc:sqlite:" + path;
        Connection conn = null;
        try {
            Class.forName("org.sqlite.JDBC");
            conn = DriverManager.getConnection(url);
        } catch (Exception e) { e.printStackTrace(); }
        return conn;
    }

    private static void disconnect(Connection c) {
        try { if (c != null) { c.close(); }
        } catch (SQLException e) { e.printStackTrace(); }
    }

    private static void print(Connection c) {
        Statement stmt = null;
        String sql = "SELECT * FROM results;";
        ResultSet rs = null;
        try {
            stmt = c.createStatement();
            rs = stmt.executeQuery(sql);
            while (rs.next()) {
                System.out.println(rs.getString("TITLE"));
            }
        } catch (Exception e) { e.printStackTrace(); }
    }
}
Sam
  • Add a [`VACUUM`](https://sqlite.org/lang_vacuum.html). – Elliott Frisch Jan 08 '17 at 01:25
  • @ElliottFrisch Thanks, this sounds promising. Do you have any advice regarding how often I should execute the `VACUUM` statement in the loop? – Sam Jan 08 '17 at 01:30
  • I would suggest trying after 10k inserts and see if that improves things. – Elliott Frisch Jan 08 '17 at 01:31
  • Don't put all of the inserts into a single batch. As mentioned in an answer to your [original question](http://stackoverflow.com/a/41519079/6680611), the batches should be kept to a relatively small size. – cartant Jan 08 '17 at 01:53
  • Also, with the use of `executeBatch`, the toggling of `setAutoCommit` is not as important and, as there's no `commit`, it should be removed. – cartant Jan 08 '17 at 01:57
  • BTW, when I saw your previous question, it had already been [cross posted](http://meta.stackexchange.com/q/64068/338149), so I voted to close it. It was closed as a duplicate of an Android SQLite question - which I don't believe it is, as you are using JDBC. I also don't think it was a great candidate for Code Review, as if it takes over four days to perform a weekly process, the code isn't what I'd be calling working; it is far too slow. (As a reference, I'm able to import ~2,600,000 spatially indexed geometries in about three minutes - the SQLite file is ~850 MB.) – cartant Jan 08 '17 at 02:00
  • @cartant Wow, that is very fast. I had originally posted the previous question on SO, but a commenter suggested cross posting on Code Review, since it compiled. But I agree, given the fact it doesn't complete, I wouldn't call it working either. – Sam Jan 08 '17 at 02:08
  • Regarding putting all of the insert statements into a single batch, I was originally calling `addBatch()` on every iteration and then: `if(x%1000 == 0){ executeBatch(); }` but found that only every thousandth `INSERT` was actually added to the db, so I changed it to what you can see in this question. I realise I'm probably doing it wrong though. – Sam Jan 08 '17 at 02:11
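
For reference, a minimal sketch of what the `VACUUM` suggestion from the comments above might look like over JDBC. This helper is illustrative, not part of the original program; note that SQLite refuses to run `VACUUM` while a transaction is open, so it should only be called while auto-commit is enabled:

private static void vacuum(Connection c) {
    // Rebuilds the database file to reclaim unused pages and reduce
    // fragmentation. Must run outside any open transaction.
    Statement stmt = null;
    try {
        stmt = c.createStatement();
        stmt.executeUpdate("VACUUM;");
    } catch (SQLException e) { e.printStackTrace();
    } finally {
        try { if (stmt != null) { stmt.close(); } } catch (SQLException e) { e.printStackTrace(); }
    }
}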

1 Answer


Try removing the `setAutoCommit` calls and performing `executeBatch` only after a reasonably large number of inserts has been batched. Also, don't print to the console on every insert. For example:

public static void addCSVToDatabase(Connection c, String csvFile) {

    BufferedReader reader = null;
    int batch = 0;
    int total = 0;
    DBEntryBean b;
    String[] vals;
    ByteArrayOutputStream baos = null;
    ObjectOutputStream oos = null;
    PreparedStatement pstmt = null;
    String sql = "INSERT OR IGNORE INTO results("
        + "TITLE, "
        + "URL, "
        + "BEAN"
        + ") VALUES(?, ?, ?);";

    try {
        pstmt = c.prepareStatement(sql);
        reader = new BufferedReader(new InputStreamReader(new FileInputStream(csvFile), "UTF-8"));

        for (String line; (line = reader.readLine()) != null;) {

            vals = line.split("\\|");
            b = new DBEntryBean();
            b.setTitle(vals[0]);
            b.setURL(vals[1]);

            baos = new ByteArrayOutputStream();
            oos = new ObjectOutputStream(baos);
            oos.writeObject(b);
            oos.flush(); // flush so toByteArray() sees the complete serialized object

            pstmt.setString(Constants.DB_COL_TITLE, b.getTitle());
            pstmt.setString(Constants.DB_COL_URL, b.getURL());
            pstmt.setBytes(Constants.DB_COL_BEAN, baos.toByteArray());

            // Accumulate the inserts and only execute the batch every 10,000 rows.
            pstmt.addBatch();
            ++batch;
            ++total;

            if (batch == 10000) {
                pstmt.executeBatch();
                System.out.println("Total: " + total);
                batch = 0;
            }
        }

        // Execute whatever remains in the final, partial batch.
        if (batch > 0) {
            pstmt.executeBatch();
            System.out.println("Total: " + total);
        }

    } catch (Exception e) { e.printStackTrace();
    } finally {
        try {
            if (pstmt != null) { pstmt.close(); }
            if (reader != null) { reader.close(); }
        } catch (Exception e) { e.printStackTrace(); }
    }
}

If performance is still terrible, I would suggest changing one thing at a time to see if you can isolate the problem. For example, remove the `UNIQUE` index on the `URL` column to see what performance is like when every row is inserted, or skip inserting the `BLOB`, etc.
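
For instance, a hypothetical variant of the original `CREATE TABLE` statement for such a test, identical except that the uniqueness check on `URL` is dropped:

String sql = "CREATE TABLE results("
        + "ID            INTEGER    NOT NULL    PRIMARY KEY AUTOINCREMENT, "
        + "TITLE         TEXT       NOT NULL, "
        + "URL           TEXT       NOT NULL, " // UNIQUE removed, so no implicit index is maintained on inserts
        + "BEAN          BLOB"
        + ");";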

cartant