5

I need to load 100 million+ rows from a MySQL database into memory. My Java program fails with `java.lang.OutOfMemoryError: Java heap space`. I have 8 GB of RAM in my machine and I have set `-Xmx6144m` in my JVM options.

This is my code

public List<Record> loadTrainingDataSet() {

    ArrayList<Record> records = new ArrayList<Record>();
    try {
        Statement s = conn.createStatement(java.sql.ResultSet.TYPE_FORWARD_ONLY, java.sql.ResultSet.CONCUR_READ_ONLY);
        s.executeQuery("SELECT movie_id,customer_id,rating FROM ratings");
        ResultSet rs = s.getResultSet();
        int count = 0;
        while (rs.next()) {
            // build up the full result set as a list in memory
            records.add(new Record(rs.getInt("movie_id"), rs.getInt("customer_id"), rs.getInt("rating")));
            count++;
        }
    } catch (Exception e) {
        System.err.println("Cannot connect to database server " + e);
    }
    return records;
}

Any idea how to overcome this problem?


UPDATE

I came across this post, and based on it as well as the comments below I updated my code. It seems I am able to load the data into memory with the same -Xmx6144m setting, but it takes a long time.

Here is my code.

...
import org.apache.mahout.math.SparseMatrix;
...

@Override
public SparseMatrix loadTrainingDataSet() {
    long t1 = System.currentTimeMillis();
    SparseMatrix ratings = new SparseMatrix(NUM_ROWS,NUM_COLS);
    int REC_START = 0;
    int REC_END = 0;

    try {
        for (int i = 1; i <= 101; i++) {
            long t11 = System.currentTimeMillis();
            REC_END = 1000000 * i;
            Statement s = conn.createStatement(java.sql.ResultSet.TYPE_FORWARD_ONLY,
                    java.sql.ResultSet.CONCUR_READ_ONLY);
            s.setFetchSize(Integer.MIN_VALUE);
            ResultSet rs = s.executeQuery("SELECT movie_id,customer_id,rating FROM ratings LIMIT " + REC_START + "," + REC_END);//100480507
            while (rs.next()) {
                int movieId = rs.getInt("movie_id");
                int customerId = rs.getInt("customer_id");
                byte rating = (byte) rs.getInt("rating");
                ratings.set(customerId,movieId,rating);
            }
            long t22 = System.currentTimeMillis();
            System.out.println("Round " + i + " completed " + (t22 - t11) / 1000 + " seconds");
            rs.close();
            s.close();
        }

    } catch (Exception e) {
        System.err.println("Cannot connect to database server " + e);
    } finally {
        if (conn != null) {
            try {
                conn.close();
                System.out.println("Database connection terminated");
            } catch (Exception e) { /* ignore close errors */ }
        }
    }
    long t2 = System.currentTimeMillis();
    System.out.println(" Took " + (t2 - t1) / 1000 + " seconds");
    return ratings;
}

Loading the first 100,000 rows took 2 seconds. Loading the 29th batch of 100,000 rows took 46 seconds. I stopped the process in the middle since it was taking too much time. Are these acceptable amounts of time? Is there a way to improve the performance of this code? I am running this on an 8 GB RAM, 64-bit Windows machine.

  • Why would you possibly need them all at once? Implement some kind of paging on the database side, and get them in chunks as you need them. – Bridge Jan 26 '13 at 10:05
  • Not sure what your purpose is, but you could also make use of http://lucene.apache.org/solr/ in order to load them, and you can write Jasper etc. on top of it; it will be much, much faster and more efficient than direct interaction with the DB. – Reno Jones Jan 26 '13 at 10:07
  • Consider processing/displaying data in small groups aka paging. – S.D. Jan 26 '13 at 10:11
  • @Bridge I am randomly accessing records in order to process the data. So I can't go for any paging scheme. – ravindrab Jan 26 '13 at 10:18
  • Please explain the need. Why would you go randomly? – Narendra Pathai Jan 26 '13 at 10:19
  • @Narendra Pathai After loading this data into memory I am applying a clustering algorithm to the data. The randomness lies in the details of the algorithm. – ravindrab Jan 26 '13 at 10:25
  • An ArrayList is a bad choice when working with very large data sets. An ArrayList initialized with its default constructor is infinitely worse. But even if you switch to a straight array you are going to have issues fitting all that data in main memory at once. – Perception Jan 26 '13 at 10:27
  • @AlanB Here is a similar question: http://stackoverflow.com/questions/3682614/jdbc-how-to-read-all-rows-from-huge-table – Bulat Jan 26 '13 at 11:05
  • @Perception +1 Yes, there is a significant difference. I saw that a plain array of objects is 8 times faster than using an ArrayList. – ravindrab Jan 27 '13 at 08:22

4 Answers

13

A hundred million records means that each record may take up at most about 50 bytes in order to fit within 6 GB, leaving some extra space for other allocations. In Java 50 bytes is nothing; a mere `Object[]` takes 32 bytes per element. You must find a way to use the results immediately in your `while (rs.next())` loop and not retain them in full.
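A minimal sketch of that approach, assuming the `conn` from the question and a hypothetical `process(...)` callback standing in for the clustering step; per the comments below, with MySQL Connector/J a forward-only, read-only statement plus `setFetchSize(Integer.MIN_VALUE)` makes the driver stream rows instead of buffering the whole result:

Statement s = conn.createStatement(java.sql.ResultSet.TYPE_FORWARD_ONLY,
        java.sql.ResultSet.CONCUR_READ_ONLY);
s.setFetchSize(Integer.MIN_VALUE); // Connector/J: stream rows one by one instead of buffering them all

ResultSet rs = s.executeQuery("SELECT movie_id,customer_id,rating FROM ratings");
while (rs.next()) {
    // consume each row right here; nothing is added to a List, so the heap stays small
    process(rs.getInt("movie_id"), rs.getInt("customer_id"), (byte) rs.getInt("rating")); // process(...) is a placeholder
}
rs.close();
s.close();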

Marko Topolnik
  • The problem is I get the `java.lang.OutOfMemoryError` in the `s.executeQuery(` line itself. – ravindrab Jan 26 '13 at 10:16
  • @AlanB did you try to do this without s.getResultSet() ? – Bulat Jan 26 '13 at 10:38
  • @Bulat How is he getting his results without a ResultSet? – BackSlash Jan 26 '13 at 10:46
  • @Harlandraka Like this: `ResultSet rs = stmt.executeQuery("SELECT a, b FROM TABLE2");` – Bulat Jan 26 '13 at 10:49
  • @Bulat Yes, it is exactly the same as s.getResultSet(). – BackSlash Jan 26 '13 at 10:56
  • +1 I suspect that to load this much data at once you need much more memory. Given you can buy 32 GB for less than $200, I would start with at least that much for any new system. – Peter Lawrey Jan 26 '13 at 11:01
  • @AlanB If you get an OOME at `executeQuery`, this is a sign of either a poor JDBC driver or, worse, a poor database: instead of providing a cursor over the result set, it eagerly tries to load the entire result set into memory. What database are you using? – Marko Topolnik Jan 26 '13 at 11:47
  • The MySQL JDBC driver by default loads all rows into the ResultSet. You need to specify `setFetchSize(Integer.MIN_VALUE)` to actually get it to fetch row by row. – Mark Rotteveel Jan 26 '13 at 12:06
  • @MarkoTopolnik I am using a MySQL database. – ravindrab Jan 26 '13 at 23:52
  • @MarkRotteveel I used `st.setFetchSize(50);`. In fact I used the example from http://stackoverflow.com/questions/3682614/jdbc-how-to-read-all-rows-from-huge-table, but the query still didn't finish executing. I made sure I have `conn.setAutoCommit(false);`. Is this specific to MySQL? – ravindrab Jan 27 '13 at 02:52
  • @AlanB MySQL ignores all values passed to `setFetchSize` except `Integer.MIN_VALUE`; see the [implementation notes](http://dev.mysql.com/doc/refman/5.5/en/connector-j-reference-implementation-notes.html), under `ResultSet`. – Mark Rotteveel Jan 27 '13 at 08:31
3

The problem is I get the java.lang.OutOfMemoryError in the s.executeQuery( line itself

You can split your query into multiple ones:

    s.executeQuery("SELECT movie_id,customer_id,rating FROM ratings LIMIT 0,300"); //shows the first 300 results
    //process this first result
    s.executeQuery("SELECT movie_id,customer_id,rating FROM ratings LIMIT 300,600");//shows 300 results starting from the 300th one
    //process this second result
    //etc

You can wrap this in a loop that stops when no more results are found, as in the sketch below.
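A rough sketch of such a loop, assuming the open `conn` from the question; `BATCH` and `processRow(...)` are placeholder names, and MySQL's `LIMIT offset, rowCount` syntax is used so each query fetches the next chunk:

final int BATCH = 300;   // hypothetical chunk size
int offset = 0;
boolean more = true;
while (more) {
    more = false;
    Statement s = conn.createStatement();
    ResultSet rs = s.executeQuery(
            "SELECT movie_id,customer_id,rating FROM ratings LIMIT " + offset + "," + BATCH);
    while (rs.next()) {
        more = true;     // got at least one row, so try another chunk afterwards
        processRow(rs.getInt("movie_id"), rs.getInt("customer_id"), rs.getInt("rating")); // placeholder for your own handling
    }
    rs.close();
    s.close();
    offset += BATCH;
}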

BackSlash
2

You can call stmt.setFetchSize(50); and conn.setAutoCommit(false); to avoid reading the entire ResultSet into memory.

Here's what the docs say:

Getting results based on a cursor

By default the driver collects all the results for the query at once. This can be inconvenient for large data sets so the JDBC driver provides a means of basing a ResultSet on a database cursor and only fetching a small number of rows.

A small number of rows are cached on the client side of the connection and when exhausted the next block of rows is retrieved by repositioning the cursor.

Note:

  • Cursor based ResultSets cannot be used in all situations. There are a number of restrictions which will make the driver silently fall back to fetching the whole ResultSet at once.
  • The connection to the server must be using the V3 protocol. This is the default for (and is only supported by) server versions 7.4 and later.
  • The Connection must not be in autocommit mode. The backend closes cursors at the end of transactions, so in autocommit mode the backend will have closed the cursor before anything can be fetched from it.
  • The Statement must be created with a ResultSet type of ResultSet.TYPE_FORWARD_ONLY. This is the default, so no code will need to be rewritten to take advantage of this, but it also means that you cannot scroll backwards or otherwise jump around in the ResultSet.
  • The query given must be a single statement, not multiple statements strung together with semicolons.

Example: Setting fetch size to turn cursors on and off.

Changing code to cursor mode is as simple as setting the fetch size of the Statement to the appropriate size. Setting the fetch size back to 0 will cause all rows to be cached (the default behaviour).

Class.forName("com.mysql.jdbc.Driver");
Connection conn = DriverManager.getConnection("jdbc:mysql://localhost/test?useCursorFetch=true&user=root");
// make sure autocommit is off 
conn.setAutoCommit(false); 
Statement st = conn.createStatement();

// Turn use of the cursor on. 
st.setFetchSize(50);
ResultSet rs = st.executeQuery("SELECT * FROM mytable");
while (rs.next()) {
   System.out.print("a row was returned.");
} 
rs.close();

// Turn the cursor off. 
st.setFetchSize(0);
rs = st.executeQuery("SELECT * FROM mytable");
while (rs.next()) {
   System.out.print("many rows were returned.");
} 
rs.close();

// Close the statement. 
st.close();
Vipin Jain
0

You would have to redesign and load the data into memory in chunks.

Example

1) Load the first 1 million records from the DB using appropriate SQL (SQL that picks only 1 million rows) and process them.
2) Load the next similar chunk, and so on.

setFetchSize alone won't solve this problem.
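A minimal sketch of such chunked loading, under two assumptions not in the original answer: that the ratings table has an indexed auto-increment column `id` (not shown in the question), and that `ratings` is the `SparseMatrix` from the question's update. Seeking past the last seen id avoids re-scanning ever-growing prefixes the way a plain `LIMIT 0,N` does:

final int CHUNK = 1000000;
long lastId = 0;
int rowsInChunk;
do {
    rowsInChunk = 0;
    PreparedStatement ps = conn.prepareStatement(
            // "id" is an assumed indexed auto-increment column, not shown in the question
            "SELECT id,movie_id,customer_id,rating FROM ratings WHERE id > ? ORDER BY id LIMIT " + CHUNK);
    ps.setLong(1, lastId);
    ResultSet rs = ps.executeQuery();
    while (rs.next()) {
        lastId = rs.getLong("id");
        ratings.set(rs.getInt("customer_id"), rs.getInt("movie_id"), (byte) rs.getInt("rating"));
        rowsInChunk++;
    }
    rs.close();
    ps.close();
} while (rowsInChunk == CHUNK); // a short final chunk means we are done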