2

I have a MySQL database with 8+ million records that I need to process in my Java application (the processing can't be done in the database itself), but I run into memory issues when trying to read them in.

I have already tried some solutions from people with similar problems (e.g., link), but none of them worked for me. I tried setting the fetch size, but no luck! My application is built around a BlockingQueue: the Producer continuously reads records from the database and stores them in the queue so the Consumer can process them. This way I limit the number of records held in main memory at any one time.
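A stripped-down sketch of that pattern (the class name, record type, and loop bounds here are simplified placeholders, not my actual code):

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class ProducerConsumerSketch {

    public static void main(String[] args) {
        // NOTE: the no-arg constructor makes this queue effectively
        // unbounded (capacity Integer.MAX_VALUE) -- see Edit3 below.
        final BlockingQueue<String> queue = new LinkedBlockingQueue<>();

        Thread producer = new Thread(() -> {
            try {
                // In the real code each record comes from the JDBC ResultSet.
                for (int i = 0; i < 1_000_000; i++) {
                    queue.put("record-" + i);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        Thread consumer = new Thread(() -> {
            try {
                while (true) {
                    process(queue.take());
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        producer.start();
        consumer.start();
    }

    static void process(String record) {
        // stopword removal, tokenization, ... happens here
    }
}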

My code works for a small number of records (I tested with 1000), so I suspect the phase from the database to my application is what needs fixing.

Edit1

connection = ConnectionFactory.getConnection(DATABASE);
// JDBC signature is prepareStatement(sql, resultSetType, resultSetConcurrency),
// so TYPE_FORWARD_ONLY must come before CONCUR_READ_ONLY.
preparedStatement = connection.prepareStatement(query,
        java.sql.ResultSet.TYPE_FORWARD_ONLY,
        java.sql.ResultSet.CONCUR_READ_ONLY);
preparedStatement.setFetchSize(1000);
// executeQuery() already returns the ResultSet; no separate getResultSet() needed.
rs = preparedStatement.executeQuery();

Edit2

Now I finally get some output other than just watching my memory drain. I get this error:

Exception in thread "Thread-0" java.lang.OutOfMemoryError: Java heap space
at com.mysql.jdbc.Buffer.<init>(Buffer.java:59)
at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:2089)
at com.mysql.jdbc.MysqlIO.readSingleRowSet(MysqlIO.java:3554)
at com.mysql.jdbc.MysqlIO.getResultSet(MysqlIO.java:491)
at com.mysql.jdbc.MysqlIO.readResultsForQueryOrUpdate(MysqlIO.java:3245)
at com.mysql.jdbc.MysqlIO.readAllResults(MysqlIO.java:2413)
at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2836)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2828)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2777)
at com.mysql.jdbc.StatementImpl.executeQuery(StatementImpl.java:1651)
at razoralliance.dao.DataDAOImpl.getAllDataRS(DataDAOImpl.java:38)
at razoralliance.app.DataProducer.run(DataProducer.java:34)
at java.lang.Thread.run(Thread.java:722)

Edit3

I did some more research on the Producer-Consumer pattern, and it turns out that when the Consumer cannot keep up with the Producer, the queue grows automatically and eventually runs out of memory. So I switched to ArrayBlockingQueue, which has a fixed capacity. However, I still leak memory: Eclipse Memory Analyzer says the ArrayBlockingQueue occupies 65.31% of my memory, even though it only holds 1000 objects, each with 4 text fields.
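For reference, the bounded hand-off now looks roughly like this (capacity and element type are illustrative):

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class BoundedQueueSketch {
    public static void main(String[] args) throws InterruptedException {
        // Fixed capacity: at most 1000 records can sit in the queue at once.
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1000);

        // Producer side: put() blocks once the queue is full, which throttles
        // the Producer automatically (add() would throw IllegalStateException).
        queue.put("record");

        // Consumer side: take() blocks while the queue is empty.
        String record = queue.take();
        System.out.println(record);
    }
}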

RazorAlliance192
  • Can you tell more about what that processing does? – fge Mar 17 '14 at 11:06
  • @fge, It's for a project where I need to do text categorization. The database holds the raw representation of my data. During processing I have to remove useless information to improve the text categorization phase: stopword removal, tokenization, ... – RazorAlliance192 Mar 17 '14 at 11:08
  • Did you call Statement.setMaxRows(0) so there's no limit to the ResultSet? – anonymous Mar 17 '14 at 11:08
  • Do you get any errors or exceptions? Post them! – Keerthivasan Mar 17 '14 at 11:09
  • @anonymous, no I did not do such a thing; I'll add the part of my code where I contact the database – RazorAlliance192 Mar 17 '14 at 11:09
  • I'm not sure, but try this: http://stackoverflow.com/a/2181176/3242978 – jmail Mar 17 '14 at 11:10
  • @Octopus, no I don't get any errors. I am using a utility called Memory Clean for Mac that visualizes the amount of memory in use. Once I start running my application, the available memory goes down rapidly. That's the only signal I get. – RazorAlliance192 Mar 17 '14 at 11:12
  • You will have to call setMaxRows(0) before executing the statement to ensure there is no limit to the ResultSet. – anonymous Mar 17 '14 at 11:15
  • @RazorAlliance192 memory usage growing would suggest you're filling up your queue - most likely cause would be that the *consumer* is reading too slowly or isn't reading at all. You should either throttle the *producer* or add more *consumers* – Nick Holt Mar 17 '14 at 11:16
  • Fetch size has no impact on memory consumption; it only influences how many network round trips are performed. – Steven Pessall Mar 17 '14 at 11:16
  • @StevenPessall, from the website user Vlad gave, I read the following: _It's a mysql thing. MySQL does not support that traditional setFetchSize, so any value is ignored except Integer.MIN_VALUE, which causes proper streaming._ So Integer.MIN_VALUE can make streaming work? – RazorAlliance192 Mar 17 '14 at 11:20
  • I wasn't aware of this peculiarity of the MySQL driver, so you can disregard my comment. But if streaming does not solve your problem, is pagination a possibility? I.e., splitting the processing into multiple transactions. – Steven Pessall Mar 17 '14 at 12:13
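(For reference, the pagination idea from the last comment could be implemented with keyset pagination; the table, column names, and connection details below are hypothetical:)

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Keyset pagination: fetch a fixed-size batch per query, remembering the
// last seen primary key instead of using OFFSET (which gets slow on 8M rows).
public class PaginatedRead {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:mysql://localhost:3306/mydb", "user", "pass");
             PreparedStatement ps = conn.prepareStatement(
                 "SELECT id, text_column FROM mytable WHERE id > ? ORDER BY id LIMIT ?")) {

            long lastId = 0;
            final int batchSize = 1000;

            while (true) {
                ps.setLong(1, lastId);
                ps.setInt(2, batchSize);
                int rows = 0;
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        lastId = rs.getLong("id");
                        // process rs.getString("text_column") here
                        rows++;
                    }
                }
                if (rows < batchSize) break; // last batch reached
            }
        }
    }
}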

2 Answers

4

You will need to stream your results. With the MySQL driver it appears you have to set CONCUR_READ_ONLY and TYPE_FORWARD_ONLY for your ResultSet. Also, set the fetch size accordingly: stmt.setFetchSize(Integer.MIN_VALUE);

By default, ResultSets are completely retrieved and stored in memory. In most cases this is the most efficient way to operate, and due to the design of the MySQL network protocol is easier to implement. If you are working with ResultSets that have a large number of rows or large values, and cannot allocate heap space in your JVM for the memory required, you can tell the driver to stream the results back one row at a time.

To enable this functionality, create a Statement instance in the following manner:

stmt = conn.createStatement(java.sql.ResultSet.TYPE_FORWARD_ONLY,
        java.sql.ResultSet.CONCUR_READ_ONLY);
stmt.setFetchSize(Integer.MIN_VALUE);

The combination of a forward-only, read-only result set, with a fetch size of Integer.MIN_VALUE serves as a signal to the driver to stream result sets row-by-row. After this, any result sets created with the statement will be retrieved row-by-row.

There are some caveats with this approach...
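Put together, a complete streaming read might look like this (the URL, credentials, and query are placeholders for your own):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class StreamingRead {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:mysql://localhost:3306/mydb", "user", "pass");
             Statement stmt = conn.createStatement(
                 ResultSet.TYPE_FORWARD_ONLY,
                 ResultSet.CONCUR_READ_ONLY)) {

            stmt.setFetchSize(Integer.MIN_VALUE); // signal row-by-row streaming

            try (ResultSet rs = stmt.executeQuery("SELECT * FROM mytable")) {
                while (rs.next()) {
                    // Only the current row is buffered by the driver;
                    // hand it off to the queue/processing here.
                    String value = rs.getString(1);
                }
            }
            // Caveat: while streaming, you must read the ResultSet to the end
            // (or close it) before issuing other queries on the same connection.
        }
    }
}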

Vlad
  • As you can see in the code snippet I added, I already use this, right? – RazorAlliance192 Mar 17 '14 at 11:18
  • @RazorAlliance192 that was not in the initial question and you're still missing `stmt.setFetchSize(Integer.MIN_VALUE);` – Vlad Mar 17 '14 at 11:22
  • My apologies, good sir! I thought Integer.MIN_VALUE was just a placeholder for some number I could choose myself (e.g., only 1000 rows fetched at a time) – RazorAlliance192 Mar 17 '14 at 11:27
  • What other limitations or caveats exist with this approach? Is streaming less efficient? Please edit your answer to explain more clearly benefits/pitfalls with this approach. – iordanis May 04 '14 at 17:38
0

Why don't you try the approach from this solution:

Problem exporting a lot of data from database to .csv with java

Instead of fetching the entire result set, records can be fetched one by one and then processed. The answer I am referring to fetches records one at a time and writes them to a file, but you can use the same technique for your processing. This is one approach you can use.

Another approach is to use multithreading, fetching records on demand and processing them in separate threads, as in the sketch below.
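For example (a sketch only; the query, column, pool size, and connection details are placeholders), stream the rows one at a time and hand each one to a thread pool for processing:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class FetchAndProcess {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4); // worker threads

        try (Connection conn = DriverManager.getConnection(
                 "jdbc:mysql://localhost:3306/mydb", "user", "pass");
             Statement stmt = conn.createStatement(
                 ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {

            stmt.setFetchSize(Integer.MIN_VALUE); // stream, don't buffer all rows

            try (ResultSet rs = stmt.executeQuery("SELECT text_column FROM mytable")) {
                while (rs.next()) {
                    final String text = rs.getString(1); // copy value out of the row
                    pool.submit(() -> process(text));    // process in parallel
                }
            }
        } finally {
            pool.shutdown();
        }
    }

    static void process(String text) { /* stopword removal, tokenization, ... */ }
}

Note that Executors.newFixedThreadPool is backed by an unbounded work queue, so for millions of rows you would still want a bounded hand-off (e.g., the ArrayBlockingQueue from the question) to avoid trading one memory problem for another.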

keepmoving