
I have an application that accesses about 2 million tweets from a MySQL database. Specifically, one of the fields holds the tweet text (with a maximum length of 140 characters). I am splitting every tweet into word n-grams, where 1 <= n <= 3 (a rough sketch of this splitting is shown after the example). For example, consider the sentence:

I am a boring sentence.

The corresponding n-grams are:

I
I am
I am a
am
am a
am a boring
a
a boring
a boring sentence
boring
boring sentence
sentence
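
(As a rough illustration of the splitting I have in mind, here is a plain-Java sketch; the class name NGramSketch is made up, and my real code uses Lucene's StandardTokenizer and ShingleFilter as shown further down.)

    import java.util.ArrayList;
    import java.util.List;

    public class NGramSketch {
        // Build all word n-grams with 1 <= n <= maxN, in the order shown above:
        // for each starting word, emit its 1-gram, 2-gram, ... up to maxN.
        static List<String> ngrams(String sentence, int maxN) {
            String[] words = sentence.replaceAll("[^\\p{L}\\p{N}\\s]", "").split("\\s+");
            List<String> result = new ArrayList<String>();
            for (int start = 0; start < words.length; start++) {
                StringBuilder gram = new StringBuilder();
                for (int n = 1; n <= maxN && start + n <= words.length; n++) {
                    if (n > 1) gram.append(' ');
                    gram.append(words[start + n - 1]);
                    result.add(gram.toString());
                }
            }
            return result;
        }

        public static void main(String[] args) {
            for (String gram : ngrams("I am a boring sentence.", 3)) {
                System.out.println(gram);
            }
        }
    }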

With about 2 million tweets, I am generating a lot of data. Even so, I was surprised to get a heap error from Java:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at com.mysql.jdbc.MysqlIO.nextRowFast(MysqlIO.java:2145)
    at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1922)
    at com.mysql.jdbc.MysqlIO.readSingleRowSet(MysqlIO.java:3423)
    at com.mysql.jdbc.MysqlIO.getResultSet(MysqlIO.java:483)
    at com.mysql.jdbc.MysqlIO.readResultsForQueryOrUpdate(MysqlIO.java:3118)
    at com.mysql.jdbc.MysqlIO.readAllResults(MysqlIO.java:2288)
    at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2709)
    at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2728)
    at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2678)
    at com.mysql.jdbc.StatementImpl.executeQuery(StatementImpl.java:1612)
    at twittertest.NGramFrequencyCounter.FreqCount(NGramFrequencyCounter.java:49)
    at twittertest.Global.main(Global.java:40)

Here is the offending statement (line 49), as identified by the stack trace above from NetBeans:

results = stmt.executeQuery("select * from tweets");

So, if I am running out of memory, it must be that the driver is returning all the results at once and storing them in memory. What is the best way to solve this problem? Specifically, I have the following questions:

  1. How can I process pieces of results rather than the whole set?
  2. How would I increase the heap size? (If this is possible)

Feel free to include any suggestions, and let me know if you need more information.

EDIT: Instead of select * from tweets, I partitioned the table into equally sized subsets of about 10% of the total size and ran the program again. It looked like it was working fine, but it eventually gave me the same heap error. This is strange to me because I have run the same program successfully in the past with 610,000 tweets. Now I have about 2,000,000 tweets, roughly three times as much data. So if I split the data into thirds it should work, but I went further and split it into subsets of 10%.

Is some memory not being freed? Here is the rest of the code:

          results = stmt.executeQuery("select COUNT(*) from tweets");
          int num_tweets = 0;
          if(results.next())
          {
              num_tweets = results.getInt(1);
          }
          int num_intervals = 10;                  //split into equally sized subsets
          int interval_size = num_tweets/num_intervals;

          for(int i = 0; i < num_intervals-1; i++)        //process 10% of the data at a time
          {
            results = stmt.executeQuery( String.format("select * from tweets limit %s, %s", i*interval_size, (i+1)*interval_size));
            while(results.next())  //for each row in the tweets database
            {
                tweetID = results.getLong("tweet_id");
                curTweet = results.getString("tweet");
                int colPos = curTweet.indexOf(":");
                curTweet = curTweet.substring(colPos + 1);                           //trim off the RT and retweeted 
                if(curTweet != null)
                {
                    curTweet = removeStopWords(curTweet);
                }

                if(curTweet == null)
                {
                    continue;
                }
                reader = new StringReader(curTweet);
                tokenizer = new StandardTokenizer(Version.LUCENE_36, reader);
                //tokenizer = new StandardFilter(Version.LUCENE_36, tokenizer);
                //Set stopSet = StopFilter.makeStopSet(Version.LUCENE_36, stopWords, true);
                //tokenizer = new StopFilter(Version.LUCENE_36, tokenizer, stopSet);
                tokenizer = new ShingleFilter(tokenizer, 2, 3);

                charTermAttribute = tokenizer.addAttribute(CharTermAttribute.class);

                while(tokenizer.incrementToken())                  //insert each nGram from each tweet into the DB
                {
                    insertNGram.setInt(1, nGramID++);
                    insertNGram.setString(2, charTermAttribute.toString().toString());
                    insertNGram.setLong(3, tweetID);
                    insertNGram.executeUpdate();
                }
            }
          }
CodeKingPlusPlus

3 Answers


You can always increase the heap size available to your JVM using the -Xmx argument. You should read up on all the knobs available to you (e.g. perm gen size). Google for other options or read this SO answer.
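For example, something like this when launching the program (the sizes are only placeholders and yourApp.jar is hypothetical; tune them to what your machine actually has):

    java -Xmx4g -XX:MaxPermSize=256m -cp yourApp.jar twittertest.Global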

You probably can't do this kind of problem with a 32-bit machine. You'll want 64 bits and lots of RAM.

Another option would be to treat it as a map-reduce problem. Solve it on a cluster using Hadoop and Mahout.

duffymo
  • I have one machine... It seems silly to me that all I am doing is iterating over a result set and extracting n-grams. If I read one tweet into memory at each iteration and extracted all of its n-grams into the database, I would only need enough memory to process one tweet at a time. Do you know of any way to do this in Java? – CodeKingPlusPlus Jul 16 '13 at 01:22
  • The problem is perm gen. All those Strings! It doesn't matter how you bring those tweets into memory; you'll keep making Strings. Profile with VisualVM to see what to do about it. Get more info. – duffymo Jul 16 '13 at 01:43
  • I have done this before without a problem with 610,000 tweets... So if I processed 25% of the data at a time I should be fine. However, I tried processing 10% of the data at a time and I still got a heap error... – CodeKingPlusPlus Jul 16 '13 at 13:34
  • Nope, you should use Visual VM to profile. Stop guessing and assuming what's fine. Get some data; be a scientist. – duffymo Jul 16 '13 at 14:48

Don't fetch all the rows from the table at once. Select the data in pieces by putting a LIMIT clause on the query. Since you are using MySQL, your query would be select * from tweets limit 0,10. Here 0 is the starting row offset and 10 is the number of rows to fetch from that offset.
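
A rough sketch of paging through the table this way (the column names come from your code, the page size is only an example, and this is untested):

    // assumes a java.sql.Connection named conn is already open
    int pageSize = 10000;                      // rows per batch; tune to your memory budget
    PreparedStatement page = conn.prepareStatement(
            "select tweet_id, tweet from tweets order by tweet_id limit ?, ?");
    for (int offset = 0; ; offset += pageSize) {
        page.setInt(1, offset);                // starting row offset
        page.setInt(2, pageSize);              // number of rows in this page
        ResultSet rs = page.executeQuery();
        boolean sawRows = false;
        while (rs.next()) {
            sawRows = true;
            long tweetID = rs.getLong("tweet_id");
            String tweet = rs.getString("tweet");
            // ... build and insert the n-grams for this one tweet ...
        }
        rs.close();
        if (!sawRows) break;                   // past the last page
    }
    page.close();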

Kanagaraj M

Have you considered streaming the result set? About halfway down this page is a section on ResultSet, and it addresses your problem (I think?): http://dev.mysql.com/doc/refman/5.0/en/connector-j-reference-implementation-notes.html. Write the n-grams to a file, then process the next row? Or am I misunderstanding your problem?
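
If I remember that page correctly, the streaming pattern looks roughly like this (double-check the exact details against the docs; this assumes a java.sql.Connection named conn is already open):

    // A forward-only, read-only statement with fetch size Integer.MIN_VALUE
    // tells Connector/J to stream rows one at a time instead of buffering
    // the entire result set in memory.
    Statement stmt = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY,
                                          ResultSet.CONCUR_READ_ONLY);
    stmt.setFetchSize(Integer.MIN_VALUE);
    ResultSet results = stmt.executeQuery("select tweet_id, tweet from tweets");
    while (results.next()) {
        // process one tweet (write its n-grams out) before moving to the next row
    }
    results.close();
    stmt.close();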

jmpyle771