0

I want to scan through a huge corpus of text and count word frequencies (n-gram frequencies actually for those who are familiar with NLP/IR). I use a Java HashMap for this. So what happens is I process the text line by line. For each line, I extract the words, and for each word, I update the corresponding frequency in the hashmap.

The problem is that this process gets slower and slower. For example, it starts by processing around 100k lines / second - and the performance starts falling right away. After about 28 million lines , the performance has fallen to 16k lines / second - and of course keeps falling.

First thing that came to mind was that it was caused of too many entries in the hashmap, which caused every put and every get to be slower every time. So what I tried was to only keep the most (say 100k) frequent entries in the hashmap at anytime. This was done by using a second map that mapped frequencies to words (as in here: Automatically sorted by values map in Java )

This performed a lot faster in general. (although it started at 56 k lines / sec, by the time it reached 28 mil lines, the performance had only dropped to 36.5k lines / sec). However, this also kept falling, at a much slower rate - but the fact remains, that it kept falling.

Have you got any possible explanation of why does this happen when the hashmap's size remains the same? Do you think this has anything to do with the garbage collector? Meaning, that the fact that I keep putting, and removing object to/from hashmaps fragments up the memory or something? Or could it be a hashing function problem? Since I'm using strings, the hashing function is Java's default hashing function for strings.

Here is the part of my code that performs the aforementioned task:

http://pastebin.com/P8S6Sj86

NOTE: I am a Java newbie so any elaboration in your answers is more than welcome

Community
  • 1
  • 1
Alexandros
  • 4,425
  • 4
  • 23
  • 21
  • From the provided link: `Map ngramHash = new HashMap();`. Have you tried to compile that code? – Roman Sep 19 '11 at 18:32
  • have you try to check the ram usage ? – Heisenbug Sep 19 '11 at 18:36
  • I don't think the problem is the HashMap. Your code needs to be composed into several methods, which you instrument, then locate the performance problem. There are several potential areas - string concatenation in the inner loop, unnecessary creation of Double objects and Float objects, as well as a difficult-to-follow logic flow. – Steve McLeod Sep 19 '11 at 18:42

2 Answers2

3

I recommend using Java VisualVM to do some profiling. This comes with Java - go to the command line and type jvisualvm to run it. This makes it easy to see if memory churn is your problem, or if particular types of objects are being created hundreds of thousands of times.

If you break up your code into several methods, you'll also be able to tell which methods take too long to run.

I did notice that you are unnecessarily creating lots of objects in inner loops. This certainly won't help performance, although it may not be the main culprit.

For example:

float avg = new Float(sumItems) / new Float (freqMap.size());

should just be

float avg = (float)sumItems / freqMap.size();

Another piece of your code which also could be troublesome is:

System.out.println(numItems + " items counted");

Depending on your operating system or IDE, writing 100,000s of lines to the console requires significant time. Instead, just write an update of progress for each 1000 items.

Steve McLeod
  • 51,737
  • 47
  • 128
  • 184
  • sorry it took so long, I had to test things out - i did try those things and found out what the real problem was (it had to do with the data I was parsing, it was sorted in some way that more complex items were towards the end while) - so I solved the problem with shuffling my input data - thanks for the tips – Alexandros Sep 22 '11 at 13:40
1

Suggestion:

Try implementing a custom hashCode method for the object you're storing in your hashmap. Here are some links:

Java HashMap performance optimization / alternative

http://www.ibm.com/developerworks/java/library/j-jtp05273/index.html

http://www.javamex.com/tutorials/collections/hash_function_guidelines.shtml

Bad idea to use String key in HashMap?

Community
  • 1
  • 1
paulsm4
  • 114,292
  • 17
  • 138
  • 190