I want to scan through a huge corpus of text and count word frequencies (n-gram frequencies, actually, for those familiar with NLP/IR). I use a Java HashMap for this. I process the text line by line; for each line I extract the words, and for each word I update its frequency in the HashMap.
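To give a rough idea of the basic approach, here is a simplified sketch of the counting loop (the file name and the plain whitespace tokenization are just placeholders; my real code extracts n-grams rather than single words):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class WordCounter {
    public static void main(String[] args) throws IOException {
        Map<String, Integer> counts = new HashMap<>();
        try (BufferedReader reader = new BufferedReader(new FileReader("corpus.txt"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // split the line into words; the real code would build
                // n-grams from adjacent tokens here
                for (String word : line.split("\\s+")) {
                    if (word.isEmpty()) continue;
                    // increment the count, inserting 1 if the word is new
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        System.out.println("distinct entries: " + counts.size());
    }
}
```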
The problem is that this process gets slower and slower. It starts out processing around 100k lines/second, and performance begins falling right away. After about 28 million lines it has dropped to 16k lines/second, and of course it keeps falling.
My first thought was that this was caused by having too many entries in the HashMap, making every put and every get slower over time. So I tried keeping only the most frequent entries (say the top 100k) in the HashMap at any time. I did this with a second map that maps frequencies to words (as in here: Automatically sorted by values map in Java).
This performed a lot faster in general (it started at 56k lines/sec, and by the time it reached 28 million lines performance had only dropped to 36.5k lines/sec). However, it still kept falling, just at a much slower rate; the fact remains that it kept falling.
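For reference, the bookkeeping for that second approach looks roughly like the sketch below (simplified; the class name, the `add` method, and the hard-coded 100k cap are illustrative placeholders, not my actual code):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class BoundedCounter {
    private static final int MAX_ENTRIES = 100_000;

    private final Map<String, Integer> counts = new HashMap<>();
    // secondary index: frequency -> words currently at that frequency,
    // kept sorted so the least frequent entry can be found quickly
    private final TreeMap<Integer, Set<String>> byFreq = new TreeMap<>();

    public void add(String word) {
        int oldCount = counts.getOrDefault(word, 0);
        int newCount = oldCount + 1;
        counts.put(word, newCount);

        // move the word from its old frequency bucket to the new one
        if (oldCount > 0) {
            Set<String> oldBucket = byFreq.get(oldCount);
            oldBucket.remove(word);
            if (oldBucket.isEmpty()) {
                byFreq.remove(oldCount);
            }
        }
        byFreq.computeIfAbsent(newCount, k -> new HashSet<>()).add(word);

        // if the map is over the cap, evict one of the least frequent words
        if (counts.size() > MAX_ENTRIES) {
            Map.Entry<Integer, Set<String>> lowest = byFreq.firstEntry();
            String victim = lowest.getValue().iterator().next();
            lowest.getValue().remove(victim);
            if (lowest.getValue().isEmpty()) {
                byFreq.remove(lowest.getKey());
            }
            counts.remove(victim);
        }
    }
}
```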
Can you think of any explanation for why this keeps happening even though the HashMap's size stays the same? Could it have something to do with the garbage collector? That is, does constantly putting objects into and removing them from the HashMaps fragment the memory somehow? Or could it be a hashing problem? Since the keys are strings, I'm using Java's default hash function for strings.
Here is the part of my code that performs the aforementioned task:
NOTE: I am a Java newbie, so any elaboration in your answers is more than welcome.