
I need to write a program that counts the frequency of each word in a text. Additionally, I need to be able to return a list of the n most frequent words (if several words have the same frequency, they are sorted alphabetically). There is also a list of words that are not counted (stop words).

  1. What structure to use for the stop words
  • I think a HashSet would be the most efficient
  2. What structure to use for the words and frequency mapping
  • A HashMap would be more efficient for adding words but would require sorting afterwards; a TreeMap requires O(log n) time per insertion, but keeps its entries sorted

What approach overall is more efficient?
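For concreteness, here is a rough sketch of the counting pass I have in mind (identifiers are illustrative; the Map implementation is exactly the part I'm unsure about):

```java
import java.util.*;

public class WordCount {
    // Illustrative counting pass; the concrete Map implementation
    // (HashMap vs. TreeMap) is the open question.
    static Map<String, Integer> countWords(String text, Set<String> stopWords) {
        Map<String, Integer> frequencies = new HashMap<>(); // or new TreeMap<>()
        for (String word : text.toLowerCase().split("\\W+")) {
            if (!word.isEmpty() && !stopWords.contains(word)) {
                Integer count = frequencies.get(word);
                frequencies.put(word, count == null ? 1 : count + 1);
            }
        }
        return frequencies;
    }
}
```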

P.S. @Moderators: I know there is a similar question, but I have different constraints that require a different structure.

Bojan Trajkovski
  • In which way does your question differ from the one you refer to? It looks completely similar, and the accepted answer for that question answers yours as well. – Roman Nov 26 '13 at 13:31
  • @Roman I don't need to "print each of the unique words in alphabetical order"; I have to get the k most used words, so a TreeMap is not the obvious solution. – Bojan Trajkovski Nov 26 '13 at 22:03

3 Answers


Let's assume there are k words in total and m distinct words and you want the n most frequent words.

TreeMap

Since there can never be more than m words in the map, each update / insert will cost O(log m), giving a total running time of O(k log m).

HashMap

Each update / insert will cost expected O(1), taking O(k) for all words.

Then, since there will be m words in the map, sorting will take O(m log m).

But we can do better than sorting - we can iterate through the HashMap and maintain a heap (PriorityQueue) of the n most frequent words (sorted primarily by frequency and secondarily alphabetically). After each insert into the heap, if the size is greater than n, we remove the least frequent word. This will take O(m log n).

So, the total running time would be expected O(k + m log n).
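A minimal sketch of that heap pass, assuming the counts already sit in a HashMap (names are illustrative):

```java
import java.util.*;

public class HeapTopN {
    // The min-heap's head is the entry to evict first: lowest frequency,
    // and among equal frequencies the alphabetically *last* word, since
    // alphabetically earlier words should win ties.
    static List<String> topN(Map<String, Integer> counts, int n) {
        Comparator<Map.Entry<String, Integer>> evictFirst =
                Comparator.comparingInt((Map.Entry<String, Integer> e) -> e.getValue())
                          .thenComparing(Map.Entry::getKey, Comparator.reverseOrder());
        PriorityQueue<Map.Entry<String, Integer>> heap = new PriorityQueue<>(evictFirst);
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            heap.offer(e);                     // O(log n)
            if (heap.size() > n) heap.poll();  // keep only the strongest n
        }
        // Polling yields weakest first; prepend to reverse into rank order.
        LinkedList<String> result = new LinkedList<>();
        while (!heap.isEmpty()) result.addFirst(heap.poll().getKey());
        return result;
    }
}
```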

Comparison

Since n <= m and m <= k, we know that m log n <= k log m, and, assuming there are plenty of duplicates or n is somewhat smaller than m, k + m log n <= k log m, so the HashMap is generally the preferred option.

Bernhard Barker

Stop words: HashSet or regexp.

I'd go with a hash map for the frequency counts:

  1. Your universe of words w is (more or less) finite
  2. Your input size n is (presumably) unbounded

So you'll be doing lots of insertions. Overall insertion cost for HashMap: O(n), for TreeMap: O(n log w).

Sorting cost for the hash map: O(w log w), plus O(w) extraction overhead. Even though that cost is zero for a TreeMap, the O(w log w) term becomes negligible relative to O(n) for large n.

Note that in general there is no guaranteed way to figure this out without benchmarking both, though.
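For concreteness, a sketch of that sort-and-extract step over the HashMap (the comparator and names are my own illustration, using the alphabetical tie-break from the question):

```java
import java.util.*;

public class SortExtract {
    // Copy the w entries out of the HashMap (O(w)), sort by count
    // descending with an alphabetical tie-break (O(w log w)), take the
    // first n.
    static List<String> topN(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort(Comparator
                .comparing(Map.Entry<String, Integer>::getValue, Comparator.reverseOrder())
                .thenComparing(Map.Entry::getKey));
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> e : entries.subList(0, Math.min(n, entries.size())))
            result.add(e.getKey());
        return result;
    }
}
```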

creichen
  • The problem is that you cannot sort a HashMap in place. And if it's so huge that you care about performance, then copying it to an array is quite a problem for you. – Roman Nov 26 '13 at 13:34
  • We're talking about word counts here. Those aren't going to be more than a couple tens of thousands (probably, unless it's an agglutinative language), all represented by references. So I don't see this as being too big for a copy and sort-in-place. I agree, though, that my answer is most appropriate for large *n* and inappropriate if you're e.g. doing this for small *n* on an embedded device. That doesn't seem like the most likely platform, though. – creichen Nov 26 '13 at 13:36

You can perform a single pass over the final HashMap<String, Integer> of word counts, maintaining a TreeMap<Integer, List<String>> and using the TreeMap.firstKey() and TreeMap.size() methods to calculate the top N words. Exercise left to the reader. Alternatively (or better), use a red-black tree like this one, continually pruning the lowest-count node as you find larger counts (once the size of the tree exceeds N).
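A hedged sketch of that single pass, substituting a TreeSet for the List so that ties within a count bucket come out alphabetically (names are illustrative):

```java
import java.util.*;

public class TopNTree {
    // Buckets keyed by count in a TreeMap; whenever more than n words are
    // kept, prune one word from the lowest-count bucket.
    static List<String> topN(Map<String, Integer> counts, int n) {
        TreeMap<Integer, TreeSet<String>> buckets = new TreeMap<>();
        int kept = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            TreeSet<String> bucket = buckets.get(e.getValue());
            if (bucket == null) {
                bucket = new TreeSet<>();
                buckets.put(e.getValue(), bucket);
            }
            bucket.add(e.getKey());
            if (++kept > n) {
                // Drop the alphabetically last word from the lowest-count
                // bucket, since alphabetically earlier words win ties.
                TreeSet<String> lowest = buckets.get(buckets.firstKey());
                lowest.pollLast();
                if (lowest.isEmpty()) buckets.remove(buckets.firstKey());
                kept--;
            }
        }
        // Emit from highest count down; each bucket iterates alphabetically.
        List<String> result = new ArrayList<>(n);
        for (TreeSet<String> bucket : buckets.descendingMap().values())
            result.addAll(bucket);
        return result;
    }
}
```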

brettw