A problem I often encounter in Java (usually while writing computational linguistics code) is the need to count the number of occurrences of some items in a dataset, then sort the items by their counts. The simplest concrete example is word counting: I need to count the number of occurrences of each word in a text file, then sort the words by their counts to find the most frequently used words.
Unfortunately, Java doesn't seem to have a good data structure for this task. I need to use the words as indices of a collection while I'm counting, so that I can efficiently look up the right counter to increment every time I read a word, but the values I want to sort on are the counts, not the words.
Map<String, Integer>
provides the interface I need for looking up the count associated with a word, but Maps can only be sorted by their keys (i.e. TreeMap
). PriorityQueue
is a nice heap implementation that will sort on whatever comparator you give it, but it provides no way to access the elements by some kind of index and no way to update and re-heapify an element (other than by removing and adding it). Its single type parameter also means I need to stick the words and their counts together into one object in order to use it.
My current "solution" is to store the counts in a Map while counting them, then copy them all into a PriorityQueue
to sort them:
Map<String, Integer> wordCounts = countStuff();
PriorityQueue<NamedCount> sortedCounts = new PriorityQueue<>(wordCounts.size(),
Collections.reverseOrder());
for(Entry<String, Integer> count : wordCounts.entrySet()) {
sortedCounts.add(new NamedCount(count.getKey(), count.getValue()));
}
(Note that NamedCount
is just a simple pair<string, int>
that implements Comparable
to compare the integers). But this is inefficient, especially since the data set can be very large, and keeping two copies of the count set in memory is wasteful.
Is there any way I can get random access to the objects inside the PriorityQueue
, so that I can just store one copy of the counts in the PriorityQueue and re-heapify as I update them? Would it make sense to use a Map<String, NamedCount>
that keeps "pointers" to the objects in the PriorityQueue<NamedCount>
?