4

A problem I often encounter in Java (usually while writing computational linguistics code) is the need to count the number of occurrences of some items in a dataset, then sort the items by their counts. The simplest concrete example is word counting: I need to count the number of occurrences of each word in a text file, then sort the words by their counts to find the most frequently used words.

Unfortunately, Java doesn't seem to have a good data structure for this task. I need to use the words as indices of a collection while I'm counting, so that I can efficiently look up the right counter to increment every time I read a word, but the values I want to sort on are the counts, not the words.

Map<String, Integer> provides the interface I need for looking up the count associated with a word, but Maps can only be sorted by their keys (i.e. TreeMap). PriorityQueue is a nice heap implementation that will sort on whatever comparator you give it, but it provides no way to access the elements by some kind of index and no way to update and re-heapify an element (other than by removing and adding it). Its single type parameter also means I need to stick the words and their counts together into one object in order to use it.

My current "solution" is to store the counts in a Map while counting them, then copy them all into a PriorityQueue to sort them:

Map<String, Integer> wordCounts = countStuff();
PriorityQueue<NamedCount> sortedCounts = new PriorityQueue<>(wordCounts.size(),
                                             Collections.reverseOrder());
for(Entry<String, Integer> count : wordCounts.entrySet()) {
    sortedCounts.add(new NamedCount(count.getKey(), count.getValue()));
}

(Note that NamedCount is just a simple pair<string, int> that implements Comparable to compare the integers). But this is inefficient, especially since the data set can be very large, and keeping two copies of the count set in memory is wasteful.

Is there any way I can get random access to the objects inside the PriorityQueue, so that I can just store one copy of the counts in the PriorityQueue and re-heapify as I update them? Would it make sense to use a Map<String, NamedCount> that keeps "pointers" to the objects in the PriorityQueue<NamedCount>?

Edward
  • 5,942
  • 4
  • 38
  • 55
  • 1
    If you use Java 8, there are facilities in `Stream`s to do that – fge Jan 01 '15 at 20:20
  • 1
    Why don't you use NamedCount right away and map `Map`? This way you can getValues() as a Collection and sort that. – laune Jan 01 '15 at 20:20
  • @laune Simple as it is, that sounds like a good solution using only the built-in Java 7 library. I'm inclined to agree with Mark Peters, though, that `Multiset` is a conceptually cleaner design. – Edward Jan 02 '15 at 17:16

2 Answers2

3

First, for the base data structure, typically Guava's Multiset<String> is preferable to Map<String, Integer> in the same way that Set<String> is preferable to Map<String, Boolean>. It's a cleaner API and encapsulates the incrementing.

Now, if this were me, I would implement a custom Multiset which adds some additional logic to index the counts, and return them. Something like this:

class IndexedMultiset<T extends Comparable<T>> extends ForwardingMultiset<T> {

    private final Multiset<T> delegate = HashMultiset.create();
    private final TreeMultimap<Integer, T> countIndex = TreeMultimap.create();

    @Override
    protected Multiset<T> delegate() {
        return delegate;
    }


    @Override
    public int add(T element, int occurrences) {
        int prev = super.add(element, occurrences);
        countIndex.remove(prev, element);
        countIndex.put(count(element), element);
        return prev;
    }

    @Override
    public boolean add(T element) {
        return super.standardAdd(element);
    }

    //similar for remove, setCount, etc


}

Then I'd add whatever query capabilities you need based on counts. For example, retrieving an iterable of word/count pairs in descending order could look something like this:

public Iterable<CountEntry<T>> descendingCounts() {
    return countIndex.keySet().descendingSet().stream()
            .flatMap((count) -> countIndex.get(count).stream())
            .map((element) -> new CountEntry<>(element, count(element)))
            .collect(Collectors.toList());
}

public static class CountEntry<T> {
    private final T element;
    private final int count;

    public CountEntry(T element, int count) {
        this.element = element;
        this.count = count;
    }

    public T element() {
        return element;
    }

    public int count() {
        return count;
    }

    @Override
    public String toString() {
        return element + ": " + count;
    }
}

And it would all be used like this:

public static void main(String... args) {
    IndexedMultiset<String> wordCounts = new IndexedMultiset<>();

    wordCounts.add("foo");
    wordCounts.add("bar");
    wordCounts.add("baz");
    wordCounts.add("baz");

    System.out.println(wordCounts.descendingCounts()); //[baz: 2, bar: 1, foo: 1]


    wordCounts.add("foo");
    wordCounts.add("foo");
    wordCounts.add("foo");

    System.out.println(wordCounts.descendingCounts()); //[foo: 4, baz: 2, bar: 1]
}
Mark Peters
  • 80,126
  • 17
  • 159
  • 190
  • This looks pretty good, but it looks like you're sorting counts using Java 8 streams (right? I've never seen `->` as an operator in Java). Is there a Java 7 way to do it, since Ubuntu still doesn't have JDK 8? – Edward Jan 02 '15 at 17:12
  • @Edward I use JDK 8 on Ubuntu without a hitch. Just download. – laune Jan 02 '15 at 17:22
  • @Edward: All I'm doing is looping through the keys descending and mapping each word at each count to an entry. You could just use a procedural approach. Create a new arraylist, then loop through `countIndex.keySet().descendingSet()` and create and add each element accordingly. – Mark Peters Jan 02 '15 at 22:42
1

If you can use third-party libraries like Guava, Multiset is designed pretty specifically as a solution to this problem:

Multiset<String> multiset = HashMultiset.create();
for (String word : words) {
  multiset.add(word);
}
System.out.println(Multisets.copyHighestCountFirst(multiset));
Louis Wasserman
  • 191,574
  • 25
  • 345
  • 413
  • This code won't compile, line 'multiset.add(words)' needs to be rewritten into multiset.add(word) or better to add all elements when initializing multiset: HashMultiset.create(words); – extraleon Jan 01 '15 at 21:31
  • According to the documentation for `Multisets`, it looks like `copyHighestCountFirst` makes an entire copy of the Multiset in order to sort it by count - which is just as wasteful as copying my Map into a PriorityQueue. – Edward Jan 02 '15 at 17:09
  • A copy of a fairly cheap data structure like a Multiset is going to be cheaper than e.g. a TreeMultimap, even if there is only one. – Louis Wasserman Jan 02 '15 at 23:13