13

A Google Collections Multiset is a set of elements each of which has a count (i.e. may be present multiple times).

I can't tell you how many times I want to do the following

  1. Make a histogram (exactly Multiset)
  2. Get the top N elements by count from the histogram

Examples: top 10 URLs (by # times mentioned), top 10 tags (by # times applied), ...

What is the canonical way to do #2 given a Google Collections Multiset?

Here is a blog post about it, but that code is not quite what I want. First, it returns everything, not just top N. Second, it copies (is it possible to avoid a copy?). Third, I usually want a deterministic sort, i.e. tiebreak if counts are equal. Other nits: it's not static, etc.

dfrankow
  • 20,191
  • 41
  • 152
  • 214

2 Answers2

4

I wrote methods with the basic functionality you're asking for, except that they perform copies and lack deterministic tie-breaking logic. They're currently internal to Google, but we may open-source them at some point. This Guava issue has the method signatures.

Their algorithm is similar to the blog post: sorting a list of entries. It would be faster, but more complicated, to use a better selection algorithm.

EDIT: since Guava 11, this is implemented

lbalazscs
  • 17,474
  • 7
  • 42
  • 50
Jared Levy
  • 1,986
  • 13
  • 12
3

To give another perspective for people to comment on, I'll post a slightly modified version of the blog post I referenced:

package com.blueshiftlab.twitterstream.summarytools;

import com.google.common.collect.ImmutableList;
import com.google.common.collect.Multiset;
import com.google.common.collect.Ordering;
import com.google.common.collect.Multiset.Entry;

public class Multisets {
    // Don't construct one
    private Multisets() {
    }

    public static <T> ImmutableList<Entry<T>> sortedByCount(Multiset<T> multiset) {
        Ordering<Multiset.Entry<T>> countComp = new Ordering<Multiset.Entry<T>>() {
            public int compare(Multiset.Entry<T> e1, Multiset.Entry<T> e2) {
                return e2.getCount() - e1.getCount();
            }
        };
        return countComp.immutableSortedCopy(multiset.entrySet());
    }

    public static <T> ImmutableList<Entry<T>> topByCount(Multiset<T> multiset,
            int max) {
        ImmutableList<Entry<T>> sortedByCount = sortedByCount(multiset);
        if (sortedByCount.size() > max) {
            sortedByCount = sortedByCount.subList(0, max);
        }

        return sortedByCount;
    }
}
dfrankow
  • 20,191
  • 41
  • 152
  • 214
  • If I understand correctly, this solution will copy and sort the entire collection every time you want to retrieve the top N elements. I'm not sure what your requirements are, but the heap-sort-ish solution beats this in both time and space so I'm not sure what the benefit is. – danben Jun 12 '10 at 19:44
  • You are optimizing for speed, I am looking for fewest # of lines of code written by me. – dfrankow Jun 14 '10 at 13:59
  • I see - that was not clear from your post, especially since you asked about avoiding making a copy. – danben Jun 14 '10 at 14:30
  • careful, your comparator is sorting by count descending – nimcap Jul 02 '10 at 09:27
  • Good point. That is by design, but not called out explicitly. "top N" usually means descending sort. – dfrankow Jul 02 '10 at 13:40
  • @dfrankow: I know, but if I called `sortedByCount()` I would expect it to be the other way around :) – nimcap Aug 12 '10 at 08:44