9

I've encountered a strange problem in production today. Though I love Guava, I ran into a use case where Guava's Sets.intersection() was performing pretty badly. I've written a sample code:

Set<Long> cache = new HashSet<>();
for (long i = 0; i < 1000000; i++) {
    cache.add(i);
}
Set<Long> keys = new HashSet<>();
for (long i = 0; i < 100; i++) {
    keys.add(i);
}
long start = System.currentTimeMillis();
Set<Long> foundKeys = new HashSet<>();
for (Long key : keys) {
    if (cache.contains(key)) {
        foundKeys.add(key);
    }
}
System.out.println("Java search: " + (System.currentTimeMillis() - start));
start = System.currentTimeMillis();
SetView<Long> intersection = Sets.intersection(keys, cache);
System.out.println("Guava search: " + (System.currentTimeMillis() - start));

I've tried to create a similar production scenario where I've a keys cache, and I'm looking for all the keys present in the cache. Strangely, Guava search is taking a lot longer than Java search. After running this I got:

Java search: 0
Guava search: 36

Can anyone tell why this isn't suited for my use case or is there a bug in Guava?

Edd
  • 3,724
  • 3
  • 26
  • 33
Heisenberg
  • 5,514
  • 2
  • 32
  • 43
  • 1
    please take a look at http://stackoverflow.com/questions/504103/how-do-i-write-a-correct-micro-benchmark-in-java – Sezin Karli May 21 '15 at 12:31
  • Yes, the Guava implementation happens to be asymmetric: if the first set is a lot bigger than the second one, it is a lot slower. Try switching the two sets. – biziclop May 21 '15 at 12:38
  • Yes, my first set is already much smaller. – Heisenberg May 21 '15 at 12:39
  • @Mr.White In that case see the first comment :) – biziclop May 21 '15 at 12:40
  • Yes, that helped. I wonder why I'm getting it in production on regular basis then. May be there is something else going on. I'll check further and post. – Heisenberg May 21 '15 at 12:43
  • 3
    @Mr.White What could happen in production that we can't see here is that `size()` is being called often on the set. Since the intersection set returned is just a live view of the two underlying sets, `size()` is recalculated every time it is called. For similar reasons the two original sets aren't garbage collected while the view is live, which can lead to significant slow-down if your system is under heavy load. But these are just guesses really, you should attach a profiler to your running instance of the full application to gather some real-life data. – biziclop May 21 '15 at 12:51
  • That was pretty brilliant. Thanks for your help. size() was causing all the trouble. Even though the size was pretty small. – Heisenberg May 21 '15 at 12:59
  • @biziclop Would you mind explaining size performance in a bit detail as an answer. I wanted to accept this as an answer. – Heisenberg May 21 '15 at 13:03

1 Answers1

9

It turns out the problem was multiple calls to SetView.size(). As SetView is a (live) view of the intersection of the two sets, the intersection size needs to be recalculated every time.

public static <E> SetView<E> intersection( final Set<E> set1, final Set<?> set2) {
//...
  return new SetView<E>() {
    @Override public Iterator<E> iterator() {
      return Iterators.filter(set1.iterator(), inSet2);
    }
    @Override public int size() {
      return Iterators.size(iterator());
    }
    //...
  };
}

As can be seen here, what recalculation means in this case is iterating across the entire view, which can be quite time-consuming.

The way to get around this is therefore either to make sure size() is only called once and the value is stored (if you know the underlying sets won't change), or if that isn't possible, create a copy of the intersection via ImmutableSet.copyOf() (for example).

biziclop
  • 48,926
  • 12
  • 77
  • 104