2

I've been reading up on complexity of HashSet vs TreeSet, and everywhere I find topics explaining: "HashSet is faster because it's O(1) instead of O(log n)." Of course, I know this.

However, that only holds if you're working with very large sets. I, on the other hand, need to work with millions of "small" sets, each containing at most 200 objects, and most far less (below 20). The operations on them are very diverse (creating, adding, removing, membership testing, cloning, ...) and hence I'm puzzled on how to best measure/simulate the difference.

Which of both classes will have the least overhead for such small set sizes? Both in terms of speed and of memory overhead. And what about LinkedHashSet?

user1111929
  • 6,050
  • 9
  • 43
  • 73
  • see this http://stackoverflow.com/questions/1463284/hashset-vs-treeset – Lital Kolog Jul 23 '15 at 05:27
  • I have read that topic completely before posting this, but I found no answer to my question there. – user1111929 Jul 23 '15 at 05:27
  • I think the more relevant question is, do you need the data in the set or be sorted, or to have a predictable iteration order? If not, just prefer using `HashSet`. It may not be significantly faster for your data size, but it also won't be any slower than the other options. – aroth Jul 23 '15 at 05:28
  • 1
    I think like @aroth, for small collections performance doesn't really play a factor. It is a matter of your use case and your convenience. What will be easier for you to maintain. – Lital Kolog Jul 23 '15 at 05:33
  • It definitely matters in terms of, for example, memory consumption. One million times 100 bytes more overhead is 100MB more memory consumption. – user1111929 Jul 23 '15 at 05:38
  • 1
    I'd expect memory consumption to follow the same general pattern. `HashSet` will use the least because it has the smallest number of ancillary datastructures (`LinkedHashSet` will internally have all the same things as `HashSet` and then also a doubly-linked list for tracking insertion order; `TreeSet` will also have a tree). Thus I'd think the general advice of 'prefer `HashSet` unless you need one of the _specific features_ provided by a different `Set` implementation' to be valid for both performance and memory usage. Though it's easy to benchmark memory usage if you're unsure. – aroth Jul 23 '15 at 05:43

2 Answers2

1

and hence I'm puzzled on how to best measure/simulate the difference.

Use a profiler. If the Set operations don't dominate the results (CPU time, memory footprint, allocation rates) then your choice will not make a difference in practice due to amdahl's law..

The biggest advantage of the TreeSet is ordering.

And neither implementation is particularly memory-efficient, there are better Sets out there, depending on which performance measure you care most about. They are wrappers around the corresponding Map implementations and the Maps themselves are not particularly efficient either.

They're more designed for flexibility, providing the vast set of APIs they do than optimizing any particular performance aspect.

the8472
  • 40,999
  • 5
  • 70
  • 122
1

There is no clear cut answer to this question because it all depends. So this will just be a scatter shot of comments.

  1. Hashset is much faster then a tree set. Even for small sets.
  2. If computing your hash and equals is very expensive consider wrapping your items in a class that only uses the specific identifying information for your use case. If items are immutable and reused often between hashsets consider caching the hash.
  3. Use a profiler to determine which solution works best on real data.
M.P. Korstanje
  • 10,426
  • 3
  • 36
  • 58