
In addition to this quite old post, I need something that will use primitives and give a speedup for an application that contains lots of HashSets of Integers:

Set<Integer> set = new HashSet<Integer>();

So people mention libraries like Guava, Javolution and Trove, but there is no thorough comparison of them in terms of benchmarks and performance results, or at least a good answer based on real experience. From what I see, many recommend Trove's TIntHashSet, but others say it is not that good; some say Guava is supercool and manageable, but I do not need beauty and maintainability, only execution time, so Python-style Guava goes home :) Javolution? I've visited the website; it seems too old to me and thus wacky.

The library should provide the best achievable time, memory does not matter.

Looking at "Thinking in Java", there is an idea of creating a custom HashMap with int[] as keys. So I would like to see something similar with a HashSet, or simply download and use an amazing library.

EDIT (in response to the comments below): In my project I start with about 50 HashSet<Integer> collections, then I call a function about 1000 times that internally creates up to 10 HashSet<Integer> collections. If I change the initial parameters, these numbers may grow exponentially. I only use the add(), contains() and clear() methods on those collections, which is why they were chosen.

Now I'm looking for a library that implements a HashSet or something similar, but faster, by avoiding the Integer autoboxing overhead and maybe something else I do not know about. In fact my data comes in as ints, and I store them in those HashSets.
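
For illustration, a rough sketch of what such a primitive int set could look like: open addressing with linear probing, supporting only the add(), contains() and clear() operations used here. The class name and details are made up for this sketch, not taken from any book or library, and it is deliberately untuned.

    // Sketch of a primitive int hash set with only add()/contains()/clear().
    // Open addressing with linear probing; 0 marks an empty slot, so the value 0
    // itself is tracked by a separate flag. Illustrative only.
    final class IntOnlyHashSet {
        private int[] table = new int[16];   // length kept a power of two
        private boolean containsZero;
        private int size;                    // number of non-zero elements stored

        boolean add(int value) {
            if (value == 0) {
                boolean added = !containsZero;
                containsZero = true;
                return added;
            }
            if (size >= table.length * 3 / 4) {        // keep the table at most 3/4 full
                int[] bigger = new int[table.length * 2];
                for (int v : table) if (v != 0) insert(bigger, v);
                table = bigger;
            }
            if (insert(table, value)) { size++; return true; }
            return false;
        }

        boolean contains(int value) {
            if (value == 0) return containsZero;
            int mask = table.length - 1;
            int i = spread(value) & mask;
            while (table[i] != 0) {
                if (table[i] == value) return true;
                i = (i + 1) & mask;                    // linear probing
            }
            return false;
        }

        void clear() {
            java.util.Arrays.fill(table, 0);
            containsZero = false;
            size = 0;
        }

        private static boolean insert(int[] t, int value) {
            int mask = t.length - 1;
            int i = spread(value) & mask;
            while (t[i] != 0) {
                if (t[i] == value) return false;       // already present
                i = (i + 1) & mask;
            }
            t[i] = value;
            return true;
        }

        private static int spread(int h) {             // mix the bits so nearby ints don't cluster too badly
            h ^= (h >>> 16);
            return h * 0x45d9f3b;
        }
    }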

Sophie Sperner
  • If memory truly doesn't matter at all, I'd suggest a BitSet (with an offset, if you need to handle negative numbers). – Ted Hopp Aug 06 '12 at 21:59
  • You haven't given enough information. What is your use case that you have to squeeze out the last 1% improvement? Are the integers evenly distributed in `[Integer.MIN_VALUE, Integer.MAX_VALUE]`? Performance will depend on distribution, the hash function, and memory allocated. What do you mean by "lots of HashSets"... Do the "lots of HashSets" participate in their own higher-level data structure? Please flesh out the question. – Jim Garrison Aug 06 '12 at 22:03
  • We don't know enough about your project to make this call for you. Most projects can use `Set` without any issues. I would suggest you try out the libraries you mentioned and measure your performance. – Jeffrey Aug 06 '12 at 22:04
  • I'd suggest NOT using a bit set if you're mapping a huge range. As @JimGarrison stated, [`Integer.MIN_VALUE`, `Integer.MAX_VALUE`] would occupy (according to [WolframAlpha](http://www.wolframalpha.com/input/?i=2%5E32+bits+in+mib)) 512 MiB using a bit set. – obataku Aug 06 '12 at 22:06
  • If you're looking into picking a particular underlying implementation of the hash table for the integers, know that, since the elements are small and can fit into cache lines, you could look into _open-addressing_ as opposed to separate chaining. This is probably [premature optimization](http://c2.com/cgi/wiki?PrematureOptimization) at this stage, however. – obataku Aug 06 '12 at 22:13
  • The entire question seems like premature optimization. – Jim Garrison Aug 06 '12 at 22:15
  • to Jim: the integers are from the range [0, Integer.MAX_VALUE]; I have nothing to say about the distribution since I take the data as it is, without probability analysis. The integers are themselves the keys in the HashSets, so I do not need a hash function... if I understand correctly. I tried to explain the "lots of HashSets" in the EDIT. – Sophie Sperner Aug 06 '12 at 22:17
  • Can't you just write your own HashSet? It's not hard. – Ryan Amos Aug 06 '12 at 22:20
  • Well, I guess it is not, but if it has already been written, then why not use the working code? That is what reusing code is about, isn't it? :) – Sophie Sperner Aug 06 '12 at 22:21
  • @Jim: "Do the 'lots of HashSets' participate in their own higher-level data structure?" No, they are plain `HashSet` collections; only once did I use an array of `HashSet`s. – Sophie Sperner Aug 06 '12 at 22:23
  • You're going to spend more time looking for a premade HashSet than writing your own. In fact, I think in the 27 minutes since you made this post, you could easily already have one. Edit: I can write one for you if you don't know how. – Ryan Amos Aug 06 '12 at 22:24
  • Another question: which operations are invoked more frequently - add() or contains()? Also, what is the lifecycle of a set - is it populated and then ONLY tested for contains() (no new adds)? – jkff Aug 06 '12 at 22:25
  • Btw, if you have an exponential component in the algorithm, you'll probably benefit A LOT more from algorithmic optimizations rather than from a fast HashSet. – jkff Aug 06 '12 at 22:33
  • contains() happens more often, because elements are added based on some program-specific conditions; quite often contains() is checked in every loop iteration; the lifecycle is dynamic, integers are added on the fly and contains() executes each time on a different set of integers. – Sophie Sperner Aug 06 '12 at 22:34
  • Yes, algorithmic optimizations have been done, now the time for data structure improvements. – Sophie Sperner Aug 06 '12 at 22:35
  • Yes, Ryan, I do not yet know how to write an efficient custom HashSet. – Sophie Sperner Aug 06 '12 at 22:42
  • It should be reasonably easy to try each implementation in your code and benchmark them against each other. If you do that, you could post the results here as an answer, and accept it. It would be very useful for people who come after you and ask the same question. – Tom Anderson Aug 06 '12 at 23:25
  • @SophieSperner - well, then static data structures (perfect hashing, sorted array etc.) are not an option... Then just try a few of Trove, fastutil, Colt etc. and see which one is faster. Don't forget to tune the HashSet constructor parameters. – jkff Aug 06 '12 at 23:53
  • Three additional questions: (1) How many entries are in each hash set (order of magnitude), (2) How many times do you expect to call contains() on a hash set (again, order of magnitude), and (3) What is the relative frequency of contains() vs add() – Jim Garrison Aug 07 '12 at 00:07
  • I solved the problem by using the HPPC library. Quite a nice one, with good support I would say. – Sophie Sperner Aug 07 '12 at 17:12

3 Answers


Trove is an excellent choice.

The reason why it is much faster than generic collections is memory use.

A java.util.HashSet<Integer> uses a java.util.HashMap<Integer, Integer> internally. In a HashMap, each object is contained in an Entry<Integer, Integer>. These objects take an estimated 24 bytes for the Entry + 16 bytes for the actual Integer + 4 bytes in the actual hash table. This yields 44 bytes, as opposed to 4 bytes in Trove: up to an 11x memory overhead (note that unoccupied entries in the main table will yield a smaller difference in practice).

See also these experiments:

http://www.takipiblog.com/2014/01/23/java-scala-guava-and-trove-collections-how-much-can-they-hold/
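
For illustration, a minimal usage sketch; the package and class names assume Trove 3, where the set lives in gnu.trove.set.hash:

    import gnu.trove.set.hash.TIntHashSet;

    public class TroveIntSetDemo {
        public static void main(String[] args) {
            // Stores primitive ints directly: no Integer boxing, no per-element Entry objects.
            TIntHashSet set = new TIntHashSet();  // can also pass an expected size and load factor

            set.add(42);
            set.add(7);

            System.out.println(set.contains(42)); // true
            System.out.println(set.contains(3));  // false

            set.clear();                          // empties the set for reuse
            System.out.println(set.size());       // 0
        }
    }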

Has QUIT--Anony-Mousse

Take a look at the High Performance Primitive Collections for Java (HPPC). It is an alternative to Trove, mature and carefully designed for efficiency. See the JavaDoc for IntOpenHashSet.
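
A minimal usage sketch for comparison; the class name IntOpenHashSet matches HPPC releases from around this time (later HPPC versions rename it to IntHashSet):

    import com.carrotsearch.hppc.IntOpenHashSet;

    public class HppcIntSetDemo {
        public static void main(String[] args) {
            // Open-addressing set backed by a plain int[] buffer.
            IntOpenHashSet set = new IntOpenHashSet();

            set.add(42);
            set.add(7);

            System.out.println(set.contains(42)); // true
            set.clear();                          // clears the contents, keeps the allocated buffer
            System.out.println(set.size());       // 0
        }
    }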

cruftex

Have you tried working with the initial capacity and load factor parameters while creating your HashSet?

HashSet doc

Initial capacity, as you might think, refers to how big the hash table will be when the empty set is created, and the load factor is a threshold that determines when to grow the hash table. Normally you want to keep the ratio between used buckets and total buckets below two thirds, which is regarded as the best ratio to achieve good, stable performance in a hash table.

Dynamic resizing of a hash table

So basically, try to set an initial capacity that fits your needs (to avoid re-creating the table and rehashing the values when it grows), and fiddle with the load factor until you find a sweet spot.

It might be that for your particular data distribution and pattern of setting/getting values, a lower load factor could help (a higher one hardly will, but your mileage may vary).
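
As a small sketch of the sizing rule discussed here and in the comments below: pick a capacity so that n elements never push the table past its load factor, and no rehashing happens while filling it. The element count is illustrative, not taken from the question.

    import java.util.HashSet;
    import java.util.Set;

    public class PresizedHashSetDemo {
        public static void main(String[] args) {
            int expectedElements = 1000;     // illustrative number
            float loadFactor = 0.75f;        // HashSet's default

            // Resizing is triggered once size exceeds capacity * loadFactor,
            // so pick a capacity with capacity * loadFactor >= expectedElements.
            int initialCapacity = (int) Math.ceil(expectedElements / loadFactor) + 1;

            Set<Integer> set = new HashSet<Integer>(initialCapacity, loadFactor);
            for (int i = 0; i < expectedElements; i++) {
                set.add(i);                  // never causes a rehash of the table
            }
            System.out.println(set.size());  // 1000
        }
    }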

Acapulco
  • Yes, I did try, but unfortunately when I know the size of a `HashSet` and tell it `new HashSet(n / 2)` it gives worse performance than without that capacity... Very strange, I guessed that it creates big arrays and spends a lot of time when doing lookup (during contains possibly), well, at least Bruce Eckel in Thinking in Java said something like that. Thank you for the answer anyway, I kind of accept it :) – Sophie Sperner Aug 07 '12 at 17:14
  • Why would you do n / 2? If you know you are going to have exactly n elements, then use n. In an optimal case, each bucket will have exactly one element, but in case there happens to be a collision (unlikely if your data set is not very large) performance will not suffer much as long as you don't recalculate and reassign the hashtable (i.e. insert more than n elements, in your case, n/2). As per the hashtable doc (http://docs.oracle.com/javase/6/docs/api/java/util/Hashtable.html) – Acapulco Aug 07 '12 at 22:12
  • " "The initial capacity controls a tradeoff between wasted space and the need for rehash operations, which are time-consuming. No rehash operations will ever occur if the initial capacity is greater than the maximum number of entries the Hashtable will contain divided by its load factor. However, setting the initial capacity too high can waste space." So, if you are only holding ints, then the wasted space problem shouldn't be an issue. Try setting it to at least n, or even better, try n+1: – Acapulco Aug 07 '12 at 22:16
  • From the doc again: "If many entries are to be made into a Hashtable, creating it with a sufficiently large capacity may allow the entries to be inserted more efficiently than letting it perform automatic rehashing as needed to grow the table" – Acapulco Aug 07 '12 at 22:17
  • More than n: try actually setting capacity = x where x = 4/3 n (i.e. n * 1.3333), such that if the load factor is the default one (0.75) you would only need to rehash the table if you hit 3/4 x, which is n, so no rehashing will occur (and to account for rounding errors, try x+1). – Acapulco Aug 07 '12 at 22:22
  • Thank you for the effort given, I understand now! – Sophie Sperner Aug 08 '12 at 09:21