1

I'm using inside an iterative algorithm an HashSet that is dynamically enlarged at each algorithm iteration by adding new objects (via method add). Very frequently I check if a generated object has been already put inside the HashSet by using the contains method. Observe that the HashSet may include several thousand objects.

Here follows a citation from the doc about class HashSet: "This class offers constant time performance for the basic operations (add, remove, contains and size), assuming the hash function disperses the elements properly among the buckets."

Apart from other considerations provided inside the doc (not reported for simplicity), I see that add and contains are executed in constant time. Please, can you suggest another data structure in Java that provides better performance for the "contains" operation with respect to my problem?

Classes from Apache Commons or Guava are also accepted.

mat_boy
  • 12,998
  • 22
  • 72
  • 116
  • Would a *sorted* hash set give better lookup performance, for the price of more expensive adds / deletes? – vikingsteve Sep 06 '13 at 10:49
  • 3
    Constant time is usually considered the best performance characteristic you can get. It means that it takes the same amount of time no matter if the Set contains one element or a million. – Thilo Sep 06 '13 at 10:49
  • Did you notice any specific problem with HashSet performance? – Thilo Sep 06 '13 at 10:52
  • HashMap has operation cost of O(1) for access. That is the best cost you can get. – Antoniossss Sep 06 '13 at 10:52
  • Ok! Thank you for the explanations. However, I see that the implementation of the hash code function is very important. I use the one provided suggested in NetBeans – mat_boy Sep 06 '13 at 10:54
  • This question is not about Big-O performance, but about the constants that are ignored in this notation. – SpaceTrucker Sep 06 '13 at 11:03
  • @Antoniossss I thought `HashSet` was backed by a `HashMap` [anyway](http://docjar.com/html/api/java/util/HashSet.java.html)? – Nicolas Rinaudo Sep 06 '13 at 14:51

3 Answers3

3

The performance of HashSet.contains() will be as good as you can get provided your objects have a properly implemented hashCode() method. That will ensure proper distribution among the buckets.

See Best implementation for hashCode method

Community
  • 1
  • 1
NickJ
  • 9,380
  • 9
  • 51
  • 74
  • So I can improve the performance implementing an optimized hashCode method. Am I right? – mat_boy Sep 06 '13 at 10:54
  • I think my implementation of hashCode already follows your guideline. I use the hashCode function suggested by NetBeans for the primitive types (the only types contained in my object). – mat_boy Sep 06 '13 at 10:58
  • I don't use NetBeans so cannot vouch for the suitability of its suggestions. Presumably your hashCode method uses all the non-static primitives? If your object contained other objects, you could call their own hashCode methods from yours. – NickJ Sep 08 '13 at 10:22
  • Most IDEs including eclipse will generate pretty good (fast) hashCode and equals. – Adam Gent Sep 20 '13 at 12:19
1

As other answers already stated "constant time" is the best runtime-behaviout you can get. If you will get it does depend on your hashcode-implementation, but since you use the NetBeans suggestion you shouldn't be too bad there.

As to how to keep the "constant time" as small as possible:

  • try to allocate your HashSet large enough from the very beginning to avoid costly rehash-operations
  • You can cache your calculated hashcode the first time hashCode() is called and return the cached value later on. There should be no need to add some triggering-mechanism to clear the cache on object-updates, since your relevant fields should be immutable - if they aren't you are bound to run into trouble using HashSet anyway.
piet.t
  • 11,718
  • 21
  • 43
  • 52
-2

You can let your object remember if it has been put in that hashset. Just have a boolean field to store if it was added to the hash set. Then you don't need to call contains on the HashSet but just read the field value of your object. This method will only work if the object is put in exactly one hashset that will check the boolean field.

It might be extended to a constant number of hashsets using java.util.BitSet in the object contained in the hashset where every hashset can be identified by a unique integer when the number of hashsets is known before the algorithm starts.

Since you are saying that you are calling contains frequently, it makes sense to replace newly generated objects with equal existing objects (object pooling), since the overhead of that will amortize by having contains being only a single field read.

As requested here is some sample code. The special set implementation is about 4 times faster than a normal hash set on my machine. However the question is how well this code reflects your use case.

public class FastSetContains {

    public static class SetContainedAwareObject {
        private final int state;
        private boolean contained;

        public SetContainedAwareObject(int state) {
            this.state = state;
        }

        public void markAsContained() {
            contained = true;
        }

        public boolean isContained() {
            return contained;
        }

        public void markAsRemoved() {
            contained = false;
        }

        @Override
        public int hashCode() {
            final int prime = 31;
            int result = 1;
            result = prime * result + state;
            return result;
        }

        @Override
        public boolean equals(Object obj) {
            if (this == obj)
                return true;
            if (obj == null)
                return false;
            if (getClass() != obj.getClass())
                return false;
            SetContainedAwareObject other = (SetContainedAwareObject) obj;
            if (state != other.state)
                return false;
            return true;
        }
    }

    public static class FastContainsSet extends
            HashSet<SetContainedAwareObject> {

        @Override
        public boolean contains(Object o) {
            SetContainedAwareObject obj = (SetContainedAwareObject) o;
            if (obj.isContained()) {
                return true;
            }
            return super.contains(o);
        }

        @Override
        public boolean add(SetContainedAwareObject e) {
            boolean add = super.add(e);
            e.markAsContained();
            return add;
        }

        @Override
        public boolean addAll(Collection<? extends SetContainedAwareObject> c) {
            boolean addAll = super.addAll(c);
            for (SetContainedAwareObject o : c) {
                o.markAsContained();
            }
            return addAll;
        }

        @Override
        public boolean remove(Object o) {
            boolean remove = super.remove(o);
            ((SetContainedAwareObject) o).markAsRemoved();
            return remove;
        }

        @Override
        public boolean removeAll(Collection<?> c) {
            boolean removeAll = super.removeAll(c);
            for (Object o : c) {
                ((SetContainedAwareObject) o).markAsRemoved();
            }
            return removeAll;
        }
    }

    private static final Random random = new Random(1234L);
    private static final int additionalObjectsPerIteration = 10;
    private static final int iterations = 100000;
    private static final int differentObjectCount = 100;
    private static final int containsCountPerIteration = 50;

    private static long nanosSpentForContains;

    public static void main(String[] args) {
        Map<SetContainedAwareObject, SetContainedAwareObject> objectPool = new HashMap<>();

        // switch comment use different Set implementaiton
        //Set<SetContainedAwareObject> set = new FastContainsSet();
        Set<SetContainedAwareObject> set = new HashSet<>();

        //warm up
        for (int i = 0; i < 100; i++) {
            addAdditionalObjects(objectPool, set);
            callSetContainsForSomeObjects(set);
        }
        objectPool.clear();
        set.clear();
        nanosSpentForContains = 0L;

        for (int i = 0; i < iterations; i++) {
            addAdditionalObjects(objectPool, set);
            callSetContainsForSomeObjects(set);
        }
        System.out.println("nanos spent for contains: " + nanosSpentForContains);
    }

    private static void callSetContainsForSomeObjects(
            Set<SetContainedAwareObject> set) {
        int containsCount = set.size() > containsCountPerIteration ? set.size()
                : containsCountPerIteration;
        int[] indexes = new int[containsCount];
        for (int i = 0; i < containsCount; i++) {
            indexes[i] = random.nextInt(set.size());
        }
        Object[] elements = set.toArray();

        long start = System.nanoTime();
        for (int index : indexes) {
            set.contains(elements[index]);
        }
        long end = System.nanoTime();
        nanosSpentForContains += (end - start);
    }

    private static void addAdditionalObjects(
            Map<SetContainedAwareObject, SetContainedAwareObject> objectPool,
            Set<SetContainedAwareObject> set) {
        for (int i = 0; i < additionalObjectsPerIteration; i++) {
            SetContainedAwareObject object = new SetContainedAwareObject(
                    random.nextInt(differentObjectCount));
            SetContainedAwareObject pooled = objectPool.get(object);
            if (pooled == null) {
                objectPool.put(object, object);
                pooled = object;
            }
            set.add(pooled);
        }
    }
}

Anothe Edit:

using the following as the Set.contains implementation makes it about 8 times faster than a normal hashset:

    @Override
    public boolean contains(Object o) {
        SetContainedAwareObject obj = (SetContainedAwareObject) o;
        return obj.isContained();
    }

EDIT: This technique has a bit with the class enhancement of OpenJPA in common. The enhancement of OpenJPA enables a class to track its persistent state which is used by the entity manager. The suggested method enables an object to track if itself is contained in a set which is used by the algorithm.

SpaceTrucker
  • 13,377
  • 6
  • 60
  • 99
  • I can't. The algorithm generate new objects which may be equal to one of the previously generated objects! – mat_boy Sep 06 '13 at 10:56
  • @mat_boy Then replace new objects with those already contained in the HashSet or use object pooling. – SpaceTrucker Sep 06 '13 at 11:00
  • Object pooling [is a deprecated technique](http://programmers.stackexchange.com/questions/115163/is-object-pooling-a-deprecated-technique). – mat_boy Sep 06 '13 at 11:37
  • @mat_boy please read the accepted anser: *deprecated as a **general** technique*, but your case is special – SpaceTrucker Sep 06 '13 at 11:39
  • 1
    That won't work with "equal" objects. Even if you have one instance already in the set (and it knows about that), when you have a second instance that is equal it will not have the flag set. Consider for example reading lines from a file and de-duplicating them. Every line will yield a new String instance. None of them will have your flag set, even if a line of the same content was already added to the Set. – Thilo Sep 06 '13 at 23:16
  • @Thilo That's why I said object pooling should be used. You would first create a new object with the flag not set and then would replace that object with an already existing one which might have the flag set. – SpaceTrucker Sep 07 '13 at 08:47
  • So how do you do object pooling without a Set or Map? The question seems to be about creating an object pool. You cannot answer that by "use an object pool". – Thilo Sep 07 '13 at 08:51
  • @SpaceTrucker Please, why don't you provide an abstract example in your answer to illustrate your idea. – mat_boy Sep 07 '13 at 10:10
  • Hey, you are calling the `contains` method in the for cycle to control if an object is contained, when you know that has been previously marked as contained. To see that this is not useful, in the second for create a new object and then check if it is contained. Obviously this will provide a call to the `super.contains()` method. Hence, I will have no speedup. However, I do agree that object pooling may help (e.g. by using _Apache Commons pool component_) – mat_boy Sep 07 '13 at 11:06
  • In my career as a Java developer I've come across a lot of code which is similar to the code in this answer. It always makes me cringe and want to delete it and start again. It's just so over-complicated for such a simple problem, as if the programmer has lost sight of the final objective and got stuck in an endless cycle of boilerplate code. The suggested answer makes it difficult to ensure that the set-membership property of the object is correct, and precludes the use of POJO's in the collection. – NickJ Sep 08 '13 at 10:18
  • 1
    http://en.wikipedia.org/wiki/Boilerplate_code Also, it's easy to break your code: SetContainedAwareObject a = new SetContainedAwareObject(); a.markAsContained(); //not in set, but marked anyway SetContainedAwareObject b = new SetContainedAwareObject(); FastContainsSet set = new FastContainsSet(); set.add(b); b.markAsRemoved(); //in set, but marked as not in set! – NickJ Sep 08 '13 at 11:09