
There seem to be a lot of different implementations and ways to generate thread-safe Sets in Java. Some examples include

1) CopyOnWriteArraySet

2) Collections.synchronizedSet(Set set)

3) ConcurrentSkipListSet

4) Collections.newSetFromMap(new ConcurrentHashMap())

5) Other Sets generated in a way similar to (4)

These examples come from Concurrency Pattern: Concurrent Set implementations in Java 6

Could someone please simply explain the differences, advantages, and disadvantages of these examples and others? I'm having trouble understanding and keeping everything straight from the Java Std Docs.

zeugor
Ben

4 Answers

  1. The CopyOnWriteArraySet is quite a simple implementation - it basically keeps its elements in an array, and when the set is changed, it copies the array. Iterations and other accesses which are running at that time continue with the old array, avoiding the necessity of synchronization between readers and writers (though writing itself needs to be synchronized). The normally fast set operations (especially contains()) are quite slow here, as the array is searched in linear time.

    Use this only for really small sets which will be read (iterated) often and changed seldom. (Swing's listener-sets would be an example, but these are not really sets, and should be only used from the EDT anyway.)
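
    A tiny sketch of that snapshot behavior (the demo class name and element values are made up for illustration):

    ```java
    import java.util.Iterator;
    import java.util.Set;
    import java.util.concurrent.CopyOnWriteArraySet;

    public class CowSetDemo {
        public static void main(String[] args) {
            Set<String> listeners = new CopyOnWriteArraySet<String>();
            listeners.add("listenerA");
            listeners.add("listenerB");

            // The iterator works on a snapshot of the backing array:
            // concurrent modification never throws
            // ConcurrentModificationException...
            Iterator<String> snapshot = listeners.iterator();
            listeners.add("listenerC"); // safe during iteration

            // ...but the snapshot does not see the element added above.
            int seen = 0;
            while (snapshot.hasNext()) {
                snapshot.next();
                seen++;
            }
            System.out.println(seen);             // 2
            System.out.println(listeners.size()); // 3
        }
    }
    ```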

  2. Collections.synchronizedSet will simply wrap a synchronized-block around each method of the original set. You should not access the original set directly. This means that no two methods of the set can be executed concurrently (one will block until the other finishes) - this is thread-safe, but you will not have concurrency if multiple threads are using the set. If you use the iterator, you usually still need to synchronize externally to avoid ConcurrentModificationExceptions when modifying the set between iterator calls. The performance will be like the performance of the original set (but with some synchronization overhead, and blocking if used concurrently).

    Use this if you only have low concurrency, and want to be sure all changes are immediately visible to the other threads.
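
    For example, the external synchronization needed when iterating such a wrapper (set contents made up for illustration):

    ```java
    import java.util.Collections;
    import java.util.HashSet;
    import java.util.Set;

    public class SynchronizedSetDemo {
        public static void main(String[] args) {
            Set<String> set = Collections.synchronizedSet(new HashSet<String>());
            set.add("a");
            set.add("b");

            // Each individual call (add, contains, ...) is synchronized
            // internally, but iteration spans many calls, so it must be
            // guarded manually - locking on the wrapper itself, as its
            // Javadoc requires:
            synchronized (set) {
                for (String element : set) {
                    System.out.println(element);
                }
            }
        }
    }
    ```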

  3. ConcurrentSkipListSet is the concurrent SortedSet implementation, with most basic operations in O(log n). It allows concurrent adding/removing and reading/iteration, where iteration may or may not tell about changes since the iterator was created. The bulk operations are simply multiple single calls, and not done atomically – other threads may observe only some of them.

    Obviously you can use this only if you have some total order on your elements. This looks like an ideal candidate for high-concurrency situations, for not-too-large sets (because of the O(log n)).
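
    A short sketch (element values made up for illustration):

    ```java
    import java.util.concurrent.ConcurrentSkipListSet;

    public class SkipListSetDemo {
        public static void main(String[] args) {
            // Elements must be Comparable (or a Comparator must be supplied),
            // since the skip list keeps them in sorted order.
            ConcurrentSkipListSet<Integer> set = new ConcurrentSkipListSet<Integer>();
            set.add(3);
            set.add(1);
            set.add(2);

            System.out.println(set.first()); // 1 - smallest element
            System.out.println(set);         // [1, 2, 3] - iteration is sorted
        }
    }
    ```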

  4. For the ConcurrentHashMap (and the Set derived from it): Here most basic operations are (on average, if you have a good and fast hashCode()) in O(1) (but might degenerate to O(n) when many keys have the same hash code), like for HashMap/HashSet. There is limited concurrency for writing (the table is partitioned, and write access will be synchronized on the needed partition), while read access is fully concurrent to itself and the writing threads (but might not yet see the results of the changes currently being written). The iterator may or may not see changes since it was created, and bulk operations are not atomic. Resizing is slow (as for HashMap/HashSet), thus you should try to avoid this by estimating the needed size on creation (and using about 1/3 more than that, as it resizes when 3/4 full).

    Use this when you have large sets, a good (and fast) hash function and can estimate the set size and needed concurrency before creating the map.
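
    A sketch of creating and pre-sizing such a set (the expected size is made up for illustration):

    ```java
    import java.util.Collections;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    public class ConcurrentHashSetDemo {
        public static void main(String[] args) {
            int expectedSize = 10000;
            // Pre-size to expectedSize * 4 / 3 + 1 so the map never hits its
            // 3/4-full resize threshold while growing to the expected size.
            Set<String> set = Collections.newSetFromMap(
                    new ConcurrentHashMap<String, Boolean>(expectedSize * 4 / 3 + 1));

            set.add("a");
            System.out.println(set.contains("a")); // true - O(1) on average
        }
    }
    ```

    (On Java 8 and later, ConcurrentHashMap.newKeySet() produces an equivalent set directly.)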

  5. Are there other concurrent map implementations one could use here?

Paŭlo Ebermann
  • Just a slight correction in 1): the process of copying data into the new array must be locked down by synchronizing. Therefore, CopyOnWriteArraySet doesn't totally avoid the necessity of synchronization. – CaptainHastings Jun 05 '15 at 13:07
  • On `ConcurrentHashMap` based set, "thus try to avoid this by estimating the needed size on creation." The size you give to the map should be over 33% larger than your estimate (or known value), since the set resizes at 75% load. I use `expectedSize + 4 / 3 + 1` – Daren Oct 30 '15 at 12:21
  • @Daren I guess the first `+` is meant to be a `*`? – Paŭlo Ebermann Oct 30 '15 at 21:00
  • @PaŭloEbermann Of course... it should be `expectedSize * 4 / 3 + 1` – Daren Nov 19 '15 at 14:11
  • Shouldn't every concurrent map do? What about the [lock-free hash table](https://www.youtube.com/watch?v=HJ-719EGIts)? – maaartinus Jul 02 '17 at 02:37
  • For `ConcurrentMap` (or `HashMap`) in Java 8, if the number of entries mapping to the same bucket reaches a threshold value (I believe it is 8), then the list gets changed to a binary search tree (a red-black tree, to be precise), and in that case the lookup time would be `O(lg n)` and not `O(n)`. – akhil_mittal Dec 28 '17 at 04:58
  • @i_am_zero The binary search tree is still searching by hash code, right? So if all your objects in the map have the same hash code, they will still need to be looked on individually, no matter whether you store them in a tree or list. `ConcurrentHashMap.TreeNode.findTreeNode` will look through both left and right branch if the hash codes are equal (and the key itself not). – Paŭlo Ebermann Sep 25 '19 at 18:51
  • @PaŭloEbermann So, without looking at the JDK source code, this seems incorrect. Once we start looking up in a bucket (elements for which the hash code has the same value), we look up by equality. So the lookup in the BST would be based on the value of the key, not the hash code of the key. – Ravi Sanwal Sep 29 '19 at 08:41
  • @RaviSanwal One bucket in the hash map holds all entries for which the hash code modulo the map size is the same. As i_am_zero noted, when that bucket gets larger than a threshold, it is converted into a tree, sorted by hash code. It can happen that we have non-equal entries with the same hash code; those will all be put into the same tree bucket (potentially with other entries too). The tree lookup will use the hash code, but after the hash is matched it will also use the equals method. So it is correct, but will still degenerate to linear time if there are many entries with the same hash code. – Paŭlo Ebermann Sep 29 '19 at 18:04
  • May I know, where do you get the information regarding `contains` method is slow ? As, I do not see it is mentioned here : https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/CopyOnWriteArraySet.html#contains-java.lang.Object- – Cheok Yan Cheng Jul 03 '22 at 19:07
  • @CheokYanCheng since the elements are stored in insertion order in an array, the fastest way `contains` can work is by iterating over the array and comparing each element, which is O(n). As the answer by Oleg Estekhin shows, it would be possible to implement it differently (and some implementations might choose to do so), but if that was generally the case, I would expect some text about the hash function mentioned in the class description. – Paŭlo Ebermann Jul 06 '22 at 19:55

It is possible to combine the contains() performance of HashSet with the concurrency-related properties of CopyOnWriteArraySet by using an AtomicReference<Set> and replacing the whole set on each modification.

The implementation sketch:

public abstract class CopyOnWriteSet<E> implements Set<E> {

    private final AtomicReference<Set<E>> ref;

    protected CopyOnWriteSet( Collection<? extends E> c ) {
        ref = new AtomicReference<Set<E>>( new HashSet<E>( c ) );
    }

    @Override
    public boolean contains( Object o ) {
        return ref.get().contains( o );
    }

    @Override
    public boolean add( E e ) {
        while ( true ) {
            Set<E> current = ref.get();
            if ( current.contains( e ) ) {
                return false;
            }
            Set<E> modified = new HashSet<E>( current );
            modified.add( e );
            if ( ref.compareAndSet( current, modified ) ) {
                return true;
            }
        }
    }

    @Override
    public boolean remove( Object o ) {
        while ( true ) {
            Set<E> current = ref.get();
            if ( !current.contains( o ) ) {
                return false;
            }
            Set<E> modified = new HashSet<E>( current );
            modified.remove( o );
            if ( ref.compareAndSet( current, modified ) ) {
                return true;
            }
        }
    }

}
tharindu_DG
Oleg Estekhin
  • Actually `AtomicReference` marks the value volatile. That makes sure no thread reads stale data and provides a `happens-before` guarantee, as the code cannot be reordered by the compiler. But if only the get/set methods of `AtomicReference` are used, then we are actually just marking our variable volatile in a fancy way. – akhil_mittal Dec 28 '17 at 05:05
  • This answer can't be upvoted enough because (1) unless I missed something, it will work for all collection types (2) none of the other classes provide a way to atomically update the entire collection at once... This is very useful. – Gili Aug 06 '18 at 17:44
  • I tried appropriating this verbatim but found it was labeled `abstract`, seemingly to avoid having to write several of the methods. I set about adding them, but ran into a roadblock with `iterator()`. I don't know how to maintain an iterator over this thing without breaking the model. Seems I always have to go through the `ref`, and might get a different underlying set each time, which requires getting a new iterator on the underlying set, which is useless to me, as it will start with item zero. Any insights? – nclark Sep 26 '19 at 14:14
  • Okay, I guess the guarantee is that each consumer gets a fixed snapshot in time, so the underlying collection's iterator would work fine if that's all you need. My use case is to allow competing threads to "claim" individual resources in it, and it won't work if they have different versions of the set. On second thought... I guess my thread just needs to get a new iterator and try again if CopyOnWriteSet.remove(chosen_item) returns false... which it would have to do regardless :) – nclark Sep 26 '19 at 14:33
  • Is `while ( true )` necessary here? – user3908406 Sep 10 '20 at 16:00

If the Javadocs don't help, you probably should just find a book or article to read about data structures. At a glance:

  • CopyOnWriteArraySet makes a new copy of the underlying array every time you mutate the collection, so writes are slow and Iterators are fast and consistent.
  • Collections.synchronizedSet() uses old-school synchronized method calls to make a Set threadsafe. This would be a low-performing version.
  • ConcurrentSkipListSet offers performant writes with inconsistent batch operations (addAll, removeAll, etc.) and Iterators.
  • Collections.newSetFromMap(new ConcurrentHashMap()) has the semantics of ConcurrentHashMap, which I believe isn't necessarily optimized for reads or writes, but like ConcurrentSkipListSet, has inconsistent batch operations.
Ryan Stewart
  • http://www.developer.com/java/article.php/10922_3829891_2/Selecting-the-Best-Java-Collection-Class-for-Your-Application.htm < even better than a book ) – ycomp Oct 18 '15 at 04:51
  • has moved to https://www.developer.com/java/selecting-the-best-java-collection-class-for-your-application/ – Ilya Serbis Apr 28 '21 at 20:01

Concurrent set of weak references

Another twist is a thread-safe set of weak references.

Such a set is handy for tracking subscribers in a pub-sub scenario. When a subscriber goes out of scope elsewhere, and is therefore headed toward becoming a candidate for garbage collection, it need not be bothered with gracefully unsubscribing. The weak reference allows the subscriber to complete its transition to being a candidate for garbage collection. When the garbage is eventually collected, the entry in the set is removed.

While no such set is directly provided with the bundled classes, you can create one with a few calls.

First we start with making a Set of weak references by leveraging the WeakHashMap class. This is shown in the class documentation for Collections.newSetFromMap.

Set<YourClassGoesHere> weakHashSet =
        Collections.newSetFromMap(
                new WeakHashMap<YourClassGoesHere, Boolean>()
        );

The value type of the map, Boolean, is irrelevant here, as the keys of the map make up our Set.

In a scenario such as pub-sub, we need thread-safety if the subscribers and publishers are operating on separate threads (quite likely the case).

Go one step further and make this set thread-safe by wrapping it as a synchronized set, feeding it into a call to Collections.synchronizedSet.

this.subscribers =
        Collections.synchronizedSet(
                Collections.newSetFromMap(
                        new WeakHashMap <>()  // Parameterized types `< YourClassGoesHere , Boolean >` are inferred, no need to specify.
                )
        );

Now we can add and remove subscribers from our resulting Set. And any “disappearing” subscribers will eventually be removed automatically after garbage collection executes. When this execution happens depends on your JVM’s garbage-collector implementation and on the runtime situation at the moment. For discussion and an example of when and how the underlying WeakHashMap clears out expired entries, see the Question *Is WeakHashMap ever-growing, or does it clear out the garbage keys?*.

Basil Bourque