128

I am trying to optimize a piece of code which compares elements of list.

Eg.

public void compare(Set<Record> firstSet, Set<Record> secondSet){
    for(Record firstRecord : firstSet){
        for(Record secondRecord : secondSet){
            // comparing logic
        }
    }
}

Please take into account that the number of records in sets will be high.

Thanks

Shekhar

Mridang Agarwalla
  • 43,201
  • 71
  • 221
  • 382
Shekhar
  • 5,771
  • 10
  • 42
  • 48
  • 9
    It is not possible to optimize the loops without knowing (and modifying) the comparing logic. Could you show more of your code? – josefx Jul 27 '10 at 06:46

9 Answers9

194
firstSet.equals(secondSet)

It really depends on what you want to do in the comparison logic... ie what happens if you find an element in one set not in the other? Your method has a void return type so I assume you'll do the necessary work in this method.

More fine-grained control if you need it:

if (!firstSet.containsAll(secondSet)) {
  // do something if needs be
}
if (!secondSet.containsAll(firstSet)) {
  // do something if needs be
}

If you need to get the elements that are in one set and not the other.
EDIT: set.removeAll(otherSet) returns a boolean, not a set. To use removeAll(), you'll have to copy the set then use it.

Set one = new HashSet<>(firstSet);
Set two = new HashSet<>(secondSet);
one.removeAll(secondSet);
two.removeAll(firstSet);

If the contents of one and two are both empty, then you know that the two sets were equal. If not, then you've got the elements that made the sets unequal.

You mentioned that the number of records might be high. If the underlying implementation is a HashSet then the fetching of each record is done in O(1) time, so you can't really get much better than that. TreeSet is O(log n).

Line
  • 1,529
  • 3
  • 18
  • 42
Noel M
  • 15,812
  • 8
  • 39
  • 47
  • 3
    The implementation of equals() and hashcode() for the Record class is equally important, when invoking equals() on the Set. – Vineet Reynolds Jul 27 '10 at 06:44
  • 1
    I'm not sure that the the removeAll() examples are correct. removeAll() returns a boolean, not another Set. The elements in secondSet are actually removed from firstSet and true is returned if a change has been made. – Richard Corfield Oct 21 '12 at 20:30
  • 4
    The removeAll example still isn't right because you haven't made copies (Set one = firstSet; Set two = secondSet). I'd use the copy constructor. – Michael Rusch Jun 07 '13 at 14:58
  • 1
    Actually, the default implementation of `equals` is faster than two calls to `containsAll` in the worst case; see my answer. – Stephen C Sep 13 '13 at 09:55
  • 6
    You need to do Set one = new HashSet(firstSet), otherwise the items from firstSet and secondSet will get removed. – Bonton255 Jun 20 '17 at 02:11
74

If you simply want to know if the sets are equal, the equals method on AbstractSet is implemented roughly as below:

    public boolean equals(Object o) {
        if (o == this)
            return true;
        if (!(o instanceof Set))
            return false;
        Collection c = (Collection) o;
        if (c.size() != size())
            return false;
        return containsAll(c);
    }

Note how it optimizes the common cases where:

  • the two objects are the same
  • the other object is not a set at all, and
  • the two sets' sizes are different.

After that, containsAll(...) will return false as soon as it finds an element in the other set that is not also in this set. But if all elements are present in both sets, it will need to test all of them.

The worst case performance therefore occurs when the two sets are equal but not the same objects. That cost is typically O(N) or O(NlogN) depending on the implementation of this.containsAll(c).

And you get close-to-worst case performance if the sets are large and only differ in a tiny percentage of the elements.


UPDATE

If you are willing to invest time in a custom set implementation, there is an approach that can improve the "almost the same" case.

The idea is that you need to pre-calculate and cache a hash for the entire set so that you could get the set's current hashcode value in O(1). Then you can compare the hashcode for the two sets as an acceleration.

How could you implement a hashcode like that? Well if the set hashcode was:

  • zero for an empty set, and
  • the XOR of all of the element hashcodes for a non-empty set,

then you could cheaply update the set's cached hashcode each time you added or removed an element. In both cases, you simply XOR the element's hashcode with the current set hashcode.

Of course, this assumes that element hashcodes are stable while the elements are members of sets. It also assumes that the element classes hashcode function gives a good spread. That is because when the two set hashcodes are the same you still have to fall back to the O(N) comparison of all elements.


You could take this idea a bit further ... at least in theory.

WARNING - This is highly speculative. A "thought experiment" if you like.

Suppose that your set element class has a method to return a crypto checksums for the element. Now implement the set's checksums by XORing the checksums returned for the elements.

What does this buy us?

Well, if we assume that nothing underhand is going on, the probability that any two unequal set elements have the same N-bit checksums is 2-N. And the probability 2 unequal sets have the same N-bit checksums is also 2-N. So my idea is that you can implement equals as:

    public boolean equals(Object o) {
        if (o == this)
            return true;
        if (!(o instanceof Set))
            return false;
        Collection c = (Collection) o;
        if (c.size() != size())
            return false;
        return checksums.equals(c.checksums);
    }

Under the assumptions above, this will only give you the wrong answer once in 2-N time. If you make N large enough (e.g. 512 bits) the probability of a wrong answer becomes negligible (e.g. roughly 10-150).

The downside is that computing the crypto checksums for elements is very expensive, especially as the number of bits increases. So you really need an effective mechanism for memoizing the checksums. And that could be problematic.

And the other downside is that a non-zero probability of error may be unacceptable no matter how small the probability is. (But if that is the case ... how do you deal with the case where a cosmic ray flips a critical bit? Or if it simultaneously flips the same bit in two instances of a redundant system?)

Stephen C
  • 698,415
  • 94
  • 811
  • 1,216
  • It should be if (checksumsDoNotMatch(0)) return false; else return doHeavyComparisonToMakeSureTheSetsReallyMatch(o); – Esko Piirainen Oct 17 '19 at 07:58
  • Not necessarily. If the probability of two checksums matching for non-equal sets, is small enough I posit that you can skip the comparison. Do the math. – Stephen C Oct 17 '19 at 08:17
  • For the implementation, is there a specific reason that instanceof is used instead of getClass() and ==? – Cloud Walker Nov 01 '21 at 23:37
  • 1
    Well `instanceof Set` means any `Set`. But `getClass() == ...` would test for a specific `Set` implementation class. The semantics are different. – Stephen C Nov 02 '21 at 00:28
  • The javadoc for `Set.equals` says: *"Returns true if the specified object is also a set, the two sets have the same size, and every member of the specified set is contained in this set (or equivalently, every member of this set is contained in the specified set). **This definition ensures that the equals method works properly across different implementations of the set interface**."* So `instanceof` must be used to implement the specified behavior. – Stephen C Nov 02 '21 at 00:33
19

There is a method in Guava Sets which can help here:

public static <E>  boolean equals(Set<? extends E> set1, Set<? extends E> set2){
return Sets.symmetricDifference(set1,set2).isEmpty();
}
husayt
  • 14,553
  • 8
  • 53
  • 81
8

There's an O(N) solution for very specific cases where:

  • the sets are both sorted
  • both sorted in the same order

The following code assumes that both sets are based on the records comparable. A similar method could be based on on a Comparator.

    public class SortedSetComparitor <Foo extends Comparable<Foo>> 
            implements Comparator<SortedSet<Foo>> {

        @Override
        public int compare( SortedSet<Foo> arg0, SortedSet<Foo> arg1 ) {
            Iterator<Foo> otherRecords = arg1.iterator();
            for (Foo thisRecord : arg0) {
                // Shorter sets sort first.
                if (!otherRecords.hasNext()) return 1;
                int comparison = thisRecord.compareTo(otherRecords.next());
                if (comparison != 0) return comparison;
            }
            // Shorter sets sort first
            if (otherRecords.hasNext()) return -1;
            else return 0;
        }
    }
Philip Couling
  • 13,581
  • 5
  • 53
  • 85
6

You have the following solution from https://www.mkyong.com/java/java-how-to-compare-two-sets/

public static boolean equals(Set<?> set1, Set<?> set2){

    if(set1 == null || set2 ==null){
        return false;
    }

    if(set1.size() != set2.size()){
        return false;
    }

    return set1.containsAll(set2);
}

Or if you prefer to use a single return statement:

public static boolean equals(Set<?> set1, Set<?> set2){

  return set1 != null 
    && set2 != null 
    && set1.size() == set2.size() 
    && set1.containsAll(set2);
}
Ignasi
  • 5,887
  • 7
  • 45
  • 81
  • 2
    Or maybe simply use the `equals()` method from `AbstractSet` (shipped with JDK) which is almost the same as the solution here except for the additional _null_ checks. [Java-11 Set Interface](https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/util/Set.html#equals(java.lang.Object)) – Chaithu Narayana Jun 10 '20 at 10:44
5

If you are using Guava library it's possible to do:

        SetView<Record> added = Sets.difference(secondSet, firstSet);
        SetView<Record> removed = Sets.difference(firstSet, secondSet);

And then make a conclusion based on these.

riwnodennyk
  • 8,140
  • 4
  • 35
  • 37
3

I would put the secondSet in a HashMap before the comparison. This way you will reduce the second list's search time to n(1). Like this:

HashMap<Integer,Record> hm = new HashMap<Integer,Record>(secondSet.size());
int i = 0;
for(Record secondRecord : secondSet){
    hm.put(i,secondRecord);
    i++;
}
for(Record firstRecord : firstSet){
    for(int i=0; i<secondSet.size(); i++){
    //use hm for comparison
    }
}
soufrk
  • 825
  • 1
  • 10
  • 24
2
public boolean equals(Object o) {
        if (o == this)
            return true;
        if (!(o instanceof Set))
            return false;

        Set<String> a = this;
        Set<String> b = o;
        Set<String> thedifference_a_b = new HashSet<String>(a);


        thedifference_a_b.removeAll(b);
        if(thedifference_a_b.isEmpty() == false) return false;

        Set<String> thedifference_b_a = new HashSet<String>(b);
        thedifference_b_a.removeAll(a);

        if(thedifference_b_a.isEmpty() == false) return false;

        return true;
    }
Zahran
  • 419
  • 4
  • 10
0

I think method reference with equals method can be used. We assume that the object type without a shadow of a doubt has its own comparison method. Plain and simple example is here,

Set<String> set = new HashSet<>();
set.addAll(Arrays.asList("leo","bale","hanks"));

Set<String> set2 = new HashSet<>();
set2.addAll(Arrays.asList("hanks","leo","bale"));

Predicate<Set> pred = set::equals;
boolean result = pred.test(set2);
System.out.println(result);   // true