Efficient way of comparing two arraylists in order to retain a delta between them

Question

Here's the problem that I have:

I need to compare two ArrayLists and return if they are the same or if they are different, return the elements that are new from one of them, the pivot so to speak.

Here's the behaviour of the data-set:

the two ArrayLists are made of Strings
they're populated from the same source so are most of the time the same
are ordered (in the sense of the custom logic attached to them)
there will never be any empty Strings
all of the Strings have the same length, always
no duplicates

What I want:

the fastest possible way of achieving my two goals, whichever the case
using only Java 1.6 standard library features, I'd prefer not to implement a hybrid class that emulated something from List and then Set for example.

Example:

A: [ 'a', 'b', 'c', 'd']   

B: [ 'a', 'c', 'd']

Result: Lists are different, return the element 'b'; A will be the 'work' List, we will make comparisons based on what's new in this ArrayList, since B will never change.

Thanks for any replies and your input.

By *ordered* you mean that these list are sorted alphabetically or the elements in each `List` follow some kind of order, so we could say that `list1.get(0).equals(list2.get(0))`? — Luiggi Mendoza, Aug 18 '15 at 21:26
Can the same string be contained multiple times in one list? (Or can they be converted to `Set`s without losing information?). @ The down/close-voters: Give this question a chance. It's not great, but there are worse ones... — Marco13, Aug 18 '15 at 21:29
@Luigi: Ordered means that they when the two arraylists are populated, the insertion order is the same with their 'sorted' order. And it will never change. Whenever a new element is brought in in one of the arrayLists it is always smaller/bigger than the next beside him. — user3046061, Aug 18 '15 at 21:30
So, in other words, you mean that both `List`s are always sorted, right? — Luiggi Mendoza, Aug 18 '15 at 21:33
So, if one list were `[a, b, d, e]` and another was `[a, b, c, e]`, what are you looking to return? I would assume this would be false (not the same), but would it return [c, d], [d, c], or something else? — Jake Griffin, Aug 18 '15 at 21:34
@PM 77 : using a LinkedHashSet, using containsAll and then removeAll to obtain the result that I need. I want to know the fastest possible way to tackle this problem. — user3046061, Aug 18 '15 at 21:34
Also, please post the code you have so far. Stack Overflow is not a code-writing service. — Jake Griffin, Aug 18 '15 at 21:36
If using a LinkedHashSet is possible, then it will likely be the fastest way (asymptotically). A single call to `list.contains(...)` has the same complexity as the whole set operations. — Marco13, Aug 18 '15 at 21:37
@Marco13 yes, but the call to `Set#add` won't necessarily be O(1). — Luiggi Mendoza, Aug 18 '15 at 21:38
Create a copy of one of the lists, use retainAll or removeAll, passing it the second list, this will either give you a list containing all the items matching the two lists or the difference, depending on which method you use (you only need a coy if you want to retain the state of both the original lists). — MadProgrammer, Aug 18 '15 at 21:38
Related: [How do I calculate the delta (inserted/deleted/moved indexes) of two lists?](http://stackoverflow.com/q/30653705/572670). Main difference is here lists seems to be sorted. Other than that - pretty much the same question in different dressing. — amit, Aug 18 '15 at 21:47
@LuiggiMendoza According to the docs, `LinkedHashSet` does an `add` in O(n). If you refer to rehashing, that's a different story. — Marco13, Aug 18 '15 at 22:26

Bill K · Accepted Answer · 2015-08-18T21:51:20.987

5

Your fastest possible requirement bothers me a lot--I'm quite against optimizing--I generally consider early optimization one of the worst programming practices around.

If you really want to do that, just walk the two lists in order.

if the first entries match, you put that one into a "Same" pile and increment both indexes. If they are different, put the first (lower/less-than) one into a "Different" pile and increment that lists index. Loop this way until you hit the end of one list (any remaining in the other obviously go into the "Different" collection.

That should give you "close" to the fastest way. If you want the absolute fastest then you have to start by using arrays, not lists, and then pay a lot of attention to what else you do every step of the way--but the algorithm should still be pretty close to optimal.

As an example of sub-optimal but much more readable you could use some set manipulation.

Set set1=new HashSet(list1)
Set set2=new HashSet(list2)
Set same=set1.retainAll(set2) // I forget if retainAll modifies set1--if so you need to copy it first
set1.removeAll(list2)
set2.removeAll(list1)
Set different=set1.addAll(set2)

// at this point same contains all the similar values and different contains the ones that don't match.  Done.

This is short and readable and probably more performant than you would think. It would be bad practice to code your own if something like this works well enough (Say, in GUI code where speed is unlikely to matter).

edited Aug 18 '15 at 21:51

answered Aug 18 '15 at 21:34

Bill K

62,186
18
105
157

*I'm quite against optimizing* it's way different to give birth a process with optimum algorithms like the one you're prividing here vs using brute-force already-known slow algorithms or (even worse) premature optimizations. Still, upvoting since this answer provides a simple and nice way to achieve it, assuming that these lists are always sorted and than when you're traversing them for comparing the results the data won't change. – Luiggi Mendoza Aug 18 '15 at 21:36
I think I agree @LuiggiMendoza. The way I usually phrase it though is that I don't ever pre-optimize but I consider a lot of code "Wrong", like an insertion sort into an ArrayList or indexed access into a LinkedList. Beyond that bad code I'd always rather see a readable solution than the best. – Bill K Aug 18 '15 at 21:53
The main logic is a bit more complicated than that. But I have already implemented the comparison with HashSets. This whole question is basically me just asking "For this problem that I have, am I missing a much faster way to solve it?"...From all the replies, I see that not. – user3046061 Aug 18 '15 at 21:54
@user3046061 looks like you're looking for a micro benchmark on your current solution and a new one proposed. Still, I would suggest using the `Set` approach by the reasons BillK explains. If you notice that this algorithms is a bottleneck in your app demonstrated by the usage of a profiler, then you have to start looking for improvement, otherwise, you have a golden rule: *if ain't broken, don't fix it* – Luiggi Mendoza Aug 18 '15 at 21:56
@user3046061 Yeah I think walking the lists in parallel is about as fast as you will get and it probably will still be comparable with the hash set solution IF you don't count the time to create the initial 2 hashes or sort the original two lists. Creating hashes/sorting lists is probably the most time-consuming part of this. – Bill K Aug 18 '15 at 21:56

score 2 · Answer 2 · 2015-08-18T21:49:44.597

Pretty simple (assuming the list is ordered ascending, can be easily changed for descending order):

ArrayList<String> delta(ArrayList<String> a , ArrayList<String> b , Comparator<String> comp){
    if(a.isEmpty())
        return new ArrayList(b);
    if(b.isEmpty())
        return new ArrayList(a);

    Iterator<String> it_a = a.iterator();
    Iterator<String> it_b = b.iterator();

    ArrayList<String> delta = new ArrayList<>();

    String a_s = it_a.next() , b_s = it_b.next();
    boolean onechecked = false;

    while(!onechecked){
        int comp_v = comp.compare(a_s , b_s);

        if(comp_v == 0){
            //strings are equal -> ommit them
            if(it_a.hasNext())
                a_s = it_a.next();
            else
                onechecked = true;

            if(it_b.hasNext())
                b_s = it_b.next();
            else
                onechecked = true;
        }else if(comp_v < 0){
            //a_s is not part of b
            delta.add(a_s);
            if(it_a.hasNext())
                a_s = it_a.next();
            else
                onechecked = true;
        }else{
            //b_s is not part of a
            delta.add(b_s);
            if(it_b.hasNext())
                b_s = it_b.next();
            else
                onechecked = true;
        }
    }

    //add remaining items
    delta.add(it_a.hasNext() ? a_s : b_s);

    for(Iterator<String> it = (it_a.hasNext() ? it_a : it_b) ; it.hasNext();)
        delta.add(it.next());

    return delta;
}

Sorry for not adding any explanation, but the code has to speak for itself, since i have no idea how to explain it.

Thanks for the code reply. I'll check it against what I already have to see how it performs. — user3046061, Aug 18 '15 at 21:48
This answer really helped me. I was comparing two lists, each having roughly 40,000 entries, and looking for the deltas. The poster doesn't explain the code, but in essence, you start by looping through both lists at the same time and comparing the first item from each list: if they match, you move to the next item in both lists; if they don't match, then add the missing item to the list of deltas, and advance only the list with the missing item. Once you've reached the end of either list, add the remaining items from the unfinished list. This turned a 5-minute process into sub 10-seconds. — AbetotheMax, Jan 26 '21 at 23:45

Efficient way of comparing two arraylists in order to retain a delta between them

2 Answers2