Java performance: Search and removal speed on removeAll()

Question

I had some fun comparing the speed of the removeAll(Collection<?> c) call declared in Collection. Now I know that micro-benchmarks are difficult to do right, and I won’t look at a few milliseconds difference, but I believe my results to be valid, since I ran them repeatedly and they are very reproducible.

Let’s assume I have two collections that are not too tiny, say 100,000 consecutive integer elements, and also that they mostly overlap, for instance 5,000 are in the left but not the right. Now I simply call:

left.removeAll(right);

Of course this all depends on the types of both the left and the right collection. It’s blazingly fast if the right collection is a hash map, because that’s where the look-ups are done. But looking closer, I noticed two results that I cannot explain. I tried all the tests both with an ArrayList that is sorted and with another that is shuffled (using Collections.shuffle(), if that is of importance).

The first weird result is:

00293  025%   shuffled ArrayList, HashSet
00090  008%     sorted ArrayList, HashSet

Now either removing elements from the sorted ArrayList is faster than removing from the shuffled list, or looking up consecutive values from the HashSet is faster that looking up random values.

Now the other one:

02311  011%     sorted ArrayList, shuffled ArrayList
01401  006%     sorted ArrayList,   sorted ArrayList

Now this suggests that the lookup in the sorted ArrayList (using a contains() call for each element of the list to the left) is faster than in the shuffled list. Now that would be quite easy if we could make use of the fact that it is sorted and use a binary search, but I do not do that.

Both results are mysterious to me. I cannot explain them by looking at the code or with my data-structure knowledge. Does it have anything to do with processor cache access patterns? Is the JIT compiler optimizing stuff? But if so, which? I performed warming up and run the tests a few times in a row, but perhaps there is a fundamental problem with my benchmark?

This is a silly question but have you double checked that the "shuffling" is not part of the timing? — posdef, Apr 14 '15 at 12:24
There may be many reasons (caching, branch prediction etc...). A http://stackoverflow.com/help/mcve might attract further, more detailed answers. — Marco13, Apr 14 '15 at 12:24
I just ran a simple test and can basically confirm the results. For the slower `sortedList.removeAll(shuffledList)`, I'd blame branch prediction ( http://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-an-unsorted-array ). The slower `shuffledList.removeAll(set)` may, as mentioned above, be related to caching (and maybe branch prediction as well) — Marco13, Apr 14 '15 at 13:11
@Marco13 - I'm not going to be convinced unless I see >>your<< benchmark code, either. — Stephen C, Apr 14 '15 at 15:30

score 4 · Accepted Answer · edited May 23 '17 at 12:16

The reason for the performance difference is the memory access pattern: accessing elements which are consecutive in memory is faster than doing a random memory access (due to memory pre-fetching, cpu caches etc.)

When you initially populate the collection you create all the elements sequentially in the memory, so when you are traversing it (foreach, removeAll, etc) you are accessing consecutive memory regions which is cache friendly. When you shuffle the collection - the elements remain in the same order in memory, but the pointers to those elements are no longer in the same order, so when you are traversing the collection you'll be accessing for instance the 10th, the 1st, then the 5th element which is very cache unfriendly and ruins the performance.

You can look at this question where this effect is visible in greater detail: Why filtering an unsorted list is faster than filtering a sorted list

Good point, I just wonder if it would be possible to _prove_ that this is the reason for the observation of the OP. — posdef, Apr 14 '15 at 14:44
It would be possible to prove this if you had a hardware emulator that included accurate modelling of cache behaviour. But you could also get supporting (or otherwise) evidence by running with a range of "size" parameters on a range of hardware with different cache dimensions. — Stephen C, Apr 14 '15 at 22:53

score 4 · Answer 2 · edited May 23 '17 at 12:31

Since the asker did not provide any example code, and there have been doubts about the benchmark mentioned in the comments and answers, I created a small test to see whether the removeAll method is slower when the argument is a shuffled list (instead of a sorted list). And I confirmed the observation of the asker: The output of the test was roughly

100000 elements,   sortedList and   sortedList,  5023,090 ms, size 5000
100000 elements, shuffledList and   sortedList,  5062,293 ms, size 5000
100000 elements,   sortedList and shuffledList, 10657,438 ms, size 5000
100000 elements, shuffledList and shuffledList, 10700,145 ms, size 5000

I'll omit the code for this particular test here, because it also has been questioned (which - by the way - is perfectly justified! A lot of BS is posted on the web...).

So I did further tests, for which I'll provide the code here.

This may also not be considered as a definite answer. But I tried to adjust the tests so that they at least provide some strong evidence that the reason for the reduced performance is indeed what Svetlin Zarev mentioned in his answer (+1 and accept this if it convinces you). Namely, that the reason for the slowdown lies in the caching effects of the scattered accesses.

First of all: I am aware of many of the possible pitfalls when writing a microbenchmark (and so is the asker, according to his statements). However, I know that nobody will believe a ~~lie~~ benchmark, even if it is perfectly reasonable, unless it is performed with an appropriate microbenchmarking tool. So in order to show that the performance with a shuffled list is lower than with a sorted list, I created this simple JMH benchmark:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.infra.Blackhole;

@State(Scope.Thread)
public class RemoveAllBenchmarkJMH
{

    @Param({"sorted", "shuffled"})
    public String method;

    @Param({"1000", "10000", "100000" })
    public int numElements;

    private List<Integer> left;
    private List<Integer> right;

    @Setup
    public void initList()
    {
        left = new ArrayList<Integer>();
        right = new ArrayList<Integer>();
        for (int i=0; i<numElements; i++)
        {
            left.add(i);
        }
        int n = (int)(numElements * 0.95);
        for (int i=0; i<n; i++)
        {
            right.add(i);
        }
        if (method.equals("shuffled"))
        {
            Collections.shuffle(right);
        }
    }

    @Benchmark
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.MICROSECONDS)
    public void testMethod(Blackhole bh)
    {
        left.removeAll(right);
        bh.consume(left.size());
    }
}

The output of this one is as follows:

(method)  (numElements)  Mode  Cnt        Score       Error  Units
  sorted           1000  avgt   50       52,055 ±     0,507  us/op
shuffled           1000  avgt   50       55,720 ±     0,466  us/op
  sorted          10000  avgt   50     5341,917 ±    28,630  us/op
shuffled          10000  avgt   50     7108,845 ±    45,869  us/op
  sorted         100000  avgt   50   621714,569 ± 19040,964  us/op
shuffled         100000  avgt   50  1110301,876 ± 22935,976  us/op

I hope that this helps to resolve doubts about the statement itself.

_{Although I admit that I'm not a JMH expert. If there is something wrong with this benchmark, please let me know}

Now, these results have been roughly in line with my other, manual (non-JMH) microbenchmark. In order to create evidence for the fact that the shuffling is the problem, I created a small test that compares the performance using lists that are shuffled by different degrees. By providing a value between 0.0 and 1.0, one can limit the number of swapped elements, and thus the shuffledness of the list. (Of course, this is rather "pragmatic", as there are different options of how this could be implemented, considering the different possible (statistical) measures for "shuffledness").

The code looks as follows:

import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.function.Function;

public class RemoveAllBenchmarkExt
{
    public static void main(String[] args)
    {
        for (int n=10000; n<=100000; n+=10000)
        {
            runTest(n, sortedList()  , sortedList());
            runTest(n, sortedList()  , shuffledList(0.00));
            runTest(n, sortedList()  , shuffledList(0.25));
            runTest(n, sortedList()  , shuffledList(0.50));
            runTest(n, sortedList()  , shuffledList(0.75));
            runTest(n, sortedList()  , shuffledList(1.00));
            runTest(n, sortedList()  , reversedList());
            System.out.println();
        }
    }

    private static Function<Integer, Collection<Integer>> sortedList()
    {
        return new Function<Integer, Collection<Integer>>()
        {
            @Override
            public Collection<Integer> apply(Integer t)
            {
                List<Integer> list = new ArrayList<Integer>(t);
                for (int i=0; i<t; i++)
                {
                    list.add(i);
                }
                return list;
            }

            @Override
            public String toString()
            {
                return "sorted";
            }
        };
    }

    private static Function<Integer, Collection<Integer>> shuffledList(
        final double degree)
    {
        return new Function<Integer, Collection<Integer>>()
        {
            @Override
            public Collection<Integer> apply(Integer t)
            {
                List<Integer> list = new ArrayList<Integer>(t);
                for (int i=0; i<t; i++)
                {
                    list.add(i);
                }
                shuffle(list, degree);
                return list;
            }

            @Override
            public String toString()
            {
                return String.format("shuffled(%4.2f)", degree);
            }
        };
    }


    private static void shuffle(List<Integer> list, double degree)
    {
        Random random = new Random(0);
        int n = (int)(degree * list.size());
        for (int i=n; i>1; i--)
        {
            swap(list, i-1, random.nextInt(i));
        }
    }
    private static void swap(List<Integer> list, int i, int j)
    {
        list.set(i, list.set(j, list.get(i)));
    }

    private static Function<Integer, Collection<Integer>> reversedList()
    {
        return new Function<Integer, Collection<Integer>>()
        {
            @Override
            public Collection<Integer> apply(Integer t)
            {
                List<Integer> list = new ArrayList<Integer>(t);
                for (int i=0; i<t; i++)
                {
                    list.add(i);
                }
                Collections.reverse(list);
                return list;
            }

            @Override
            public String toString()
            {
                return "reversed";
            }
        };
    }


    private static void runTest(int n,
        Function<Integer, ? extends Collection<Integer>> leftFunction,
        Function<Integer, ? extends Collection<Integer>> rightFunction)
    {

        Collection<Integer> left = leftFunction.apply(n);
        Collection<Integer> right = rightFunction.apply((int)(n*0.95));

        long before = System.nanoTime();
        left.removeAll(right);
        long after = System.nanoTime();
        double durationMs = (after - before) / 1e6;

        System.out.printf(
            "%8d elements, %15s, duration %10.3f ms, size %d\n",
            n, rightFunction, durationMs, left.size());
    }
}

(Yes, it's very simple. However, if you think that the timings are completely useless, compare them to a JMH run, and after a few hours, you'll see that they are reasonable)

The timings for the last pass are as follows:

100000 elements,          sorted, duration   6016,354 ms, size 5000
100000 elements,  shuffled(0,00), duration   5849,537 ms, size 5000
100000 elements,  shuffled(0,25), duration   7319,948 ms, size 5000
100000 elements,  shuffled(0,50), duration   9344,408 ms, size 5000
100000 elements,  shuffled(0,75), duration  10657,021 ms, size 5000
100000 elements,  shuffled(1,00), duration  11295,808 ms, size 5000
100000 elements,        reversed, duration   5830,695 ms, size 5000

One can clearly see that the timings are basically increasing linearly with the shuffledness.

Of course, all this is still not a proof, but at least an evidence that the answer by Svetlin Zarev is correct.

I guess you're perfectly right. It should be possible to start with a random collection and prove the opposite: the more you (bubble) sort, the slower the removal. — maaartinus, Apr 15 '15 at 07:06
@maaartinus Actually, I thought about this, because I also considered that the fact that the numbers are increasing might have an influence here (in *any* way). That's why I added the "reversed" list test: It shows that rather the sequential nature (and *not* the exact *order*) make it fast. But indeed, I'll consider adding a test like the one you suggested. — Marco13, Apr 15 '15 at 08:31

score 1 · Answer 3 · answered Apr 14 '15 at 12:44

Looking at the source code for ArrayList.removeAll() (OpenJDK7-b147) it appears that the it delegates to a private method called batchRemove() which is as follows:

663     private boolean batchRemove(Collection<?> c, boolean complement) {
664         final Object[] elementData = this.elementData;
665         int r = 0, w = 0;
666         boolean modified = false;
667         try {
668             for (; r < size; r++)
669                 if (c.contains(elementData[r]) == complement)
670                     elementData[w++] = elementData[r];
671         } finally {
672             // Preserve behavioral compatibility with AbstractCollection,
673             // even if c.contains() throws.
674             if (r != size) {
675                 System.arraycopy(elementData, r,
676                                  elementData, w,
677                                  size - r);
678                 w += size - r;
679             }
680             if (w != size) {
681                 for (int i = w; i < size; i++)
682                     elementData[i] = null;
683                 modCount += size - w;
684                 size = w;
685                 modified = true;
686             }
687         }
688         return modified;
689     }

It practically loops through the array and has a bunch of c.contains() calls. Basically there's no reason why this iteration would go faster for a sorted array.

I second StephenC's doubt about the benchmark, and believe that it'd be more fruitful for you to scrutinize the benchmark code before digging in any deeper into cache access patterns etc.

Also if the benchmark code is not the culprit, it would be interesting to know the java version, and the OS/arch etc.

There are two "invisible things" contained in every program: Memory access pattern and branching pattern. The OP's benchmark may be wrong, but these pattern influence performance a lot. The faster and simpler the code, the bigger this influence. — maaartinus, Apr 15 '15 at 06:53

Stephen C · Answer 4 · 2015-04-14T12:28:04.023

0

Now I know that micro-benchmarks are difficult to do right, and I won’t look at a few milliseconds difference, but I believe my results to be valid, since I ran them repeatedly and they are very reproducible.

That does not convince me. The behaviour of an flawed benchmark can be 100% reproducible.

I suspect that ... in fact ... a flaw or flaws in your benchmark >>is<< the cause of your strange results. It often is.

... but perhaps there is a fundamental problem with my benchmark?

Yes (IMO).

Show us the benchmark code if you want a more detailed answer.

edited Apr 14 '15 at 12:28

answered Apr 14 '15 at 12:24

Stephen C

698,415
94
811
1,216

Yes it is. It is a plausible explanation of the OP's observation. – Stephen C Apr 14 '15 at 12:26
It is a plausible explanation. It would be less rude if you didn’t say ‘Yes (IMO).’ – Michael Piefel Apr 14 '15 at 13:52
@MichaelPiefel - I can't win. Marco says I'm not answering the question. I answer it and you say it is "rude". FWIW - IMO it is >>not<< rude to agree with the OP's own diagnosis, even if he intended it to be a "rhetorical device" rather than a real question. – Stephen C Apr 14 '15 at 15:28
@StephenC – Where I live, when people speculate, particularly when they speculate about other’s weaknesses, and even when they just agree to self-criticism, they use words such as ‘perhaps’ or ‘most likely’ to soften their tone. It’s just the typical misunderstanding of people on the Internet, and really nothing to worry about. – Michael Piefel Apr 14 '15 at 18:36
@MichaelPiefel - There is no speculation in my Answer about your character. Just speculation about your benchmark. And a statement that I find your reasoning unconvincing. If you are not prepared to wear criticism about the validity of your statements on StackOverflow, then "perhaps" you shouldn't make them. And I still don't know why you won't show us your benchmark code ... and avoid the speculation on their validity. – Stephen C Apr 14 '15 at 22:47
1

@MichaelPiefel Actually, the vast majority of Java performance questions is based on broken benchmarks. To get a non-sense, it's enough to measure method1 *and then* method2. You may find out that method2 is always and reproducibly much slower while it's actually faster. Most people don't know it and it's not rude to tell them. – maaartinus Apr 15 '15 at 07:01
I have created a snippet of the code on https://bitbucket.org/snippets/piefel/r4Rb – it’s open to be ripped apart. Note that I have changed the order of the method calls multiple times, played around with some simplistic warming process etc. The code you see there is not the same anymore as the code that created the timings in the questions. However, as much as I shuffle it all around, the basic result stayed the same. I’m heading over to Caliper now. – Michael Piefel Apr 15 '15 at 11:40

Java performance: Search and removal speed on removeAll()

4 Answers4