Since the asker did not provide any example code, and there have been doubts about the benchmark mentioned in the comments and answers, I created a small test to see whether the removeAll
method is slower when the argument is a shuffled list (instead of a sorted list). And I confirmed the observation of the asker: The output of the test was roughly
100000 elements, sortedList and sortedList, 5023,090 ms, size 5000
100000 elements, shuffledList and sortedList, 5062,293 ms, size 5000
100000 elements, sortedList and shuffledList, 10657,438 ms, size 5000
100000 elements, shuffledList and shuffledList, 10700,145 ms, size 5000
I'll omit the code for this particular test here, because it also has been questioned (which - by the way - is perfectly justified! A lot of BS is posted on the web...).
So I did further tests, for which I'll provide the code here.
This may also not be considered as a definite answer. But I tried to adjust the tests so that they at least provide some strong evidence that the reason for the reduced performance is indeed what Svetlin Zarev mentioned in his answer (+1 and accept this if it convinces you). Namely, that the reason for the slowdown lies in the caching effects of the scattered accesses.
First of all: I am aware of many of the possible pitfalls when writing a microbenchmark (and so is the asker, according to his statements). However, I know that nobody will believe a lie benchmark, even if it is perfectly reasonable, unless it is performed with an appropriate microbenchmarking tool. So in order to show that the performance with a shuffled list is lower than with a sorted list, I created this simple JMH benchmark:
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.infra.Blackhole;
@State(Scope.Thread)
public class RemoveAllBenchmarkJMH
{
@Param({"sorted", "shuffled"})
public String method;
@Param({"1000", "10000", "100000" })
public int numElements;
private List<Integer> left;
private List<Integer> right;
@Setup
public void initList()
{
left = new ArrayList<Integer>();
right = new ArrayList<Integer>();
for (int i=0; i<numElements; i++)
{
left.add(i);
}
int n = (int)(numElements * 0.95);
for (int i=0; i<n; i++)
{
right.add(i);
}
if (method.equals("shuffled"))
{
Collections.shuffle(right);
}
}
@Benchmark
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
public void testMethod(Blackhole bh)
{
left.removeAll(right);
bh.consume(left.size());
}
}
The output of this one is as follows:
(method) (numElements) Mode Cnt Score Error Units
sorted 1000 avgt 50 52,055 ± 0,507 us/op
shuffled 1000 avgt 50 55,720 ± 0,466 us/op
sorted 10000 avgt 50 5341,917 ± 28,630 us/op
shuffled 10000 avgt 50 7108,845 ± 45,869 us/op
sorted 100000 avgt 50 621714,569 ± 19040,964 us/op
shuffled 100000 avgt 50 1110301,876 ± 22935,976 us/op
I hope that this helps to resolve doubts about the statement itself.
Although I admit that I'm not a JMH expert. If there is something wrong with this benchmark, please let me know
Now, these results have been roughly in line with my other, manual (non-JMH) microbenchmark. In order to create evidence for the fact that the shuffling is the problem, I created a small test that compares the performance using lists that are shuffled by different degrees. By providing a value between 0.0 and 1.0, one can limit the number of swapped elements, and thus the shuffledness of the list. (Of course, this is rather "pragmatic", as there are different options of how this could be implemented, considering the different possible (statistical) measures for "shuffledness").
The code looks as follows:
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.function.Function;
public class RemoveAllBenchmarkExt
{
public static void main(String[] args)
{
for (int n=10000; n<=100000; n+=10000)
{
runTest(n, sortedList() , sortedList());
runTest(n, sortedList() , shuffledList(0.00));
runTest(n, sortedList() , shuffledList(0.25));
runTest(n, sortedList() , shuffledList(0.50));
runTest(n, sortedList() , shuffledList(0.75));
runTest(n, sortedList() , shuffledList(1.00));
runTest(n, sortedList() , reversedList());
System.out.println();
}
}
private static Function<Integer, Collection<Integer>> sortedList()
{
return new Function<Integer, Collection<Integer>>()
{
@Override
public Collection<Integer> apply(Integer t)
{
List<Integer> list = new ArrayList<Integer>(t);
for (int i=0; i<t; i++)
{
list.add(i);
}
return list;
}
@Override
public String toString()
{
return "sorted";
}
};
}
private static Function<Integer, Collection<Integer>> shuffledList(
final double degree)
{
return new Function<Integer, Collection<Integer>>()
{
@Override
public Collection<Integer> apply(Integer t)
{
List<Integer> list = new ArrayList<Integer>(t);
for (int i=0; i<t; i++)
{
list.add(i);
}
shuffle(list, degree);
return list;
}
@Override
public String toString()
{
return String.format("shuffled(%4.2f)", degree);
}
};
}
private static void shuffle(List<Integer> list, double degree)
{
Random random = new Random(0);
int n = (int)(degree * list.size());
for (int i=n; i>1; i--)
{
swap(list, i-1, random.nextInt(i));
}
}
private static void swap(List<Integer> list, int i, int j)
{
list.set(i, list.set(j, list.get(i)));
}
private static Function<Integer, Collection<Integer>> reversedList()
{
return new Function<Integer, Collection<Integer>>()
{
@Override
public Collection<Integer> apply(Integer t)
{
List<Integer> list = new ArrayList<Integer>(t);
for (int i=0; i<t; i++)
{
list.add(i);
}
Collections.reverse(list);
return list;
}
@Override
public String toString()
{
return "reversed";
}
};
}
private static void runTest(int n,
Function<Integer, ? extends Collection<Integer>> leftFunction,
Function<Integer, ? extends Collection<Integer>> rightFunction)
{
Collection<Integer> left = leftFunction.apply(n);
Collection<Integer> right = rightFunction.apply((int)(n*0.95));
long before = System.nanoTime();
left.removeAll(right);
long after = System.nanoTime();
double durationMs = (after - before) / 1e6;
System.out.printf(
"%8d elements, %15s, duration %10.3f ms, size %d\n",
n, rightFunction, durationMs, left.size());
}
}
(Yes, it's very simple. However, if you think that the timings are completely useless, compare them to a JMH run, and after a few hours, you'll see that they are reasonable)
The timings for the last pass are as follows:
100000 elements, sorted, duration 6016,354 ms, size 5000
100000 elements, shuffled(0,00), duration 5849,537 ms, size 5000
100000 elements, shuffled(0,25), duration 7319,948 ms, size 5000
100000 elements, shuffled(0,50), duration 9344,408 ms, size 5000
100000 elements, shuffled(0,75), duration 10657,021 ms, size 5000
100000 elements, shuffled(1,00), duration 11295,808 ms, size 5000
100000 elements, reversed, duration 5830,695 ms, size 5000
One can clearly see that the timings are basically increasing linearly with the shuffledness.
Of course, all this is still not a proof, but at least an evidence that the answer by Svetlin Zarev is correct.