From your code sample, your method of time measurement borders on micro-benchmarking, for which measuring a single execution is misleading.
You can see it discussed at length in the following StackOverflow post: How do I write a correct micro-benchmark in Java?
I've written a JMH benchmark to get a more precise measurement of your sample code. It ran on a quad-core i7 with hyper-threading (but it's a laptop: due to power and heat management, the results may be slightly biased against the multithreaded variants, which produce more heat).
@Benchmark
public void testSequentialFor(Blackhole bh, Words words) {
    List<String> filtered = new ArrayList<>();
    for (String word : words.toSort) {
        if (word.startsWith(words.prefix)) {
            filtered.add(word);
        }
    }
    bh.consume(filtered);
}
@Benchmark
public void testParallelStream(Blackhole bh, Words words) {
    bh.consume(words.toSort.parallelStream()
            .filter(w -> w.startsWith(words.prefix))
            .collect(Collectors.toList())
    );
}
@Benchmark
public void testManualThreading(Blackhole bh, Words words) {
    // This is quick and dirty, but gives a decent baseline as to
    // what a manually threaded partitioning can achieve.
    List<Future<List<String>>> async = new ArrayList<>();
    // this has to be tuned to avoid creating "almost empty" work units
    int batchSize = words.size / ForkJoinPool.commonPool().getParallelism();
    int numberOfDispatchedWords = 0;
    while (numberOfDispatchedWords < words.toSort.size()) {
        int start = numberOfDispatchedWords;
        int end = numberOfDispatchedWords + batchSize;
        async.add(words.threadPool.submit(() -> {
            List<String> batch = words.toSort.subList(start, Math.min(end, words.toSort.size()));
            List<String> result = new ArrayList<>();
            for (String word : batch) {
                if (word.startsWith(words.prefix)) {
                    result.add(word);
                }
            }
            return result;
        }));
        numberOfDispatchedWords += batchSize;
    }
    List<String> result = new ArrayList<>();
    for (Future<List<String>> asyncResult : async) {
        try {
            result.addAll(asyncResult.get());
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
    bh.consume(result);
}
@State(Scope.Benchmark)
public static class Words {

    ExecutorService threadPool = ForkJoinPool.commonPool();

    List<String> toSort;

    @Param({"100", "1000", "10000", "100000"})
    private int size;

    private String prefix = "AA";

    @Setup
    public void prepare() {
        // a 4 to 13 letters long, random word
        // for more precision, it should not be that random (use a fixed seed),
        // but given the simple nature of the filtering, I guess it's ok this way
        Supplier<String> wordCreator = () -> RandomStringUtils.random(4 + ThreadLocalRandom.current().nextInt(10));
        toSort = Stream.generate(wordCreator).limit(size).collect(Collectors.toList());
    }
}
Here are the results:
Benchmark                     (size)   Mode  Cnt        Score       Error  Units
PerfTest.testManualThreading     100  thrpt    6    95971,811 ±  1100,589  ops/s
PerfTest.testManualThreading    1000  thrpt    6    76293,983 ±  1632,959  ops/s
PerfTest.testManualThreading   10000  thrpt    6    34630,814 ±  2660,058  ops/s
PerfTest.testManualThreading  100000  thrpt    6     5956,552 ±   529,368  ops/s
PerfTest.testParallelStream      100  thrpt    6    69965,462 ±   451,418  ops/s
PerfTest.testParallelStream     1000  thrpt    6    59968,271 ±   774,859  ops/s
PerfTest.testParallelStream    10000  thrpt    6    29079,957 ±   513,244  ops/s
PerfTest.testParallelStream   100000  thrpt    6     4217,146 ±   172,781  ops/s
PerfTest.testSequentialFor       100  thrpt    6  3553930,640 ± 21142,150  ops/s
PerfTest.testSequentialFor      1000  thrpt    6   356217,937 ±  7446,137  ops/s
PerfTest.testSequentialFor     10000  thrpt    6    28894,748 ±   674,929  ops/s
PerfTest.testSequentialFor    100000  thrpt    6     1725,735 ±    13,273  ops/s
So the sequential version is far faster up to a few thousand elements; manual threading catches up a little before 10k elements, parallel streams a little after, and beyond that the threaded variants perform better.
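One way to act on these numbers is to dispatch on input size. A minimal sketch; the 10_000 cutoff is an assumption read off the crossover point above, so measure on your own hardware before hard-coding anything like it:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

public class PrefixFilter {

    // Assumed cutoff: below it the plain loop wins, above it parallelStream() does.
    private static final int PARALLEL_CUTOFF = 10_000;

    static List<String> filterByPrefix(List<String> words, String prefix) {
        if (words.size() < PARALLEL_CUTOFF) {
            List<String> filtered = new ArrayList<>();
            for (String word : words) {
                if (word.startsWith(prefix)) {
                    filtered.add(word);
                }
            }
            return filtered;
        }
        return words.parallelStream()
                .filter(w -> w.startsWith(prefix))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(filterByPrefix(List.of("AAa", "ABb", "AAc"), "AA")); // [AAa, AAc]
    }
}
```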
Considering the amount of code needed to write the manual-threading variant, and the risk of introducing a bug or an inefficiency by badly calculating the batch size, I'd probably not elect that option even though it looks like it can be faster than streams for huge lists.
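For reference, the batch-size pitfall is easy to defuse with ceiling division; this helper is mine, not part of the benchmark above:

```java
public class Batching {

    // Ceiling division spreads the work over at most `parallelism` batches and
    // never returns 0, which would make a dispatch loop like the one above spin
    // forever when size < parallelism.
    static int batchSize(int size, int parallelism) {
        return Math.max(1, (size + parallelism - 1) / parallelism);
    }

    public static void main(String[] args) {
        System.out.println(batchSize(100_000, 8)); // 12500
        System.out.println(batchSize(3, 8));       // 1
    }
}
```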
I would not try to sort first and then binary-search, as filtering is an O(N) operation while sorting is O(N log N) (on top of which you still have to add the binary search). So unless you have a very precise pattern in the data, I do not see it working to your advantage.
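For completeness, the sort-then-search approach would look roughly like the sketch below; it can only pay off when the list is sorted once and queried many times, since each lookup is then O(log N) plus the size of the matching range. The helper names are mine, and it assumes a non-empty prefix whose last character is not Character.MAX_VALUE:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class PrefixRange {

    // Returns the view of a lexicographically sorted list containing exactly
    // the words that start with prefix, found via two binary searches.
    static List<String> withPrefix(List<String> sorted, String prefix) {
        int from = lowerBound(sorted, prefix);
        // Smallest string greater than every word carrying the prefix:
        // the prefix with its last character incremented.
        String upper = prefix.substring(0, prefix.length() - 1)
                + (char) (prefix.charAt(prefix.length() - 1) + 1);
        int to = lowerBound(sorted, upper);
        return sorted.subList(from, to);
    }

    // Index of the first element >= key (classic lower-bound binary search).
    static int lowerBound(List<String> sorted, String key) {
        int lo = 0, hi = sorted.size();
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (sorted.get(mid).compareTo(key) < 0) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }

    public static void main(String[] args) {
        List<String> words = new ArrayList<>(List.of("AAb", "AAa", "AB", "B", "AA"));
        Collections.sort(words); // pay O(N log N) once...
        System.out.println(withPrefix(words, "AA")); // ...then each query prints [AA, AAa, AAb]
    }
}
```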
Be careful, though, not to draw conclusions this benchmark cannot support. For one thing, it rests on the assumption that the filtering is the only thing happening in the program and competing for CPU time. If you are in any kind of "multi-user" application (e.g. a web application), this is probably not true, and you may very well lose everything you thought you would gain from multithreading.