
Just a sample exploration of removing duplicates from a list of strings. The old approach outperforms the new lambda/stream approach in execution time. What is the design thought behind the new approach, and does it come with benefits other than performance?

List<String> nameList = new ArrayList<>();
Collections.addAll(nameList, "Raj","Nil",.......);
removeDupViaSet(nameList);
removeDupViaStream(nameList);

private static void removeDupViaStream(List<String> nameList) {
    long start = System.nanoTime();
    List<String> nm = nameList.stream().distinct().collect(Collectors.toList());
    long end = System.nanoTime() - start;
    System.out.println("Dup Removed via Stream : " + end);
}


private static void removeDupViaSet(List<String> nameList) {
    long start = System.nanoTime();
    Set<String> tempHashSet = new HashSet<>();
    tempHashSet.addAll(nameList);
    nameList.clear();
    nameList.addAll(tempHashSet);
    long end = System.nanoTime() - start;
    System.out.println("Dup Removed via Set : " + end);
}

Dup Removed via Set : 1186909

Dup Removed via Stream : 67513136

Mark Rotteveel
Rizwan
    Lambdas let you do the same thing, often in fewer lines of code. This is one reason for using them, if you want to avoid verbose pre Java 8 code patterns. – Tim Biegeleisen Feb 16 '18 at 09:35
  • 3
    You can't measure performance that way - the stream version will *probably* be slightly slower due to the stream overhead but I doubt it would be 70x slower. See: https://stackoverflow.com/questions/504103/how-do-i-write-a-correct-micro-benchmark-in-java – assylias Feb 16 '18 at 09:37
  • Not an explanation for the difference in runtime, but you are aware that the set-approach might scramble the order of the elements in the list, right? Using a LinkedHashSet might help, though. – tobias_k Feb 16 '18 at 10:02
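Following up on the last comment: the `HashSet` approach can indeed reorder the elements, while `LinkedHashSet` keeps the first-occurrence order (as `distinct()` also does on an ordered stream). A minimal sketch of the order-preserving variant, using hypothetical sample names:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

public class DedupOrderPreserving {

    // Removes duplicates while keeping the order of first occurrence,
    // unlike HashSet, which makes no ordering guarantee at all.
    static List<String> removeDupViaLinkedHashSet(List<String> names) {
        return new ArrayList<>(new LinkedHashSet<>(names));
    }

    public static void main(String[] args) {
        List<String> names = new ArrayList<>(
                Arrays.asList("Raj", "Nil", "Raj", "Amy", "Nil"));
        System.out.println(removeDupViaLinkedHashSet(names)); // [Raj, Nil, Amy]
    }
}
```
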

2 Answers


The contract for distinct() reads:

Returns a stream consisting of the distinct elements (according to Object.equals(Object)) of this stream.

So since it is specified only in terms of `equals`, and therefore cannot rely on `hashCode`, you are comparing apples with oranges. Any algorithm that is allowed to use `hashCode` (such as `HashSet`) will certainly outperform `distinct()`.

OldCurmudgeon
  • I just tried this with a test class, incrementing some counters each time hashcode or equals gets called, and they get called in both cases, and about equally often (using stream, hashcode even gets called a bit more often, but not enough to explain the difference) – tobias_k Feb 16 '18 at 10:10
  • Anything allowed to use `equals` is also allowed to use `hashCode`, as doing otherwise makes no sense... I can't see this confirmed anywhere in the doc, but I wouldn't hesitate a second. – maaartinus Feb 18 '18 at 15:15

Generally speaking, streams will not give you better performance all the time. It comes down to the operation and the data size: e.g. parallel streams on a list of size 10 will not be as efficient as the old-style approach, and only start performing well beyond sizes of around 10k elements.

The best bet is to consider both the operation and the data size when evaluating performance.
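One reason the question's numbers are so lopsided is that each method is timed exactly once, on a cold JVM, so the stream path pays one-time class-loading and JIT costs. A rough sketch of a fairer comparison: warm both paths up before measuring (JMH, linked in the comments, is still the proper tool; the data size of 100,000 below is an arbitrary assumption for illustration):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.stream.Collectors;

public class DedupTimingSketch {

    static List<String> viaSet(List<String> in) {
        return new ArrayList<>(new HashSet<>(in));
    }

    static List<String> viaStream(List<String> in) {
        return in.stream().distinct().collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Build a list that is guaranteed to contain duplicates.
        Random rnd = new Random(42);
        List<String> data = new ArrayList<>();
        for (int i = 0; i < 100_000; i++) {
            data.add("name-" + rnd.nextInt(10_000));
        }

        // Warm up both paths so the JIT has compiled the hot code
        // before any measurement is taken.
        for (int i = 0; i < 50; i++) {
            viaSet(data);
            viaStream(data);
        }

        long t0 = System.nanoTime();
        List<String> bySet = viaSet(data);
        long setNanos = System.nanoTime() - t0;

        t0 = System.nanoTime();
        List<String> byStream = viaStream(data);
        long streamNanos = System.nanoTime() - t0;

        System.out.println("Set    : " + setNanos + " ns, " + bySet.size() + " unique");
        System.out.println("Stream : " + streamNanos + " ns, " + byStream.size() + " unique");
    }
}
```

Even warmed up, single-shot timings like these only give a rough order of magnitude; the gap between the two approaches should shrink dramatically compared with the cold-start numbers in the question.
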

fabfas