Java: Performance Optimisation Beyond HashSet.contains()?

Question

For the following code, I get an average computation time of 50 ms. How can I optimize

filter(u -> myStrings.contains(u.getName())

to get faster computation time?

list size 3000000, averaged 100 times
LongSummaryStatistics{count=100, sum=5135, min=46, average=51,350000, max=147}

Remark: the class User2 contains other attributes, too (plus getters, setters).

code:

public class Temp2 {
    static HashSet<String> myStrings = new HashSet<>();
    static long test1(List<User2> user2s) {
        long time1 = System.currentTimeMillis();
        ArrayList<User2> collect = user2s.stream()
                .filter(u -> myStrings.contains(u.getName()))
                .collect(Collectors.toCollection(ArrayList::new));
        long time2 = System.currentTimeMillis();
        return time2 - time1;
    }
    static class User2 {
        String name;
        public User2(String name) {this.name = name;}
        public String getName() {return name;}
        public void setName(String name) {this.name = name;}
    }
    public static void main(String... args) {
        for (int i = 0; i < 15; i++) {myStrings.add(getRandomString());}

        int size = 3_000_000;
        List<User2> user2s = new ArrayList<>();
        for (int i = 0; i < size; i++) {
            user2s.add(new User2(getRandomString()));
        }
        repeat("test:", user2s, Temp2::test1, 100);
    }
    private static void repeat(String name, List<User2> user2s, ToLongFunction<List<User2>> test, int iterations) {
        System.out.println("list size " + user2s.size() + ", averaged " + iterations + " times");
        System.out.println(
                IntStream.range(0, iterations)
                        .mapToLong(i -> test.applyAsLong(user2s))
                        .summaryStatistics());
    }
    private static String getRandomString() {
        SecureRandom random = new SecureRandom();
        return (new BigInteger(130, random).toString(32)).substring(0,12);
    }
}

http://stackoverflow.com/questions/504103/how-do-i-write-a-correct-micro-benchmark-in-java — Andy Turner, Feb 28 '17 at 08:53
I would test whether using a plain old for loop, adding to a pre-sized `ArrayList`, would be faster. — Andy Turner, Feb 28 '17 at 08:54
What is the code actually trying to do? Calling `Map.contains()` is usually a sign of a pre-existing problem. — user207421, Feb 28 '17 at 09:17
I am asking this as an open question since I don’t know. I have read a bit about parallel streams. Might they help here if the asker has a CPU with multiple cores (which I consider likely)? — Ole V.V., Feb 28 '17 at 09:18
@OleV.V. using 'parallelStream()' roughly doubles the performance on my four core machine. — , Feb 28 '17 at 09:25
@EJP this class is a simplification of a much more complex case. The question is: what is the fastest way to find items in a list with some million of entries — , Feb 28 '17 at 09:46
Well the "bottleneck" is likely `contains` on `HashSet`, and that's pretty much optimal. Without knowing more about the problem it will be hard to recommend something better, but there might be O(1) solutions if you don't mind some false positives. — john16384, Feb 28 '17 at 10:27
If @john16384 is correct (which sounds likely in my ear), one thing to consider is whether you have a good `hashCode` method. If what you are looking up is strings (as in the question), there’s probably no good point in trying to better than `String.hashCode()`; but if in your real world code it’s some class of your own, you may want to just check. — Ole V.V., Feb 28 '17 at 11:31
@OleV.V. in my real world code, it is 'String', yes. Could javolution ( http://javolution.org/ ) be a solution for further improvements? — , Feb 28 '17 at 14:32

bashnesnos · Answer 1 · 2017-03-01T15:23:33.677

As other fellow stackoverflower's pointed - the bottleneck is the hashCode method.

Particularly for a String - to compute a hashCode you would need to traverse the string. So the longer String is - the worse it gets, assuming you have a lot of unique strings. Luckily in your example they are quite short - only 12 chars.

Judging by a JMH microbenchmark on my machine (Core-i3 6100U 2.3 GHz) here is the timings per single operation for different implementations of contains:

58ns - HashSet<String>
47ns - HashSet<Integer>
35ns - BitSet

In your case it means, that for a 3000000 list it will take approximately 174ms, 148ms and 105ms respectively on my machine. So you could gain up to ~1.5x speed up by changing a comparison approach.

So you if you make an additional fields in your User2, that will hold a transformed representaion of name which could suit something different than HashSet<String> - you could benefit from there. Taking in consideration, that you have only 15 myStrings - a BitSet could be quite suitable there.

Apart from that, parallelStream really doubles the performance - so if you have cores to spend - that would be the best option.

P.S. the benchmark code:

import org.openjdk.jmh.annotations.*;

import java.math.BigInteger;
import java.security.SecureRandom;
import java.util.*;
import java.util.concurrent.TimeUnit;

@State(Scope.Benchmark)
public class ProperBench {

    private HashSet<String> myStrings;
    private String randomName;

    private HashSet<Integer> myInts;
    private Integer randomInt;

    private BitSet myBits;
    private int randomBit;

    List<User2> user2s = new ArrayList<>();

    private static final int MY_STRINGS_SIZE = 15;

    private class User2 {
        String name;
        public User2(String name) {this.name = name;}
        public String getName() {return name;}
        public void setName(String name) {this.name = name;}
    }


    private String getRandomString() {
        SecureRandom random = new SecureRandom();
        return (new BigInteger(130, random).toString(32)).substring(0,12);
    }


    @Setup(Level.Invocation)
    public void setUpIteration() {
        myStrings = new HashSet<>();
        myInts = new HashSet<>();
        myBits = new BitSet(MY_STRINGS_SIZE);

        SecureRandom random = new SecureRandom();

        for (int i = 0; i < MY_STRINGS_SIZE; i++) {
            String aString = getRandomString();
            myStrings.add(aString);
            myInts.add(aString.hashCode());
            if (random.nextBoolean()) myBits.set(i);
        }

        randomName = getRandomString();
        randomInt = randomName.hashCode();
        randomBit = random.nextInt(MY_STRINGS_SIZE);
    }

    @Benchmark
    @BenchmarkMode(Mode.SampleTime)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    public boolean test1() {
        return myStrings.contains(randomName);
    }

    @Benchmark
    @BenchmarkMode(Mode.SampleTime)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    public boolean test2() {
        return myInts.contains(randomInt);
    }

    @Benchmark
    @BenchmarkMode(Mode.SampleTime)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    public boolean test3() {
        return myBits.get(randomBit);
    }

}

Java: Performance Optimisation Beyond HashSet.contains()?

1 Answers1