4

Given a fixed list of strings at compile-time e.g.:

"zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine"

Utilizing HashSet we have a very fast way (O(1)) to tell if a String provided at runtime is in the list of strings.

For example:

Set<String> SET = new HashSet<>(Arrays.asList( "zero", "one", "two", "three",
        "four", "five", "six", "seven", "eight", "nine"));

boolean listed = SET.contains("some-text");

Are there any other faster ways to tell if a String provided at runtime is in the initial list of strings (given at compile-time) or is HashSet the fastest solution?

Let's Formalize the Question

Given the interface below:

interface Checker {
    String[] VALUES = { "zero", "one", "two", "three", "four", "five", "six",
            "seven", "eight", "nine" };

    boolean contains(String s);
}

Provide the fastest possible implementation given that the values listed in Checker.VALUES will not be changed (so for example you can use those literals in your code if you want to).

Demonstration: HashSetChecker Implementation

The implementation which uses a HashSet would look like this:

class HashSetChecker implements Checker {
    private final Set<String> set = new HashSet<>(Arrays.asList(VALUES));
    @Override
    public boolean contains(String s) {
        return set.contains(s);
    }
}

Code to Test an Implementation

When testing, we want to test the Checker.contains() method with the initial interned strings, also with other interned strings (String literals) which will not be found, and also Strings that have the same values (are equals) but are not interned Strings. For this purpose the following CheckerTester.TESTS array will be used.

public class CheckerTester {
    private static final String[] TESTS = { "zero", "one", "two", "three",
            "four", "five", "six", "seven", "eight", "nine", "ten", "eleven",
            "twelve", "thirteen", "fourteen", "fifteen", "sixteen",
            "seventeen", "eighteen", "nineteen",

            new String("zero"), new String("one"), new String("two"),
            new String("three"), new String("four"), new String("five"),
            new String("six"), new String("seven"), new String("eight"),
            new String("nine"), new String("ten"), new String("eleven"),
            new String("twelve"), new String("thirteen"),
            new String("fourteen"), new String("fifteen"),
            new String("sixteen"), new String("seventeen"),
            new String("eighteen"), new String("nineteen") };

    public static void test(Checker checker) {
        final int N = 1_000_000;

        long start = System.nanoTime();

        for (int i = 0; i < N; i++)
            for (String test : TESTS)
                checker.contains(test);

        long end = System.nanoTime();

        System.out.printf("%s: %d ms\n", checker.getClass().getName(),
                (end - start) / 1_000_000);
    }
}
icza
  • 389,944
  • 63
  • 907
  • 827

3 Answers3

4

Let's see some implementations:

First idea: HashSet with big capacity

Some might say that multiple Strings might end up in the same HashSet bucket, so let's use a big initial capacity:

class HashSetChecker2 implements Checker {
    private final Set<String> set = new HashSet<>(1000);
    { set.addAll(Arrays.asList(VALUES)); }

    @Override
    public boolean contains(String s) {
        return set.contains(s);
    }
}

Using HashMap instead of HashSet

HashSet uses HashMap in its implementation, let's just do the same: get rid of the "shell" HashSet:

class HashMapChecker implements Checker {
    private final Map<String, Object> map = new HashMap<>(1000);
    {
        for (String s : VALUES)
            map.put(s, s);
    }
    @Override
    public boolean contains(String s) {
        return map.containsKey(s);
    }
}

TreeSet

Some might say give TreeSet a try too (it's sorted so it might have a chance). I know it's O(log(n)), but n is small (10 in this case):

class TreeSetChecker implements Checker {
    private final Set<String> set = new TreeSet<>(Arrays.asList(VALUES));
    @Override
    public boolean contains(String s) {
        return set.contains(s);
    }
}

Short-circuit OR checks (if-else logic):

class OrChecker implements Checker {
    @Override
    public boolean contains(String s) {
        return "zero".equals(s) || "one".equals(s) || "two".equals(s)
                || "three".equals(s) || "four".equals(s) || "five".equals(s)
                || "six".equals(s) || "seven".equals(s) || "eight".equals(s)
                || "nine".equals(s);
    }
}

Reference equality then revert to short-circuit OR checks

Some might say that we should first check if by reference we have the String and if not then revert to short-circuit OR checks:

class RefOrChecker extends OrChecker {
    @Override
    public boolean contains(String s) {
        return "zero" == s || "one" == s || "two" == s || "three" == s
                || "four" == s || "five" == s || "six" == s || "seven" == s
                || "eight" == s || "nine" == s || super.contains(s);
    }
}

Using switch: fastest I've found so far

Since we have a fixed list of Strings known at compile-time, we can take advantage of the possibility that we can use Strings in switch statements.

We can add a case for each String from the fixed list and return true, and add a default case to return false.

class SwitchChecker implements Checker {
    @Override
    public boolean contains(String s) {
        switch (s) {
            case "zero":
            case "one":
            case "two":
            case "three":
            case "four":
            case "five":
            case "six":
            case "seven":
            case "eight":
            case "nine":
                return true;
            default:
                return false;
        }
    }
}

New finding: Embedded switches (2 switch blocks)

Maaartinus's answer about perfect hashing made me thinking. Even if we have a perfect hash, it still has to run on the whole content of the String provided at runtime which we want to check. So instead we should use something that is available right in the String: its length. Based on the length of the String we use a switch, and inside that switch we use an internal switch only listing the strings with the specified length. With this, we reduce the number of case statements inside a switch:

class EmbeddedSwitchChecker implements Checker {
    @Override
    public boolean contains(String s) {
        switch (s.length()) {
        case 3:
            switch (s) {
                case "one":
                case "two":
                case "six":
                    return true;
                default:
                    return false;
            }
        case 4:
            switch (s) {
                case "zero":
                case "four":
                case "five":
                case "nine":
                    return true;
                default:
                    return false;
            }
        case 5:
            switch (s) {
                case "three":
                case "seven":
                case "eight":
                    return true;
                default:
                    return false;
            }
        default:
            return false;
        }
    }
}

New finding: CharSwitchChecker: the Champion

This is basically the combination of an improved EmbeddedSwitchChecker and OldCurmudgeon's idea about a state machine: here we use the switch on the first character of the String (but first we check its length), and based on that we either narrowed down to one possible String or if not, we check the 2nd character too in which case the possible String can be only one (and we can decide by calling String.equals()):

class CharSwitchChecker implements Checker {
    @Override
    public boolean contains(String s) {
        final int length = s.length();
        if (length < 3 || length > 5)
            return false;

        switch (s.charAt(0)) {
        case 'z':
            return "zero".equals(s);
        case 'o':
            return "one".equals(s);
        case 't':
            return s.charAt(1) == 'w' ? "two".equals(s) : "three".equals(s);
        case 'f':
            return s.charAt(1) == 'o' ? "four".equals(s) : "five".equals(s);
        case 's':
            return s.charAt(1) == 'i' ? "six".equals(s) : "seven".equals(s);
        case 'e':
            return "eight".equals(s);
        case 'n':
            return "nine".equals(s);
        }
        return false;
    }
}

Test Results

Here are the test results:

                         TIME        HOW FAST (compared to HashSetChecker)
-----------------------------------------------------------------------------
HashSetChecker:          929 ms       1.00x
HashSetChecker2:         892 ms       1.04x
HashMapChecker:          873 ms       1.06x
TreeSetChecker:         2265 ms       0.41x
OrChecker:              1815 ms       0.51x
RefOrChecker:           1708 ms       0.54x
SwitchChecker:           538 ms       1.73x
EmbeddedSwitchChecker:   467 ms       1.99x
CharSwitchChecker:       436 ms       2.13x

The SwitchChecker solution is about 1.7 times faster, the EmbeddedSwitchChecker is 2 times faster and the champion CharSwitchChecker is about 2.13 times faster than the HashSetChecker implementation. As expected the HashSet with big initial capacity and the HashMap solutions are slightly faster, and all other solutions fall behind.

Complete Runnable Test Program (+All Listed Solutions)

The Complete Runnalbe Test Program plus All listed solutions are here in one box for those who want to try it or experiment with new implementations.

Edit: Following Luiggi Mendoza's suggestion for rules for performing a micro benchmark, I changed the main() method for testing. I execute the whole test twice, and I only analyze the 2nd result. Also since the tests don't create new objects in the loop, I see no reason to call System.gc() whatsoever.

import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

interface Checker {
    String[] VALUES = { "zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine" };

    boolean contains(String s);
}

class HashSetChecker implements Checker {
    private final Set<String> set = new HashSet<>(Arrays.asList(VALUES));

    @Override
    public boolean contains(String s) {
        return set.contains(s);
    }
}

class HashSetChecker2 implements Checker {
    private final Set<String> set = new HashSet<>(1000);
    {
        set.addAll(Arrays.asList(VALUES));
    }

    @Override
    public boolean contains(String s) {
        return set.contains(s);
    }
}

class HashMapChecker implements Checker {
    private final Map<String, Object> map = new HashMap<>(1000);
    {
        for (String s : VALUES)
            map.put(s, s);
    }

    @Override
    public boolean contains(String s) {
        return map.containsKey(s);
    }
}

class TreeSetChecker implements Checker {
    private final Set<String> set = new TreeSet<>(Arrays.asList(VALUES));

    @Override
    public boolean contains(String s) {
        return set.contains(s);
    }
}

class OrChecker implements Checker {
    @Override
    public boolean contains(String s) {
        return "zero".equals(s) || "one".equals(s) || "two".equals(s) || "three".equals(s)
                || "four".equals(s) || "five".equals(s) || "six".equals(s) || "seven".equals(s)
                || "eight".equals(s) || "nine".equals(s);
    }
}

class RefOrChecker extends OrChecker {
    @Override
    public boolean contains(String s) {
        return "zero" == s || "one" == s || "two" == s || "three" == s || "four" == s || "five" == s
                || "six" == s || "seven" == s || "eight" == s || "nine" == s || super.contains(s);
    }
}

class SwitchChecker implements Checker {
    @Override
    public boolean contains(String s) {
        switch (s) {
        case "zero":
        case "one":
        case "two":
        case "three":
        case "four":
        case "five":
        case "six":
        case "seven":
        case "eight":
        case "nine":
            return true;
        default:
            return false;
        }
    }
}

class EmbeddedSwitchChecker implements Checker {
    @Override
    public boolean contains(String s) {
        switch (s.length()) {
        case 3:
            switch (s) {
            case "one":
            case "two":
            case "six":
                return true;
            default:
                return false;
            }
        case 4:
            switch (s) {
            case "zero":
            case "four":
            case "five":
            case "nine":
                return true;
            default:
                return false;
            }
        case 5:
            switch (s) {
            case "three":
            case "seven":
            case "eight":
                return true;
            default:
                return false;
            }
        default:
            return false;
        }
    }
}

class CharSwitchChecker implements Checker {
    @Override
    public boolean contains(String s) {
        final int length = s.length();
        if (length < 3 || length > 5)
            return false;

        switch (s.charAt(0)) {
        case 'z':
            return "zero".equals(s);
        case 'o':
            return "one".equals(s);
        case 't':
            return s.charAt(1) == 'w' ? "two".equals(s) : "three".equals(s);
        case 'f':
            return s.charAt(1) == 'o' ? "four".equals(s) : "five".equals(s);
        case 's':
            return s.charAt(1) == 'i' ? "six".equals(s) : "seven".equals(s);
        case 'e':
            return "eight".equals(s);
        case 'n':
            return "nine".equals(s);
        }
        return false;
    }
}

public class CheckerTester {
    private static final String[] TESTS = { "zero", "one", "two", "three", "four", "five", "six", "seven",
            "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen",
            "seventeen", "eighteen", "nineteen",

            new String("zero"), new String("one"), new String("two"), new String("three"),
            new String("four"), new String("five"), new String("six"), new String("seven"),
            new String("eight"), new String("nine"), new String("ten"), new String("eleven"),
            new String("twelve"), new String("thirteen"), new String("fourteen"), new String("fifteen"),
            new String("sixteen"), new String("seventeen"), new String("eighteen"), new String("nineteen") };

    public static void test(Checker checker) {
        final int N = 1_000_000;

        long start = System.nanoTime();

        for (int i = 0; i < N; i++)
            for (String test : TESTS)
                checker.contains(test);

        long end = System.nanoTime();

        System.out.printf("%s: %d ms\n", checker.getClass().getName(), (end - start) / 1_000_000);
    }

    public static void main(String args[]) {
        for (int i = 1; i <= 2; i++) {
            System.out.println("---- Check #" + i);
            test(new HashSetChecker());
            test(new HashSetChecker2());
            test(new HashMapChecker());
            test(new TreeSetChecker());
            test(new OrChecker());
            test(new RefOrChecker());
            test(new SwitchChecker());
            test(new EmbeddedSwitchChecker());
            test(new CharSwitchChecker());
        }
    }

}
Community
  • 1
  • 1
icza
  • 389,944
  • 63
  • 907
  • 827
  • 3
    The only downside I find in your test is that it doesn't comply with the [rules for performing a micro benchmark](http://stackoverflow.com/a/513259/1065197). Please change the execution to use [JMH](http://openjdk.java.net/projects/code-tools/jmh/) or some framework over JUnit like [caliper](https://code.google.com/p/caliper/). – Luiggi Mendoza Oct 09 '14 at 21:36
  • 2
    Sadly these benchmarks are invalid. They are therefore mostly useless. – Boris the Spider Oct 09 '14 at 21:38
  • @LuiggiMendoza Well, my intention with testing was just to tell if an implementation is faster or slower than using `HashSet`. That much it can tell. – icza Oct 09 '14 at 21:38
  • Nope, you cannot tell that. That is a massive fallacy of Java microbenching that you have fallen into. You are mostly timing the JIT. Please **always use Caliper** for microbenching. – Boris the Spider Oct 09 '14 at 21:40
  • I understand that. But the way you get and compare the results fails horribly. This is because in the latest methods JIT will kick in and the code would be optimized. Please refer to the link I provided in my first comment to know how to update the `public static void main` to get more precise results. – Luiggi Mendoza Oct 09 '14 at 21:41
  • Here are [some Caliper results](https://microbenchmarks.appspot.com/runs/32a9807a-10dc-467e-8450-ce1c80776c65#r:scenario.benchmarkSpec.methodName) for your reference. As expected the `switch` is the fastest as it is a compile time construct whereas the hashes are all runtime constructs. I didn't bother with the `TreeSet` as it's `O(lg n)` rather than `O(1)`. I also didn't bother with the `HashMap` as a `HashSet` _is a_ `HashMap`. – Boris the Spider Oct 09 '14 at 22:34
  • Yup, if you change the order of the tests you'll see different results – Steve Kuo Oct 09 '14 at 22:51
  • I changed the testing according to tips given in [rules for performing a micro benchmark](http://stackoverflow.com/questions/504103/how-do-i-write-a-correct-micro-benchmark-in-java/513259#513259), now they are what we would expect. – icza Oct 10 '14 at 06:18
  • Without using Caliper, these benchmarks are still unreliable. Caliper will instrument the JVM and ensure that no JIT compilations takes place during the tests. With tests running for several seconds - as you have - it is likely that completely unrelated thinks will begin to get JITted in the background, not to mention a GC will have to happen once of twice during the testing. Although these numbers are slightly more reliable, I certainly wouldn't make any conclusions off of them or consider them good enough to post as an answer. – Boris the Spider Oct 10 '14 at 06:43
  • @BoristheSpider Yes, I agree with you. Altough GC is not in play in my case because no Objects are created in the loops. The Objects that are created are really just the `Checker` implementations and optionally a `HashSet` or `HashMap` for some of them. The Strings that are checked are also not created in the loop but only once at the start of the application. But nonetheless Caliper should be used to qualify or correct the results. – icza Oct 10 '14 at 06:52
2

The fastest solution could be based on perfect hashing. Find a fast hash function mapping all your strings onto distinct integers in a small range and build a hash table. The hash table is fast as there are by definition no collisions. Finding the perfect hash function may take long and finding a fast one may be even harder.

One example where it works pretty well is Guava's CharMatcher.WHITESPACE, where all whitespace characters get mapped onto the set {0, ..., 31} by using a multiplication and a shift (see here for some explanation). The previous implementation used perfect hashing, too, but was way slower because of division.

Finding a fast perfect hash for your 10 strings and a table of size 16 should be rather simple.

Community
  • 1
  • 1
maaartinus
  • 44,714
  • 32
  • 161
  • 320
  • A `switch` will still be faster as it will calculate the jump table at compile time and hardcode it in the compiled bytecode. Any runtime construct will require more instructions to be called. – Boris the Spider Oct 10 '14 at 06:40
  • @BoristheSpider Maybe... sometimes... but rather not. The switch could be faster if it used something like that, but it doesn't. The bytecode is irrelevant, the native code contains AFAIK also a jump table, and this is a real pain for the pipeline. Look at my 2nd link, there's no jump table, just a data table. I'd be curious, how a `switch` would perform, do you want to [give it a try](http://code.google.com/p/guava-libraries/source/browse/guava-tests/benchmark/com/google/common/base/WhitespaceMatcherBenchmark.java)? – maaartinus Oct 10 '14 at 07:54
1

One improvement on @icza's excellent suggestions is a state machine. Here's an implementation that looks like it may be quite efficient. Perhaps @icza will incorporate it into his time tests and we'll see how it fares.

Essentially it builds a static tree structure that can be traversed taking one step per character of the test string. If at any time the required step is not in the tree we can signal a mis-match. If we get to the end of the string we then check if this ending is legal.

This should be a O(k) algorithm (if there was such a thing) as it's run-time is linear to the length of the input string but there is clearly some set-up time.

public class Test {

    interface Checker {

        Set<String> VALUES = new HashSet<>(Arrays.asList("zero", "one", "two", "three", "four", "five", "six",
                "seven", "eight", "nine"));

        boolean contains(String s);
    }

    public static class HatChecker implements Checker {

        // Can't think of a name.
        static class Hats {

            // All possible children.
            Hats[] hats = new Hats[256];
            // Are we at the end of a word.
            boolean end = false;
        }

        // Root hats - contains one entry fr each possible fisrt characetr.
        static Hats root = new Hats();

        /**
         * Where should it go?
         */
        private static Hats find(String s, boolean grow) {
            Hats hats = root;
            for (int i = 0; i < s.length(); i++) {
                int ch = s.charAt(i);
                Hats newHats = hats.hats[ch];
                // Not seen this sequence yet?
                if (newHats == null) {
                    if (grow) {
                        // Allowed to grow.
                        newHats = hats.hats[ch] = new Hats();
                    } else {
                        // No growing - stop here.
                        return null;
                    }
                }
                hats = newHats;
            }
            return hats;
        }

        /**
         * Add to the structures.
         */
        private static void add(String s) {
            // Grow it and margk it good.
            find(s, true).end = true;
        }

        static {
            // Grow my structure.
            for (String s : VALUES) {
                add(s);
            }
        }

        @Override
        public boolean contains(String s) {
            // Find where it should be but don't grow.
            Hats found = find(s, false);
            // It's a match if it wa sthere and was an end.
            return found != null && found.end;
        }

    }

    private static class Check {

        private final String s;
        private final boolean matches;

        public Check(String s) {
            this.s = s;
            this.matches = Checker.VALUES.contains(s);
        }

        public String toString() {
            return "(" + s + ")=" + matches;
        }
    }
    private static final Check[] TESTS = {
        new Check("zero"),
        new Check("one"),
        new Check("two"),
        new Check("three"),
        new Check("four"),
        new Check("five"),
        new Check("six"),
        new Check("seven"),
        new Check("eight"),
        new Check("nine"),
        new Check("ten"),
        new Check("eleven"),
        new Check("twelve"),
        new Check("thirteen"),
        new Check("fourteen"),
        new Check("fifteen"),
        new Check("sixteen"),
        new Check("seventeen"),
        new Check("eighteen"),
        new Check("nineteen"),
        new Check(new String("zero")),
        new Check(new String("one")),
        new Check(new String("two")),
        new Check(new String("three")),
        new Check(new String("four")),
        new Check(new String("five")),
        new Check(new String("six")),
        new Check(new String("seven")),
        new Check(new String("eight")),
        new Check(new String("nine")),
        new Check(new String("ten")),
        new Check(new String("eleven")),
        new Check(new String("twelve")),
        new Check(new String("thirteen")),
        new Check(new String("fourteen")),
        new Check(new String("fifteen")),
        new Check(new String("sixteen")),
        new Check(new String("seventeen")),
        new Check(new String("eighteen")),
        new Check(new String("nineteen"))};

    public void timeTest(Checker checker) {
        System.out.println("Time");
        final int N = 1_000_000;

        long start = System.nanoTime();

        for (int i = 0; i < N; i++) {
            for (Check check : TESTS) {
                checker.contains(check.s);
            }
        }

        long end = System.nanoTime();

        System.out.printf("%s: %d ms\n", checker.getClass().getName(),
                (end - start) / 1_000_000);
    }

    public void checkerTest(Checker checker) {
        System.out.println("Checker");
        for (Check check : TESTS) {
            if (checker.contains(check.s) != check.matches) {
                System.err.println("Check(" + check + ") failed");
            }
        }
    }

    public static void main(String args[]) {
        try {
            Checker checker = new HatChecker();
            Test test = new Test();
            test.checkerTest(checker);
            test.timeTest(checker);
        } catch (Throwable ex) {
            ex.printStackTrace(System.err);
        }
    }
}

I suspect this will be on par with the case statement - It should be interesting though.

Sorry about the Hat naming - I just couldn't thing of a good name.

OldCurmudgeon
  • 64,482
  • 16
  • 119
  • 213
  • Good idea. Tested it and it was quite fast, even faster than the one using `HashSet` but was slower than the `switch` implementation. – icza Oct 10 '14 at 05:23
  • 1
    I don't think anything will beat a `switch` because it is compile time. A `switch` is essentially a jump table - any runtime code will require several method invocations. – Boris the Spider Oct 10 '14 at 06:39
  • @BoristheSpider Yes, the `switch` might be a jump table generated at compile time, but still the JVM has to "analyze" the `String` provided at runtime to decide where to jump. My `CharSwitchChecker` implementation does not use "String in switch" and is still faster (but yes, it still uses a `switch`). – icza Oct 10 '14 at 07:30
  • I would guess that the compiler will be allowed to do quite a lot of special work behind the scenes to make string switches performant. I would probably find a way to lock down their `intern` values so you could achieve a case switch in perhaps one or maybe two lookups. – OldCurmudgeon Oct 10 '14 at 09:20