
Since Java 8 was released, I've found I no longer need the 2 MB+ Google Guava dependency in my projects, because I can replace most of it with plain Java. However, I really liked the Splitter API, which was both pleasant to use and quite fast. Most importantly, it split lazily. It seems replaceable with Pattern.splitAsStream, so I prepared a quick test: finding a value in the middle of a long string (i.e. a case where splitting the whole string does not make sense).
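To show the lazy behaviour I mean in isolation, here is a small illustrative snippet (the input and search value are made up, not part of the benchmark below): findFirst short-circuits, so the tail of the string is never split.

```java
import java.util.Optional;
import java.util.regex.Pattern;

public class LazySplitSketch {
    public static void main(String[] args) {
        // splitAsStream produces tokens on demand, so findFirst can
        // stop without splitting the remainder of the input
        String csv = "1 ,2 ,3 ,4 ,5";
        Optional<String> hit = Pattern.compile(",").splitAsStream(csv)
                .map(String::trim)
                .filter("3"::equals)
                .findFirst();
        System.out.println(hit.orElse("not found")); // prints "3"
    }
}
```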

package splitstream;


import com.google.common.base.Splitter;
import org.junit.Assert;
import org.junit.Test;

import java.util.StringTokenizer;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class SplitStreamPerfTest {

    private static final int TIMES = 1000;
    private static final String FIND = "10000";

    @Test
    public void go() throws Exception {
        final String longString = IntStream.rangeClosed(1,20000).boxed()
                .map(Object::toString)
                .collect(Collectors.joining(" ,"));

        IntStream.rangeClosed(1,3).forEach((i) -> {
            measureTime("Test " + i + " with regex", () -> doWithRegex(longString));
            measureTime("Test " + i + " with string tokenizer", () -> doWithStringTokenizer(longString));
            measureTime("Test " + i + " with guava", () -> doWithGuava(longString));
        });

    }

    private void measureTime(String name, Runnable r) {
        long s = System.currentTimeMillis();
        r.run();
        long elapsed = System.currentTimeMillis() - s;
        System.out.println("Check " + name +" took " + elapsed + " ms");
    }

    private void doWithStringTokenizer(String longString) {

        String f = null;
        for (int i = 0; i < TIMES; i++) {
            StringTokenizer st = new StringTokenizer(longString,",",false);
            while (st.hasMoreTokens()) {
                String t = st.nextToken().trim();
                if (FIND.equals(t)) {
                    f = t;
                    break;
                }
            }
        }
        Assert.assertEquals(FIND, f);
    }


    private void doWithRegex(String longString) {
        final Pattern pattern = Pattern.compile(",");
        String f = null;
        for (int i = 0; i < TIMES; i++) {
            f = pattern.splitAsStream(longString)
                    .map(String::trim)
                    .filter(FIND::equals)
                    .findFirst().orElse("");
        }
        Assert.assertEquals(FIND, f);
    }


    private void doWithGuava(String longString) {
        final Splitter splitter = Splitter.on(',').trimResults();
        String f = null;
        for (int i = 0; i < TIMES; i++) {
            Iterable<String> iterable = splitter.split(longString);
            for (String s : iterable) {
                if (FIND.equals(s)) {
                    f = s;
                    break;
                }
            }
        }
        Assert.assertEquals(FIND, f);
    }
}

The results (after a warm-up) are:

Check Test 3 with regex took 1359 ms
Check Test 3 with string tokenizer took 750 ms
Check Test 3 with guava took 594 ms

How can I make the plain Java implementation as fast as Guava's? Maybe I'm doing it wrong?

Or maybe you know of a tool/library as fast as Guava's Splitter that does not involve pulling in tons of unused classes just for this one thing?

Piotr Gwiazda
  • I thought that pattern `\s*,\s*` might be faster than `Splitter.on(',').trimResults()` but it was even slower so I removed it. – Piotr Gwiazda Jul 09 '17 at 22:13
  • Have you profiled the runs? – bichito Jul 10 '17 at 00:21
  • Did you warm-up JVM? You'd better use [jmh](http://openjdk.java.net/projects/code-tools/jmh/) for such benchmarks, not one JUnit test. Also both Pattern and Splitter objects can be constants. What's important here is probably the fact that `Splitter#trimResults` uses `CharMatcher` internally which can be more efficient than `String::trim` because the latter allocates a new array each time. Finally, regex can be slower than matching chars sequentially. – Grzegorz Rożniecki Jul 10 '17 at 00:35
  • What's 2MB? Especially when you can ProGuard away unused classes. – Louis Wasserman Jul 10 '17 at 03:18
  • After three runs the results became constant, so this should be good for warm-up. Thanks for the trim hint. It still means you can't ignore Guava after Java 8. Maybe rebuilding a light version might be a solution. – Piotr Gwiazda Jul 10 '17 at 05:01
  • @PiotrGwiazda follow Louis' advice and use [ProGuard](https://github.com/google/guava/wiki/UsingProGuardWithGuava) to shrink Guava's size to your needs. – Olivier Grégoire Jul 10 '17 at 10:56
  • Can I use ProGuard before adding to my project so that ctrl+space don't show me two `Optional` classes, `Function` classes etc? – Piotr Gwiazda Jul 10 '17 at 13:18

4 Answers


First thing: Guava is so much more than just Splitter, Predicate and Function; you are probably not using everything it has to offer. We use it heavily, and just hearing that makes me shiver. Anyhow, your tests are broken, probably in numerous ways. I've used JMH to benchmark these methods just for the fun of it:

import com.google.common.base.Splitter;
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

import java.util.StringTokenizer;
import java.util.concurrent.TimeUnit;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 5, time = 2, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 2, timeUnit = TimeUnit.SECONDS)
@State(Scope.Thread)
public class GuavaTest {

    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder().include(GuavaTest.class.getSimpleName())
                .jvmArgs("-ea", "-Xms10g", "-Xmx10g")
                .shouldFailOnError(true)
                .build();
        new Runner(opt).run();
    }

    @Param(value = { "300", "1000" })
    public String tokenToSearchFor;

    @State(Scope.Benchmark)
    public static class ThreadState {
        String longString = IntStream.range(1, 20000).boxed()
                .map(Object::toString)
                .collect(Collectors.joining(" ,"));

        StringTokenizer st = null;
        Pattern pattern = null;
        Splitter splitter = null;

        // StringTokenizer is stateful, so it has to be recreated per invocation
        @Setup(Level.Invocation)
        public void setUp() {
            st = new StringTokenizer(longString, ",", false);
            pattern = Pattern.compile(",");
            splitter = Splitter.on(',').trimResults();
        }
    }

    @Benchmark
    @Fork(1)
    public boolean doWithStringTokenizer(ThreadState ts) {
        while (ts.st.hasMoreTokens()) {
            String t = ts.st.nextToken().trim();
            if (t.equals(tokenToSearchFor)) {
                return true;
            }
        }
        return false;
    }

    @Benchmark
    @Fork(1)
    public boolean doWithRegex(ThreadState ts) {
        return ts.pattern.splitAsStream(ts.longString)
                .map(String::trim)
                .anyMatch(tokenToSearchFor::equals);
    }

    @Benchmark
    @Fork(1)
    public boolean doWithGuava(ThreadState ts) {
        Iterable<String> iterable = ts.splitter.split(ts.longString);
        for (String s : iterable) {
            if (s.equals(tokenToSearchFor)) {
                return true;
            }
        }
        return false;
    }
}

And the results:

Benchmark                        (tokenToSearchFor)  Mode  Cnt       Score        Error  Units
GuavaTest.doWithGuava                           300  avgt    5   19284.192 ±  23536.321  ns/op
GuavaTest.doWithGuava                          1000  avgt    5   67182.531 ±  93242.266  ns/op
GuavaTest.doWithRegex                           300  avgt    5   65780.954 ± 169044.641  ns/op
GuavaTest.doWithRegex                          1000  avgt    5  182530.069 ± 409571.222  ns/op
GuavaTest.doWithStringTokenizer                 300  avgt    5   34111.030 ±  61014.332  ns/op
GuavaTest.doWithStringTokenizer                1000  avgt    5  118963.048 ± 165510.183  ns/op      

That indeed makes Guava the fastest.

If you add parallel() to the splitAsStream pipeline, it becomes interesting; a must read here.
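For what it's worth, a hedged sketch of what that parallel variant could look like (whether it actually helps depends on input size and match cost; note that parallelism trades away some of the laziness, since more of the input may get split eagerly):

```java
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class ParallelSplitSketch {
    public static void main(String[] args) {
        // same shape of input as the benchmark: 20000 numbers joined by " ,"
        String longString = IntStream.rangeClosed(1, 20000).boxed()
                .map(Object::toString)
                .collect(Collectors.joining(" ,"));
        // parallel() lets the stream framework distribute tokens across threads
        boolean found = Pattern.compile(",").splitAsStream(longString)
                .parallel()
                .map(String::trim)
                .anyMatch("10000"::equals);
        System.out.println(found); // prints "true"
    }
}
```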

Eugene
  • I think that re-using StringTokenizer is not correct. It's stateful so once you find an element in 1st run then in each run it returns current element. – Piotr Gwiazda Jul 10 '17 at 13:22
  • Actually I see that `StringTokenizer` does the heavy lifting in constructor - finding the delimiters. Actually I'm not sure if this is still a lazy solution. `StreamTokenizer` would be lazy or `Scanner`. – Piotr Gwiazda Jul 10 '17 at 13:29
  • @PiotrGwiazda ah, so each run of the benchmark would need a new instance of `StringTokenizer`; but then this burden would need to be added to each test - those that use `Splitter` and `Pattern`... – Eugene Jul 10 '17 at 14:01
  • @PiotrGwiazda edited. Guava does "win", but no one said that regex matching is faster than plain character matching. We are comparing different things here. Btw, Guava 21 is Java 8 compatible; you can use lambdas and functions with it. – Eugene Jul 10 '17 at 14:06
  • I'll give Scanner and StreamTokenizer a try later. – Piotr Gwiazda Jul 10 '17 at 15:29
  • Scanner is so ridiculously slow that I won't even post results. Like 10 times slower than Guava. Answer - say sorry to Guava, and get used to it. – Piotr Gwiazda Jul 11 '17 at 08:52
  • @PiotrGwiazda there are a few things... first, `jmh` is the de-facto tool to measure, but it's easy to get something wrong (I've been through their examples like 100 times and still feel like I know very little). It's also interesting why you measure this; for personal interest? The regex btw takes `0.06` milliseconds; that's pretty darn fast to me... – Eugene Jul 11 '17 at 08:58
  • I'm measuring it to get the idea why somebody put so much effort in implementing CharMatcher etc. Also in Java 8 world I'm using like 5% of Guava and I'm looking for replacements to tidy my classpath. It turned out that I'm using just Splitter, Preconditions and String utils. The collection and immutable collections have crazy API and I've already refactored them out carefully. – Piotr Gwiazda Jul 11 '17 at 09:12

This might be useful: you can import just the parts of Guava you need: https://github.com/google/guava/wiki/UsingProGuardWithGuava

Ohad Rubin

Could you give pattern.split(text), iterating over the result with a plain for loop, a try? It might be faster than the stream, though I'm not sure it will beat Guava.

I meant this:

private void doWithRegexAndSplit(String longString) {
    final Pattern pattern = Pattern.compile(",");
    String f = null;
    for (int i = 0; i < TIMES; i++) {
        String[] arr = pattern.split(longString);
        for (int j = 0; j < arr.length; j++) {
            String t = arr[j].trim();
            if (FIND.equals(t)) {
                f = t;
                break;
            }
        }
    }
    Assert.assertEquals(FIND, f);
}

Please check the completion time for this case.

Ashish Lohia
  • It's way slower and memory ineffective. See that the challenge is that there are 20,000 numbers in CSV but you want to stop in the middle and don't split further. – Piotr Gwiazda Jul 10 '17 at 13:20

You are comparing Pattern.splitAsStream(CharSequence) to Splitter.split(CharSequence) on a Splitter.on(char) instead of on a Splitter.onPattern(String). Finding matches to a char is computationally much simpler than finding matches to a pattern (regex).
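The gap can be illustrated in plain JDK terms (Guava is left out so the snippet stays dependency-free; `splitOnChar` is a hypothetical helper, written here only to make the point): scanning for a literal char with indexOf does far less work per position than driving a regex engine, which is essentially the difference between Splitter.on(char) and Splitter.onPattern(String).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class CharVsRegexSplit {
    // char-based splitting: a plain indexOf scan, similar in spirit
    // to what a char-delimited splitter does internally
    static List<String> splitOnChar(String s, char delim) {
        List<String> out = new ArrayList<>();
        int start = 0, idx;
        while ((idx = s.indexOf(delim, start)) >= 0) {
            out.add(s.substring(start, idx));
            start = idx + 1;
        }
        out.add(s.substring(start));
        return out;
    }

    public static void main(String[] args) {
        String csv = "a,b,c";
        // regex-based splitting consults a compiled pattern at each
        // position, which is the extra cost being measured above
        String[] viaRegex = Pattern.compile(",").split(csv);
        List<String> viaChar = splitOnChar(csv, ',');
        System.out.println(viaChar);                            // prints "[a, b, c]"
        System.out.println(viaChar.size() == viaRegex.length);  // prints "true"
    }
}
```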

If you use Splitter.onPattern(",").trimResults() then you will get results like the following:

Check Test 3 with regex took 608 ms
Check Test 3 with string tokenizer took 403 ms
Check Test 3 with guava took 306 ms
Check Test 3 with guava pattern took 689 ms

In which case Pattern.splitAsStream(CharSequence) actually performs better than Guava's implementation (assuming this is a valid benchmark, which is always questionable because we're not using jmh).

I am not aware of any JDK char delimited splitting solution similar to Guava's Splitter.on(char).split(CharSequence). You could roll your own but Guava's solution appears to be very optimized.
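For completeness, a minimal sketch of what rolling your own might look like (a hypothetical class, not from any library, and far less polished than Guava's): it advances through the string with indexOf and stops as soon as the caller stops pulling tokens, so it stays lazy.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

public class LazyCharSplitter implements Iterable<String> {
    private final String input;
    private final char delim;

    public LazyCharSplitter(String input, char delim) {
        this.input = input;
        this.delim = delim;
    }

    @Override
    public Iterator<String> iterator() {
        return new Iterator<String>() {
            private int pos = 0;          // index of the next unread char
            private boolean done = false;

            @Override
            public boolean hasNext() {
                return !done;
            }

            @Override
            public String next() {
                if (done) throw new NoSuchElementException();
                int idx = input.indexOf(delim, pos);
                String token;
                if (idx < 0) {            // last token: rest of the string
                    token = input.substring(pos).trim();
                    done = true;
                } else {
                    token = input.substring(pos, idx).trim();
                    pos = idx + 1;
                }
                return token;
            }
        };
    }

    public static void main(String[] args) {
        // stops at the match; the tail of the string is never split
        for (String t : new LazyCharSplitter("1 ,2 ,3 ,4", ',')) {
            if (t.equals("3")) {
                System.out.println("found " + t); // prints "found 3"
                break;
            }
        }
    }
}
```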

mfulton26