87

What is the best method of splitting a String into a Stream?

I saw these variations:

  1. Arrays.stream("b,l,a".split(","))
  2. Stream.of("b,l,a".split(","))
  3. Pattern.compile(",").splitAsStream("b,l,a")

My priorities are:

  • Robustness
  • Readability
  • Performance

A complete, compilable example:

import java.util.Arrays;
import java.util.regex.Pattern;
import java.util.stream.Stream;

public class HelloWorld {

    public static void main(String[] args) {
        stream1().forEach(System.out::println);
        stream2().forEach(System.out::println);
        stream3().forEach(System.out::println);
    }

    private static Stream<String> stream1() {
        return Arrays.stream("b,l,a".split(","));
    }

    private static Stream<String> stream2() {
        return Stream.of("b,l,a".split(","));
    }

    private static Stream<String> stream3() {
        return Pattern.compile(",").splitAsStream("b,l,a");
    }

}
shmosel
slartidan
  • 5
    Do you think that there is so much difference that it makes sense to spend time wondering about the "best" way, or have you identified this as a performance hotspot in your program making it at least somewhat justified to try to find the "best" way? – Kayaman Dec 02 '16 at 13:02
  • 6
    Note that `Stream.of()` will call `Arrays.stream()` internally, so clearly that's not the "best". – Kayaman Dec 02 '16 at 13:04
  • 2
    IMHO, the last one is best. No nested parens, clear that it's about regex, and it does not create an intermediate list or array of all segments. But that's just my opinion. – tobias_k Dec 02 '16 at 13:05
  • 2
    I don't think we can particularly judge his reasoning for wanting to know this information. Though "best" is an opinion; do you mean fastest? – Tegra Detra Aug 21 '18 at 23:29
  • 1
    @Prospero OP clearly spelled out what he's looking for: robustness, readability, and performance. – shmosel Aug 21 '18 at 23:32
  • @shmosel Many times, readability and performance conflict with each other. Although this question has been "answered" (not really - the answer just exposes research the OP should have done themselves), you can't blame others for questioning his actual problem. EDIT: Just realized how old the comment was. Sorry for the tag notification. – Vince Oct 20 '19 at 00:45

3 Answers

126

Arrays.stream/String.split

Since String.split returns an array String[], I always recommend Arrays.stream as the canonical idiom for streaming over an array.

String input = "dog,cat,bird";
Stream<String> stream = Arrays.stream(input.split(","));
stream.forEach(System.out::println);

Stream.of/String.split

Stream.of is a varargs method which just happens to accept an array: varargs methods are implemented via arrays, and passing an array directly had to remain legal for compatibility when varargs were introduced to Java and existing methods were retrofitted to accept variable arguments.

Stream<String> fromArray = Stream.of(input.split(","));       // works, but is non-idiomatic
Stream<String> fromValues = Stream.of("dog", "cat", "bird");  // intended use case
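One further reason to prefer Arrays.stream as the habit shows up with primitive arrays. The sketch below is an illustration that goes slightly beyond the String[] case discussed here; the class and method names are made up for the example. With an int[] argument, Stream.of treats the whole array as a single element, whereas Arrays.stream has a dedicated int[] overload.

```java
import java.util.Arrays;
import java.util.stream.IntStream;
import java.util.stream.Stream;

public class VarargsPitfall {

    // With an int[] argument, T is inferred as int[], so the stream
    // contains exactly one element: the array itself.
    static long countViaStreamOf(int[] values) {
        Stream<int[]> wrapped = Stream.of(values);
        return wrapped.count();
    }

    // Arrays.stream(int[]) returns an IntStream with one element per array entry.
    static long countViaArraysStream(int[] values) {
        IntStream elements = Arrays.stream(values);
        return elements.count();
    }

    public static void main(String[] args) {
        int[] values = {1, 2, 3};
        System.out.println(countViaStreamOf(values));     // prints 1
        System.out.println(countViaArraysStream(values)); // prints 3
    }
}
```

For a String[] such as the result of String.split, both calls behave the same, so this only matters as an argument for using one idiom consistently.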

Pattern.splitAsStream

Pattern.compile(",").splitAsStream(input) has the advantage of streaming directly rather than creating an intermediate array, so for a large number of substrings it can have a performance benefit. On the other hand, if the delimiter is trivial, i.e. a single literal character, the String.split implementation will go through a fast path instead of using the regex engine. So in this case, it is not obvious which approach is faster.

Stream<String> stream = Pattern.compile(",").splitAsStream(input);

If the splitting happens inside another stream, e.g. .flatMap(Pattern.compile(pattern)::splitAsStream), there is the advantage that the pattern has to be compiled only once, rather than once for every string of the outer stream.

Stream<String> stream = Stream.of("a,b", "c,d,e", "f", "g,h,i,j")
    .flatMap(Pattern.compile(",")::splitAsStream);

This is a property of method references of the form expression::name, which evaluate the expression and capture the result when the instance of the functional interface is created, as explained in "What is the equivalent lambda expression for System.out::println" and "java.lang.NullPointerException is thrown using a method-reference but not a lambda expression".
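The difference can be made visible with a small counting experiment. This is only a sketch: countingCompile and COMPILE_CALLS are hypothetical helpers introduced for the illustration, wrapping Pattern.compile so that each invocation is counted.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.regex.Pattern;
import java.util.stream.Stream;

public class CompileOnce {

    static final AtomicInteger COMPILE_CALLS = new AtomicInteger();

    // Hypothetical helper: delegates to Pattern.compile and counts the calls.
    static Pattern countingCompile(String regex) {
        COMPILE_CALLS.incrementAndGet();
        return Pattern.compile(regex);
    }

    // expression::name form: the expression is evaluated once, when the
    // function object for flatMap is created.
    static int compilesWithMethodReference(String... inputs) {
        COMPILE_CALLS.set(0);
        Stream.of(inputs)
              .flatMap(countingCompile(",")::splitAsStream)
              .forEach(token -> { });
        return COMPILE_CALLS.get();
    }

    // Lambda form: the body, including the compile call, runs once per outer element.
    static int compilesWithLambda(String... inputs) {
        COMPILE_CALLS.set(0);
        Stream.of(inputs)
              .flatMap(s -> countingCompile(",").splitAsStream(s))
              .forEach(token -> { });
        return COMPILE_CALLS.get();
    }

    public static void main(String[] args) {
        String[] inputs = {"a,b", "c,d,e", "f", "g,h,i,j"};
        System.out.println(compilesWithMethodReference(inputs)); // prints 1
        System.out.println(compilesWithLambda(inputs));          // prints 4
    }
}
```

A lambda can get the same effect by compiling the pattern into a local variable or a static final constant first and referring to that inside the lambda body.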

Holger
  • `.flatMap(Pattern.compile(pattern)` why has the pattern to be analyzed only once? Is there caching? If yes, is this written in the docs somewhere? – Roland Sep 25 '17 at 09:16
  • 21
    @Roland: that’s a property of method references of the form `expression::name`, the expression will get evaluated and the result captured by the created function instance. With `Pattern.compile(pattern)::splitAsStream`, the expression is `Pattern.compile(pattern)`. That’s fundamentally different to, e.g. `(string) -> Pattern.compile(pattern).splitAsStream(string)` which would reevaluate the pattern on each function evaluation. See [here](https://stackoverflow.com/q/37413106/2711488) and [here](https://stackoverflow.com/a/28025717/2711488)… – Holger Sep 25 '17 at 09:28
  • 1
    Super nice answer ! While looking under the hood it appears that Pattern#splitAsStream and Pattern#split use CharSequence#subSequence (calling Arrays.copyOfRange) to find each remaining match. Do you know any way to stream tokens of a String in a performance-oriented way ? I would avoid splitting Strings as in my case they are very-very-very big (some megabytes each) and i don't have the possibility to obtain them in another format. – Benj Aug 21 '18 at 14:12
  • 3
    @Benj the problem is that it is even calling `subSequence(…).toString()`, so using a different `CharSequence` implementation (i.e. a non-copying [`CharBuffer.wrap(string)`](https://docs.oracle.com/javase/8/docs/api/?java/nio/CharBuffer.html)) would only defer the copying to the subsequent `toString()` call. But for tokenizing, you’d rather want Java 9’s `Matcher.results()` or `Scanner.findAll(…)` anyway, so consider the backports of [this answer](https://stackoverflow.com/a/37482157/2711488). Since `MatchResult` provides `start()` and `end()`, you can build copy-free operations atop of it. – Holger Aug 21 '18 at 14:32
  • 3
    @Benj you may also consider the second half of [this answer](https://stackoverflow.com/a/48172590/2711488), which contains code examples of a tokenization which creates only a single `String` instance for each unique token, which dramatically reduces time and memory footprint when parsing something with lots of occurrences of common keywords. – Holger Aug 21 '18 at 14:41
  • @Holger thanks for the info, the two answers you pointed out are very instructive and answer my question. – Benj Aug 22 '18 at 08:54
  • 2
    It seems Guava's [Splitter](https://google.github.io/guava/releases/20.0/api/docs/com/google/common/base/Splitter.html) has an interesting way of doing this, i stumbled upon this more or less randomly. – Benj Aug 22 '18 at 12:24
2

Regarding (1) and (2), there shouldn't be much difference, as the code is almost the same.
Regarding (3), that would be much more efficient in terms of memory (not necessarily CPU) but, in my opinion, a bit harder to read.

Jin Kwon
Alexey Soshin
2

Robustness

I can see no difference in the robustness of the three approaches.

Readability

I am not aware of any credible scientific studies on code readability involving experienced Java programmers, so readability is a matter of opinion. Even then, you never know if someone giving their opinion is making an objective distinction between actual readability, what they have been taught about readability, and their own personal taste.

So I will leave it to you to make your own judgements on readability ... noting that you do consider this to be a high priority.

FWIW, the only people whose opinions matter on this are you and your team.

Performance

I think that the answer to that is to carefully benchmark the three alternatives. Holger provides an analysis based on his study of some versions of Java. But:

  1. He was not able to come to a definite conclusion on which was fastest.
  2. Strictly speaking, his analysis only applies to the versions of Java he looked at. (Some aspects of his analysis could be different on (say) Android Java, or some future Oracle / OpenJDK version.)
  3. The relative performance is likely to depend on the length of the string being split, the number of fields, and the complexity of the separator regex.
  4. In a real application, the relative performance may also depend on what you do with the Stream object, what garbage collector you have selected (since the different versions apparently generate different amounts of garbage), and other issues.

So if you (or anyone else) are really concerned with the performance, you should write a micro-benchmark and run it on your production platform(s). Then do some application specific benchmarking. And you should consider looking at solutions that don't involve streams.
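As a starting point only, a crude timing harness might look like the sketch below. It is not a substitute for a proper micro-benchmark tool such as JMH: it uses a naive warm-up loop and System.nanoTime, and the class name, method names, input shape, and iteration counts are all assumptions chosen for illustration. Each variant sums the token lengths so that every token is actually consumed.

```java
import java.util.Arrays;
import java.util.function.ToIntFunction;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import java.util.stream.Stream;

public class SplitTiming {

    private static final Pattern COMMA = Pattern.compile(",");

    // Builds "0,1,2,...,fields-1" as synthetic input (an assumed shape; use real data if possible).
    static String buildInput(int fields) {
        return IntStream.range(0, fields)
                .mapToObj(Integer::toString)
                .collect(Collectors.joining(","));
    }

    static int viaArraysStream(String input) {
        return Arrays.stream(input.split(",")).mapToInt(String::length).sum();
    }

    static int viaStreamOf(String input) {
        return Stream.of(input.split(",")).mapToInt(String::length).sum();
    }

    static int viaSplitAsStream(String input) {
        return COMMA.splitAsStream(input).mapToInt(String::length).sum();
    }

    // Runs the variant 'iterations' times as warm-up, then times the same number
    // of runs and returns the average in nanoseconds. 'iterations' must be > 0.
    static long averageNanos(ToIntFunction<String> variant, String input, int iterations) {
        long sink = 0;
        for (int i = 0; i < iterations; i++) {
            sink += variant.applyAsInt(input);
        }
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            sink += variant.applyAsInt(input);
        }
        long elapsed = System.nanoTime() - start;
        if (sink == Long.MIN_VALUE) {
            System.out.println(sink); // uses the results so the work is not trivially dead code
        }
        return elapsed / iterations;
    }

    public static void main(String[] args) {
        String input = buildInput(10_000);
        System.out.println("Arrays.stream: " + averageNanos(SplitTiming::viaArraysStream, input, 2_000) + " ns");
        System.out.println("Stream.of:     " + averageNanos(SplitTiming::viaStreamOf, input, 2_000) + " ns");
        System.out.println("splitAsStream: " + averageNanos(SplitTiming::viaSplitAsStream, input, 2_000) + " ns");
    }
}
```

The numbers from such a harness are only indicative; they vary with the JVM version, input size, and run order, which is exactly why measuring on the production platform matters.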

Stephen C
  • I agree with your point of view: it is hard to scientifically prove robustness, readability and Java performance. A minor note, though: readability should not only be a personal taste of the *current* team, because future team members will have to read the code, too. – slartidan Oct 21 '19 at 03:56
  • 1
    I tried to make it clear that readability and personal taste are not the same thing. – Stephen C Oct 21 '19 at 12:00