In Java 8, there is a new method String.chars() which returns a stream of ints (IntStream) that represent the character codes. I guess many people would expect a stream of chars here instead. What was the motivation to design the API this way?

Arend v. Reinersdorff
Adam Dyga
  • There are only 3 types of primitive streams: IntStream, LongStream and DoubleStream. CharStream doesn't exist. – JB Nizet Mar 16 '14 at 10:47
  • @JBNizet Perhaps OP meant `Stream<Character>`? – Rohit Jain Mar 16 '14 at 10:48
  • @RohitJain I didn't mean any particular stream. If `CharStream` doesn't exist, what would be the problem with adding it? – Adam Dyga Mar 16 '14 at 10:51
  • @AdamDyga: The designers explicitly chose to avoid the explosion of classes and methods by limiting the primitive streams to 3 types, since the other types (char, short, float) can be represented by their larger equivalent (int, double) without any significant performance penalty. – JB Nizet Mar 16 '14 at 10:56
  • @JBNizet I get it. But it still feels like a dirty solution just for the sake of saving a couple of new classes. – Adam Dyga Mar 16 '14 at 11:08
  • @JB Nizet: To me it looks like we already *have* an explosion of interfaces given all stream overloading as well as [all function interfaces](http://download.java.net/jdk8/docs/api/java/util/function/package-summary.html)… – Holger Mar 18 '14 at 09:06
  • Yes, there already is an explosion, even with only three primitive stream specializations. What would it be if all eight primitives had stream specializations? A cataclysm? :-) – Stuart Marks Mar 19 '14 at 21:25
  • Tangential, but I'd encourage people to prefer [`String.codePoints()`](https://docs.oracle.com/javase/8/docs/api/java/lang/CharSequence.html#codePoints--) over `.chars()`. The latter doesn't handle all Unicode characters the way you'd expect (it splits [surrogate pairs](https://stackoverflow.com/q/5903008/113632)). Unless you know for certain your string will never contain high-code-point characters, you should avoid `.chars()`. – dimo414 Jun 25 '17 at 20:05
  • If you're looking for it, this method is actually declared in **CharSequence**. See [Javadoc](https://docs.oracle.com/javase/8/docs/api/java/lang/CharSequence.html#chars--) – Guillaume Husta Apr 11 '18 at 08:49

2 Answers

As others have already mentioned, the design decision behind this was to prevent the explosion of methods and classes.

Still, personally I think this was a very bad decision. Given that they did not want to add a CharStream, which is reasonable, there should have been different methods instead of chars(). I would think of:

  • Stream<Character> chars(), which gives a stream of boxed characters and carries a light performance penalty.
  • IntStream unboxedChars(), which would be used for performance-critical code.

However, instead of focusing on why it is done this way currently, I think this answer should focus on showing a way to do it with the API that we have gotten with Java 8.

In Java 7 I would have done it like this:

for (int i = 0; i < hello.length(); i++) {
    System.out.println(hello.charAt(i));
}

And I think a reasonable method to do it in Java 8 is the following:

hello.chars()
        .mapToObj(i -> (char)i)
        .forEach(System.out::println);

Here I obtain an IntStream and map it to an object via the lambda i -> (char)i. The char is automatically boxed, giving a Stream<Character>, and then we can do what we want and, as a plus, still use method references.

Be aware, though, that you must use mapToObj. If you forget and use map, nothing will complain, but you will still end up with an IntStream, and you might be left wondering why it prints the integer values instead of the characters.
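The difference can be seen in a small sketch. The class and method names here are made up for illustration; both methods collect into a list so the element types are visible.

```java
import java.util.List;
import java.util.stream.Collectors;

class MapVersusMapToObj {
    // map keeps the stream an IntStream: the char produced by the cast
    // is widened straight back to int.
    static List<Integer> withMap(String s) {
        return s.chars()
                .map(i -> (char) i)
                .boxed()
                .collect(Collectors.toList());
    }

    // mapToObj boxes each value into a Character, giving a Stream<Character>.
    static List<Character> withMapToObj(String s) {
        return s.chars()
                .mapToObj(i -> (char) i)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(withMap("hi"));      // [104, 105]
        System.out.println(withMapToObj("hi")); // [h, i]
    }
}
```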

Other ugly alternatives for Java 8:

If you stay in the IntStream and only want to print the characters, you can no longer use a method reference for printing:

hello.chars()
        .forEach(i -> System.out.println((char)i));

Moreover, using a method reference to your own method does not work anymore! Consider the following:

private void print(char c) {
    System.out.println(c);
}

and then

hello.chars()
        .forEach(this::print);

This gives a compile error, because converting the int supplied by the IntStream to the char parameter is a possibly lossy conversion.
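One way around the compile error is to declare the parameter as int, which is what IntStream.forEach passes, and to do the narrowing cast inside the method. This is a sketch assuming you are free to change the helper's signature. The class name and the StringBuilder field are illustrative only; the field exists so that the result can be inspected.

```java
class IntParameterMethodReference {
    private final StringBuilder out = new StringBuilder();

    // Takes an int, matching IntConsumer.accept(int); the cast to char
    // happens here instead of at the call site.
    private void print(int ch) {
        out.append((char) ch).append('\n');
    }

    String printAll(String s) {
        s.chars().forEach(this::print); // compiles: int parameter, no narrowing needed
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(new IntParameterMethodReference().printAll("Hello"));
    }
}
```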

Conclusion:

The API was designed this way because the designers did not want to add CharStream. I personally think the method should return a Stream<Character>. The current workaround is to use mapToObj(i -> (char)i) on the IntStream to be able to work properly with the characters.

skiwi
  • My conclusion: this part of the API is broken by design. But thanks for the extensive answer. – Adam Dyga Mar 16 '14 at 16:39
  • +1, but my proposal is to use `codePoints()` instead of `chars()`. You will find a lot of library functions that already accept an `int` code point in addition to a `char`, e.g. all methods of `java.lang.Character` as well as `StringBuilder.appendCodePoint`, etc. This support has existed since `jdk1.5`. – Holger Mar 18 '14 at 09:01
  • Good point about code points. Using them will handle supplementary characters, which are represented as surrogate pairs in a `String` or `char[]`. I'd bet that most `char` processing code mishandles surrogate pairs. – Stuart Marks Mar 19 '14 at 06:02
  • @skiwi, define `void print(int ch) { System.out.println((char)ch); }` and then you can use method references. – Stuart Marks Mar 19 '14 at 06:24
  • See my answer for why `Stream<Character>` was rejected. – Stuart Marks Mar 19 '14 at 06:25
  • But why is the second approach 'ugly'? – iAmLearning Sep 11 '19 at 05:58
  • One alternative is `Stream.of(hello.split("")).forEach(System.out::println)` – cxs1031 Jul 27 '20 at 06:08

The answer from skiwi covered many of the major points already. I'll fill in a bit more background.

The design of any API is a series of tradeoffs. In Java, one of the difficult issues is dealing with design decisions that were made long ago.

Primitives have been in Java since 1.0. They make Java an "impure" object-oriented language, since the primitives are not objects. The addition of primitives was, I believe, a pragmatic decision to improve performance at the expense of object-oriented purity.

This is a tradeoff we're still living with today, nearly 20 years later. The autoboxing feature added in Java 5 mostly eliminated the need to clutter source code with boxing and unboxing method calls, but the overhead is still there. In many cases it's not noticeable. However, if you were to perform boxing or unboxing within an inner loop, you'd see that it can impose significant CPU and garbage collection overhead.

When designing the Streams API, it was clear that we had to support primitives. The boxing/unboxing overhead would kill any performance benefit from parallelism. We didn't want to support all of the primitives, though, since that would have added a huge amount of clutter to the API. (Can you really see a use for a ShortStream?) "All" or "none" are comfortable places for a design to be, yet neither was acceptable. So we had to find a reasonable value of "some". We ended up with primitive specializations for int, long, and double. (Personally I would have left out int but that's just me.)

For CharSequence.chars() we considered returning Stream<Character> (an early prototype might have implemented this) but it was rejected because of boxing overhead. Considering that a String has char values as primitives, it would seem to be a mistake to impose boxing unconditionally when the caller would probably just do a bit of processing on the value and unbox it right back into a string.
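As a rough illustration of that "process a bit and put it back into a string" pattern, the sketch below upper-cases a string entirely on an IntStream, so no Character objects are created along the way. The class and method names are hypothetical. The three-argument collect is used because IntStream has no collect overload that takes a Collector.

```java
class UnboxedRoundTrip {
    // Upper-cases each UTF-16 code unit and gathers the result in a
    // StringBuilder; values stay primitive ints throughout the pipeline.
    static String toUpper(String s) {
        return s.chars()
                .map(Character::toUpperCase)                 // resolves to toUpperCase(int)
                .collect(StringBuilder::new,
                         (sb, c) -> sb.append((char) c),     // cast back to char when appending
                         StringBuilder::append)
                .toString();
    }

    public static void main(String[] args) {
        System.out.println(toUpper("hello")); // HELLO
    }
}
```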

We also considered a CharStream primitive specialization, but its use would seem to be quite narrow compared to the amount of bulk it would add to the API. It didn't seem worthwhile to add it.

The penalty this imposes on callers is that they have to know that the IntStream contains char values represented as ints and that casting must be done at the proper place. This is doubly confusing because there are overloaded API calls like PrintStream.print(char) and PrintStream.print(int) that differ markedly in their behavior. An additional point of confusion possibly arises because the codePoints() call also returns an IntStream but the values it contains are quite different.
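Both sources of confusion can be sketched in a few lines. The class and helper names are made up for illustration, and U+1F600 is just an arbitrary supplementary character, which a String stores as a surrogate pair.

```java
import java.util.Arrays;

class CharsVersusCodePoints {
    // UTF-16 code units: a supplementary character shows up as two values.
    static int[] codeUnits(String s) {
        return s.chars().toArray();
    }

    // Unicode code points: a surrogate pair is combined into one value.
    static int[] codePoints(String s) {
        return s.codePoints().toArray();
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00"; // 'a' followed by U+1F600

        System.out.println(Arrays.toString(codeUnits(s)));  // [97, 55357, 56832]
        System.out.println(Arrays.toString(codePoints(s))); // [97, 128512]

        // Overload resolution: the same value prints differently as int and as char.
        int first = s.chars().findFirst().getAsInt();
        System.out.println(first);        // println(int)  prints 97
        System.out.println((char) first); // println(char) prints a
    }
}
```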

So, this boils down to choosing pragmatically among several alternatives:

  1. We could provide no primitive specializations, resulting in a simple, elegant, consistent API, but which imposes a high performance and GC overhead;

  2. we could provide a complete set of primitive specializations, at the cost of cluttering up the API and imposing a maintenance burden on JDK developers; or

  3. we could provide a subset of primitive specializations, giving a moderately sized, high performing API that imposes a relatively small burden on callers in a fairly narrow range of use cases (char processing).

We chose the last one.

Stuart Marks
  • Nice answer! However, it doesn't answer why there cannot be two different methods for `chars()`, one that returns a `Stream<Character>` (with a small performance penalty) and another that returns an `IntStream`. Was this also considered? It is quite likely that people will end up mapping it to a `Stream<Character>` anyway if they think the convenience is worth the performance penalty. – skiwi Mar 19 '14 at 08:03
  • Minimalism comes in here. If there's already a `chars()` method that returns the char values in an `IntStream`, it doesn't add much to have another API call that gets the same values but in boxed form. The caller can box the values without much trouble. Sure, it would be more convenient not to have to do this in this (probably rare) case, but at the cost of adding clutter to the API. – Stuart Marks Mar 19 '14 at 21:30
  • Thanks to a duplicate question I noticed this one. I agree that `chars()` returning `IntStream` is not a big problem, especially given that this method is rarely used at all. However, it would be good to have a built-in way to convert an `IntStream` back to a `String`. It can be done with `.collect(StringBuilder::new, (sb, c) -> sb.append((char)c), StringBuilder::append).toString()`, but it's really long. – Tagir Valeev Jun 30 '15 at 02:04
  • @TagirValeev Yes it is somewhat cumbersome. With a stream of code points (an IntStream) it isn't too bad: `collect(StringBuilder::new, StringBuilder::appendCodePoint, StringBuilder::append).toString()`. I guess it's not really shorter, but using code points avoids the `(char)` casts and allows the use of method references. Plus it handles surrogates properly. – Stuart Marks Jun 30 '15 at 05:26
  • You could use `.collect(Collectors.joining())`. – Ilya Bystrov Mar 01 '16 at 14:10
  • @IlyaBystrov Unfortunately the primitive streams such as `IntStream` don't have a `collect()` method that takes a `Collector`. They have only a three-arg `collect()` method as mentioned in previous comments. – Stuart Marks Mar 01 '16 at 17:19
  • @StuartMarks, Re "..I would have left out int..", you kidding? `int` is the only one that is guaranteed to be atomic, not long and double. – Pacerier Sep 20 '17 at 02:29
  • @Pacerier Kidding, but only slightly. The non-atomicity of long and double isn't really relevant in streams, as none of these values appear in fields that are shared across threads. From a pure API perspective, there's nothing that an int can do that a long can't also do, so in that sense int is redundant. I suspect the real reason for having `IntStream` is performance, since it has to move only half the data a `LongStream` does. I haven't measured this, though. – Stuart Marks Sep 20 '17 at 21:11
  • Very unfortunate, and it will cost thousands of hours of developer time, because natural constructs like `Arrays.stream(char[])` don't work and developers will be Googling for an alternative. At the very least, `Arrays.stream(char/byte/short[])` should exist and return `IntStream`. – kevin cline Oct 03 '18 at 19:07