Since Java 8, String.chars() returns an IntStream, and the best answer I have found for getting a stream of chars is to cast each element with i -> (char) i. Does anybody know whether this works properly with UTF-16 chars that actually take up 8 bytes?

tumunu
  • 8 bytes? Don't all Unicode characters fit into at most two UTF-16 code units (i.e. 4 bytes)? – Thilo Apr 27 '16 at 01:10

1 Answer

Depending on your definition of properly: No, it does not.

A Java char is a 16-bit UTF-16 code unit. Any code point that does not fit into 16 bits (that is, anything above U+FFFF) is represented as two chars, a so-called "surrogate pair".

The same goes for String#length(). It will return the number of char, so your "long character" will count as two.
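To make the counting difference concrete, here is a minimal sketch. The class name `LengthDemo`, its helper methods, and the sample string are made up for illustration; `String#length()` and `String#codePointCount(int, int)` are the standard library calls.

```java
public class LengthDemo {
    // Number of UTF-16 code units, which is what String#length() reports.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points; a surrogate pair counts once.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+1F600 lies above U+FFFF, so it is stored as the pair \uD83D \uDE00.
        String s = "a\uD83D\uDE00";
        System.out.println(codeUnits(s));   // prints 3
        System.out.println(codePoints(s));  // prints 2
    }
}
```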

An IntStream is returned simply to avoid introducing a separate CharStream class. The values it contains will still lie within the 16-bit char range.

However, there is .codePoints() in addition to chars(), which does return full Unicode code points, one 32-bit int per code point (also as an IntStream).
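A small side-by-side sketch of the two streams may help. The class name `StreamsDemo`, the helper methods, and the sample string are only illustrative; `chars()` and `codePoints()` are the methods `String` inherits from `CharSequence`.

```java
import java.util.Arrays;

public class StreamsDemo {
    // One int per UTF-16 code unit; surrogates are passed through uninterpreted.
    static int[] units(String s) {
        return s.chars().toArray();
    }

    // One int per Unicode code point; a valid surrogate pair is combined.
    static int[] points(String s) {
        return s.codePoints().toArray();
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00"; // 'a' followed by U+1F600
        System.out.println(Arrays.toString(units(s)));  // prints [97, 55357, 56832]
        System.out.println(Arrays.toString(points(s))); // prints [97, 128512]
    }
}
```

Every value from `chars()` fits in a `char`, so the cast `(char) i` loses nothing. The values from `codePoints()` can exceed 0xFFFF, so casting those to `char` would truncate them.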

Community
Thilo
  • Right. But what you're saying is, I have to parse the ints myself, right? – tumunu Apr 27 '16 at 01:05
  • What do you mean by "parse"? – Thilo Apr 27 '16 at 01:07
  • By "parse," I mean, when I look at the next int value in the IntStream, I have to examine the value to see if the int after it is actually part of the same char. – tumunu Apr 27 '16 at 01:14
  • That cannot happen. Each `char` results in one entry of the `IntStream`. You can simply do a `char x = (char) i` to "convert". (Some Unicode characters are represented as two `char`, but that is a different problem). – Thilo Apr 27 '16 at 01:16
  • Are you saying that when I call String.chars(), and one of the chars in the string is actually taking up 8 bytes, java 8 will stuff it into a 32 bit value? – tumunu Apr 27 '16 at 01:18
  • OK I've reread the docs for String.chars (inherited from CharSequence) "Any char which maps to a surrogate code point is passed through uninterpreted." Oh fun. And thanks for your help. – tumunu Apr 27 '16 at 01:23
  • But is there an 8-byte character? Should be at most 4 bytes, at least in UTF-16. – Thilo Apr 27 '16 at 01:45
  • Did you see in @Thilo's answer that you can call .codePoints() instead of chars() which will collapse surrogate pairs into one int? – Hank D Apr 27 '16 at 02:41
  • @tumunu: you can’t “stuff 8 bytes into a 32-bit value”, as 8 bytes are 64 bits. Unicode code points use 21 bits, which would fit even into three bytes, but for processing them, `int`s consisting of *four* bytes are usually used, which you can do in Java using `String.codePoints()`, which you might have overlooked as it is inherited from `CharSequence`. – Holger Apr 27 '16 at 08:51
  • First, an apology: due to some brain freeze I was experiencing yesterday, I kept writing "8 bytes" when I meant "4 bytes". I'm sure that confused you all! The basic question was just that a primitive Java char is 16 bits, but in UTF-16 you can have a 32-bit code point, and how does that all work with String.chars(). Thanks Thilo for your help. It seems that you should only use chars() if you are absolutely sure you won't run into any surrogate pairs, but I'm far too paranoid for such stuff. Seems to me using codePoints() is just all-around safer, if slower. – tumunu Apr 28 '16 at 07:46
  • Totally depends on your use-case (what you are going to do with the characters?). The surrogate pair `char` values you get back may be "ugly", but they are compatible with how Java Strings work throughout the rest of the API. You could construct a `String` from them for example, or send them out to a Writer. – Thilo Apr 28 '16 at 07:51
  • Hi Thilo, and thanks again for all your help. Unfortunately, I don't actually have a "use case," I was merely trying to understand String.chars() better. Based on the discussion and browsing the source, I think it's only there for performance (performance often being the supreme consideration, of course). – tumunu Apr 28 '16 at 21:07
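
As a follow-up to the comment that the surrogate `char` values remain compatible with the rest of the String API, here is a hedged sketch of rebuilding a string from either stream. The class `RoundTripDemo` and its method names are invented for illustration; the `collect` overload and `StringBuilder::appendCodePoint` are standard library calls.

```java
public class RoundTripDemo {
    // Rebuild from code units: each int from chars() is cast back to a char.
    static String fromChars(String s) {
        return s.chars()
                .collect(StringBuilder::new,
                         (sb, i) -> sb.append((char) i),
                         StringBuilder::append)
                .toString();
    }

    // Rebuild from code points: appendCodePoint re-creates surrogate pairs as needed.
    static String fromCodePoints(String s) {
        return s.codePoints()
                .collect(StringBuilder::new,
                         StringBuilder::appendCodePoint,
                         StringBuilder::append)
                .toString();
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00";
        System.out.println(fromChars(s).equals(s));      // prints true
        System.out.println(fromCodePoints(s).equals(s)); // prints true
    }
}
```

Both routes reproduce the original string. The difference only matters when each element is inspected or transformed on its own, since a lone surrogate from `chars()` is not a complete character.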