I have a string that I would like to iterate over to extract its "characters". However, the string is in Japanese, and some of its "characters" occupy two `char` values instead of one.

For example, "" is a string that has length 4: each of its Unicode characters spans 2 chars.

How can I extract each substring that represents one such "character" from this string? `String.charAt(int i)` does not work in this case.

waylonion
  • `Character.codePointAt()` – Aniket Sahrawat Jan 16 '18 at 19:28
  • Possible duplicate of [Java charAt used with characters that have two code units](https://stackoverflow.com/questions/14150530/java-charat-used-with-characters-that-have-two-code-units) – Eli Sadoff Jan 16 '18 at 19:29
  • @AniketSahrawat: `String` also has a `codePointAt()` method. Also look at `String.offsetByCodePoints()`. – Remy Lebeau Jan 16 '18 at 23:40
  • The distinction between UTF-16 code unit and [Unicode](http://www.unicode.org/charts/nameslist/index.html) codepoint isn't just relevant to certain script blocks, but also to modern text in general. For example, music, mathematics, symbols, emoticons, …, …, … . – Tom Blodget Jan 18 '18 at 00:42
  • Possible duplicate of [How to correctly compute the length of a String in Java?](https://stackoverflow.com/questions/6828076/how-to-correctly-compute-the-length-of-a-string-in-java) – Tom Blodget Jan 18 '18 at 00:43
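
The comments above point at code points rather than `char` values. A minimal sketch of that idea follows, assuming that "one character" here means one Unicode code point. The class name `CodePointSplitter`, the method `splitByCodePoint`, and the sample string are made up for illustration; the sample is not the string from the question.

```java
import java.util.ArrayList;
import java.util.List;

class CodePointSplitter {

    // Splits a string into one substring per Unicode code point,
    // so a surrogate pair (two chars) stays together as one element.
    static List<String> splitByCodePoint(String s) {
        List<String> parts = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            int codePoint = s.codePointAt(i);            // reads one or two chars
            int width = Character.charCount(codePoint);  // 1 for BMP, 2 for supplementary
            parts.add(s.substring(i, i + width));
            i += width;
        }
        return parts;
    }

    public static void main(String[] args) {
        // Hypothetical sample: U+20BB7 twice, written as surrogate-pair escapes.
        String sample = "\uD842\uDFB7\uD842\uDFB7";
        System.out.println(sample.length());                           // 4
        System.out.println(sample.codePointCount(0, sample.length())); // 2
        System.out.println(splitByCodePoint(sample).size());           // 2
    }
}
```

Note that a code point is still not always what a reader perceives as a single character (a base letter plus a combining mark is two code points). If that matters, `java.text.BreakIterator.getCharacterInstance()` may be a better fit than stepping by code point.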

1 Answer

The issue is that your locale is not configured for the character set you are using. See the Oracle tutorial on creating a locale for how to set the correct locale, with examples: https://docs.oracle.com/javase/tutorial/i18n/locale/create.html

Mike Murphy