114

So I know about String#codePointAt(int), but it's indexed by the char offset, not by the codepoint offset.

I'm thinking about trying something like:

But my concerns are

  • I'm not sure whether codepoints which are naturally in the high-surrogates range will be stored as two char values or one
  • this seems like an awfully expensive way to iterate through characters
  • someone must have come up with something better.
Florian Weimer
rampion

4 Answers

158

Yes, Java uses a UTF-16-style encoding for the internal representation of Strings, and, yes, it encodes characters outside the Basic Multilingual Plane (BMP) as surrogate pairs.
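To make the surrogate scheme concrete, here is a minimal sketch. The class name `SurrogateDemo` and the sample string are assumptions made for illustration; U+1D53C (MATHEMATICAL DOUBLE-STRUCK CAPITAL E) is simply one character that lies outside the BMP.

```java
// Illustrative sketch only; SurrogateDemo and SAMPLE are hypothetical names.
final class SurrogateDemo {
    // U+1D53C is outside the BMP, so it is stored as the surrogate pair D835 DD3C.
    static final String SAMPLE = "a\uD835\uDD3Cb";

    // Number of UTF-16 code units (char values) in the string.
    static int charLength(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        System.out.println(charLength(SAMPLE));      // 4 char values
        System.out.println(codePointLength(SAMPLE)); // 3 code points
        System.out.println(Character.isHighSurrogate(SAMPLE.charAt(1)));
        System.out.println(Character.isLowSurrogate(SAMPLE.charAt(2)));
    }
}
```

So a code point in the supplementary range always occupies two char values, which is why indexing by char offset and indexing by code point offset differ.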

If you know you'll be dealing with characters outside the BMP, then here is the canonical way to iterate over the characters of a Java String:

final int length = s.length();
for (int offset = 0; offset < length; ) {
   final int codepoint = s.codePointAt(offset);

   // do something with the codepoint

   offset += Character.charCount(codepoint);
}
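For anyone who wants to try the loop, here is a self-contained sketch that wraps it in a method. The class name `CodePointLoop`, the method `toCodePoints`, and the sample string are assumptions for illustration, not part of any library.

```java
// Illustrative sketch only; CodePointLoop and toCodePoints are hypothetical names.
final class CodePointLoop {
    // Walks the string by char offset, advancing one or two chars per code point.
    static int[] toCodePoints(String s) {
        final int[] out = new int[s.codePointCount(0, s.length())];
        int n = 0;
        final int length = s.length();
        for (int offset = 0; offset < length; ) {
            final int codepoint = s.codePointAt(offset);
            out[n++] = codepoint;
            offset += Character.charCount(codepoint);
        }
        return out;
    }

    public static void main(String[] args) {
        // "a", U+1D53C (stored as a surrogate pair), "b"
        for (int cp : toCodePoints("a\uD835\uDD3Cb")) {
            System.out.printf("U+%04X%n", cp);
        }
    }
}
```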
kmort
Jonathan Feinberg
  • 2
    As for whether or not it's "expensive", well... there is no other way built into Java. But if you're dealing only with Latin/European/Cyrillic/Greek/Hebrew/Arabic scripts, then you can just use s.charAt() to your heart's content. :) – Jonathan Feinberg Oct 06 '09 at 20:25
  • 25
    But you shouldn't. For instance if your program outputs XML and if someone gives it some obscure mathematical operator, suddenly your XML may be invalid. – Mechanical snail Jul 15 '12 at 01:18
  • @Jonathan Feinberg That's what I thought. But then came that special mathematical E. UTF-16 works 99% of the time, but then it gets really painful, especially when the problems stay hidden for a long time. – Martin Feb 09 '14 at 13:12
  • 2
    I would have used `offset = s.offsetByCodePoints(offset, 1);`. Is there some benefit in using `offset += Character.charCount(codepoint);` instead? – Paul Groke Jan 12 '15 at 21:35
  • 4
    @PaulGroke Yes there is. The function `offsetByCodePoints` (it redirects to `Character.offsetByCodePoints`) is like 50 lines long with loops and stuff, meanwhile `charCount` is just a one liner with a numeric `if`, so I guess there is a lot of performance loss. – Sipka Aug 30 '15 at 20:20
  • 3
    @Mechanicalsnail I don't understand your comment. Why would outputting XML cause this answer to misbehave? – Gili Sep 22 '15 at 18:41
  • 5
    @Gili the answer is fine. He was referring to @Jonathan Feinberg's comment in which he advocates for using `charAt()` which is a bad idea – RecursiveExceptionException Feb 17 '18 at 18:28
  • 1
    "If you know you'll be dealing with characters outside the BMP" this is a bad omen. – lmat - Reinstate Monica Oct 28 '18 at 11:22
  • 2
    Small modification to make it more `continue`-friendly: `final int length = s.length(); for (int codepoint, offset = 0; offset < length; offset += Character.charCount(codepoint)) { codepoint = s.codePointAt(offset); // do something with the codepoint }` – imgx64 Apr 09 '19 at 06:03
  • The proposed approach breaks the rule of not changing the value of the loop counter from within the body of the loop itself. – user07 May 14 '20 at 18:28
  • 1
    @user07 What rule? – Jonathan Feinberg Sep 22 '20 at 16:05
  • @JonathanFeinberg I'm referring to SonarQube rule https://rules.sonarsource.com/java/RSPEC-1994, labeled as critical. It is raised when the counter is incremented in the body of the loop instead of in its dedicated increment clause. – user07 Sep 28 '20 at 02:25
  • @user07 That's a silly rule of thumb that only makes sense for very basic loops. This is a more complex case which justifies an exception. – Gili Nov 17 '22 at 04:15
  • 1
    As a supplement, `String rune = new String(new int[]{codepoint}, 0, 1);` can be used to turn a codepoint into a readable String holding just that character (one or two char values in UTF-16) – Liu Hao Mar 02 '23 at 12:21
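The conversion mentioned in the last comment can be sketched as follows, next to an equivalent built on `Character.toChars`. The class and method names here are assumptions for illustration. Note that the resulting String has length 2 when the code point lies outside the BMP.

```java
// Illustrative sketch only; CodePointToString and its methods are hypothetical names.
final class CodePointToString {
    // Uses the String(int[] codePoints, int offset, int count) constructor.
    static String viaIntArray(int codePoint) {
        return new String(new int[] { codePoint }, 0, 1);
    }

    // Character.toChars returns one char for BMP code points, two otherwise.
    static String viaToChars(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        final String rune = viaIntArray(0x1D53C);
        System.out.println(rune.length());                        // 2 char values
        System.out.println(rune.equals(viaToChars(0x1D53C)));     // true
    }
}
```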
87

Java 8 added CharSequence#codePoints which returns an IntStream containing the code points. You can use the stream directly to iterate over them:

string.codePoints().forEach(c -> ...);

or with a for loop by collecting the stream into an array:

for(int c : string.codePoints().toArray()){
    ...
}

These approaches are probably more expensive than Jonathan Feinberg's solution, but they are faster to read and write, and the performance difference will usually be insignificant.
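As a slightly fuller sketch of the stream approach (Java 8 or later), the following formats each code point in U+XXXX notation. The class name `CodePointStreamDemo` and the method `hexCodePoints` are assumptions for illustration.

```java
import java.util.List;
import java.util.stream.Collectors;

// Illustrative sketch only; CodePointStreamDemo and hexCodePoints are hypothetical names.
final class CodePointStreamDemo {
    // Maps each code point of the string to its "U+XXXX" notation.
    static List<String> hexCodePoints(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // "a", U+1D53C (stored as a surrogate pair), "b"
        hexCodePoints("a\uD835\uDD3Cb").forEach(System.out::println);
    }
}
```

The stream yields one int per code point, so the surrogate pair arrives as a single value rather than as two chars.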

Alex - GlassEditor.com
  • 3
    `for (int c : (Iterable<Integer>) () -> string.codePoints().iterator())` also works. –  Jul 12 '17 at 23:13
  • 2
    Slightly shorter version of @saka1029's code: `for (int c : (Iterable<Integer>) string.codePoints()::iterator) ...` – Lii Mar 24 '18 at 09:40
9

Thought I'd add a workaround method that works with foreach loops, and that you can easily replace with Java 8's new String#codePoints method once you move to Java 8:

You can use it with foreach like this:

 for(int codePoint : codePoints(myString)) {
   ....
 }

Here's the method:

public static Iterable<Integer> codePoints(final String string) {
  return new Iterable<Integer>() {
    public Iterator<Integer> iterator() {
      return new Iterator<Integer>() {
        int nextIndex = 0;
        public boolean hasNext() {
          return nextIndex < string.length();
        }
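        public Integer next() {
          // Honor the Iterator contract when the string is exhausted.
          if (!hasNext()) {
            throw new java.util.NoSuchElementException();
          }
          int result = string.codePointAt(nextIndex);
          nextIndex += Character.charCount(result);
          return result;
        }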
        public void remove() {
          throw new UnsupportedOperationException();
        }
      };
    }
  };
}

Or alternatively, if you just want to convert a string to a list of codepoints (if your code could use a codepoint list more easily; this might use more RAM than the above approach):

 public static List<Integer> stringToCodePoints(String in) {
    if( in == null)
      throw new NullPointerException("got null");
    List<Integer> out = new ArrayList<Integer>();
    final int length = in.length();
    for (int offset = 0; offset < length; ) {
      final int codepoint = in.codePointAt(offset);
      out.add(codepoint);
      offset += Character.charCount(codepoint);
    }
    return out;
  }

Thankfully, this uses "codePointAt", which safely handles the surrogate pairs of UTF-16 (Java's internal string representation).

rogerdpack
6

Iterating over code points is filed as a feature request at Sun.

See Bug Report

There is also an example on how to iterate over String CodePoints there.

Gili
Alexander Egger
  • 7
    Java 8 now has a codePoints() method built in to String: http://docs.oracle.com/javase/8/docs/api/java/lang/CharSequence.html#codePoints – Dov Wasserman Apr 18 '14 at 17:13