16

I have a string variable that might contain any Unicode character. One of these Unicode characters is the han .

The thing is that this "han" character has "".length() == 2 but is written in the string as a single character.

Considering the code below, how would I iterate over all characters and compare each one while considering the fact it might contain one character with length greater than 1?

for ( int i = 0; i < string.length(); i++ ) {
    char character = string.charAt( i );
    if ( character == '' ) {
        // Fail, it interprets as 2 chars =/
    }
}

EDIT:
This question is not a duplicate. It asks how to iterate over each character of a String while accounting for characters whose .length() > 1 (a character not as a char type, but as the representation of a written symbol). It does not require prior knowledge of how to iterate over the Unicode code points of a Java String, although an answer mentioning that may also be correct.

Fagner Brack

4 Answers

11
int hanCodePoint = "".codePointAt(0);
for (int i = 0; i < string.length();) {
    int currentCodePoint = string.codePointAt(i);
    if (currentCodePoint == hanCodePoint) {
        // do something here.
    }
    i += Character.charCount(currentCodePoint);
}
sstan
  • No way to compare with single quotes `''`? – Fagner Brack Jun 07 '15 at 04:10
  • 2
    Unfortunately, no. `` is a valid Unicode character, but it is not expressible as a single Java `char`, which is what you would need to be able to put it in single quotes. If you try, you will notice that it won't even compile. A Java `char` can only represent Unicode characters up to code point 65,535. Past that, you need 2 surrogate `char`s to represent the character, or you can simply use a `String`. Very annoying, I agree. – sstan Jun 07 '15 at 04:15
9

The String.charAt and String.length methods treat a String as a sequence of UTF-16 code units. You want to treat the string as Unicode code-points.

Look at the "code point" methods in the String API:

  • codePointAt(int index) returns the (32 bit) code point at a given code-unit index
  • offsetByCodePoints(int index, int codePointOffset) returns the code-unit index corresponding to codePointOffset code-points from the code-unit at index.
  • codePointCount(int beginIndex, int endIndex) counts the code-points between two code-unit indexes.
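As a rough illustration of the three methods listed above, here is a minimal sketch. The class name, the helper `codePointAtCodePointIndex`, and the sample character are assumptions made for this example: U+20BB7 is used as a stand-in supplementary-plane Han character and is written as a surrogate-pair escape so the source file encoding does not matter.

```java
public class CodePointMethodsDemo {
    // Assumed stand-in character: U+20BB7, which needs two UTF-16 code units.
    static final String HAN = "\uD842\uDFB7";

    /** Returns the code point at a code-point index (not a code-unit index). */
    static int codePointAtCodePointIndex(String s, int codePointIndex) {
        // Translate the code-point index into a code-unit index first.
        int unitIndex = s.offsetByCodePoints(0, codePointIndex);
        return s.codePointAt(unitIndex);
    }

    public static void main(String[] args) {
        String s = "a" + HAN + "b";
        System.out.println(s.length());                             // 4 code units
        System.out.println(s.codePointCount(0, s.length()));        // 3 code points
        System.out.println(Integer.toHexString(s.codePointAt(1)));  // 20bb7
        System.out.println((char) codePointAtCodePointIndex(s, 2)); // b
    }
}
```

Note that `offsetByCodePoints` has to walk the string from the starting index, so calling the helper repeatedly inside a loop over a long string would likely be slow; a single forward pass is usually the better choice for iteration.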

Indexing the string by code point index is a bit tricky, especially if the string is long and you want to do it efficiently. However, it is doable, although the code is rather cumbersome.

@sstan's answer is one solution.

Stephen C
3

This will be simpler if you treat both the string and the data you're searching for as Strings. If you just need to test for the presence of that character:

if (string.contains("")) {
    // do something here.
}

If you specifically need the index where that character appears:

int i = string.indexOf("");
if (i >= 0) {
    // do something with i here.
}

And if you really need to iterate through every code point, see How can I iterate through the unicode codepoints of a Java String? .
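For reference, one possible way to visit every code point on Java 8 or later is the `codePoints()` stream. This is only a sketch: the class name, the method `countOccurrences`, and the sample character (U+20BB7, written as a surrogate-pair escape) are assumptions made for illustration and are not taken from the linked question.

```java
public class CodePointStreamDemo {
    /**
     * Counts how often the first code point of {@code target} occurs in
     * {@code text}, treating surrogate pairs as single characters.
     */
    static long countOccurrences(String text, String target) {
        int targetCodePoint = target.codePointAt(0);
        return text.codePoints()                        // IntStream of code points
                   .filter(cp -> cp == targetCodePoint) // compare whole code points
                   .count();
    }

    public static void main(String[] args) {
        String han = "\uD842\uDFB7"; // assumed stand-in: U+20BB7
        String text = "a" + han + "b" + han;
        System.out.println(countOccurrences(text, han)); // 2
        System.out.println(countOccurrences(text, "a")); // 1
    }
}
```

Because the comparison is done on `int` code points, the same loop handles characters inside and outside the Basic Multilingual Plane without a special case.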

Joe
  • What is the cost of time by using `.contains` or `.indexOf` for all characters I am testing? I am looking for a more generic approach instead of using `.contains` or `.indexOf` only for characters with `length > 1`. – Fagner Brack Jun 07 '15 at 15:36
  • This answer seems closer to the question than iterating over Unicode code points, although it sacrifices some performance. – Fagner Brack Jun 07 '15 at 15:47
-4

An ASCII character takes half the space a Unicode char does, so it's logical that the han character is of length 2. It is neither an ASCII char nor a Unicode letter. If it were the latter, the letter would be displayed correctly.

user9138
  • An ASCII character in Unicode is the same size as it is in ASCII. What you're more referring to are multi-byte Unicode characters. – Makoto Jun 07 '15 at 03:53