I'm trying to extract emojis and other special characters from Strings for further processing (e.g. a String contains '😅' as one of its characters).

But neither `string.charAt(i)` nor `string.substring(i, i+1)` works for me. The original String is formatted in UTF-8, which means that the escaped form of the above emoji is encoded as '\uD83D\uDE05'. That's why I receive '?' (\uD83D) and '?' (\uDE05) for this position instead, so the emoji occupies two positions when iterating over the String.
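
A minimal sketch that reproduces the behaviour, assuming the String holds only that one emoji. The class name `SurrogateDemo` and the helper `describeChars` are made up for illustration; the `Character` and `String` calls are standard JDK methods.

```java
final class SurrogateDemo {
    // Hypothetical helper: lists every UTF-16 code unit of s in hex and
    // marks whether it is the high or low half of a surrogate pair.
    static String describeChars(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            sb.append(String.format("%04X", (int) c));
            if (Character.isHighSurrogate(c)) {
                sb.append("(high)");
            } else if (Character.isLowSurrogate(c)) {
                sb.append("(low)");
            }
            sb.append(' ');
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String s = "\uD83D\uDE05";                            // one emoji, U+1F605
        System.out.println(s.length());                       // 2 (chars)
        System.out.println(s.codePointCount(0, s.length()));  // 1 (code point)
        System.out.println(describeChars(s));                 // D83D(high) DE05(low)
    }
}
```

Each half on its own is not a valid character, which is presumably why it is displayed as '?'.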

Does anyone have a solution to this problem?

Jongware
Paavo Pohndorff
  • For UTF-16 encoding use `str.getBytes("UTF-16");` – Cyrbil Jun 14 '15 at 18:51
  • You'll need to work with **code points** rather than `char`s. Emojis don't fit into 16-bit `char`s. See [How does Java 16 bit chars support Unicode?](http://stackoverflow.com/questions/1941613/how-does-java-16-bit-chars-support-unicode) and [How can I iterate through the Unicode codepoints of a Java string?](http://stackoverflow.com/questions/1527856/how-can-i-iterate-through-the-unicode-codepoints-of-a-java-string). – John Kugelman Jun 14 '15 at 18:52
  • @cyrbil How does that help? – John Kugelman Jun 14 '15 at 18:57
  • A java.lang.String isn't "formatted in UTF-8". Please state **in code** the format of your data and what you have tried to locate that character. – laune Jun 14 '15 at 19:05
  • so, work with the `"\uD83D\uDE05"` pair, instead of a single `char`. – ZhongYu Jun 14 '15 at 21:58

1 Answer

Thanks to John Kugelman for the help. The solution now looks like this:

for (int codePoint : codePoints(string)) {
    // Convert the code point back into one or two chars for printing.
    char[] chars = Character.toChars(codePoint);
    System.out.println(codePoint + " : " + String.copyValueOf(chars));
}

The `codePoints(String)` helper method looks like this:

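import java.util.Iterator;
import java.util.NoSuchElementException;

// Iterates over the Unicode code points of a String instead of its chars.
private static Iterable<Integer> codePoints(final String string) {
    return new Iterable<Integer>() {
        public Iterator<Integer> iterator() {
            return new Iterator<Integer>() {
                // Index of the next char (UTF-16 code unit) to read.
                int nextIndex = 0;

                public boolean hasNext() {
                    return nextIndex < string.length();
                }

                public Integer next() {
                    if (!hasNext()) {
                        throw new NoSuchElementException();
                    }
                    int result = string.codePointAt(nextIndex);
                    // Advance by 2 for a surrogate pair, otherwise by 1.
                    nextIndex += Character.charCount(result);
                    return result;
                }

                public void remove() {
                    throw new UnsupportedOperationException();
                }
            };
        }
    };
}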
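
If Java 8 or later is available (an assumption, the question doesn't state the version), the built-in `String.codePoints()` stream should do the same job without a hand-written iterator. A minimal sketch; the class name `CodePointSplitter` and the helper `splitByCodePoint` are made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;

final class CodePointSplitter {
    // Hypothetical helper: returns one String per Unicode code point of s,
    // so a surrogate pair stays together in a single element.
    static List<String> splitByCodePoint(String s) {
        List<String> parts = new ArrayList<>();
        s.codePoints().forEach(cp -> parts.add(new String(Character.toChars(cp))));
        return parts;
    }

    public static void main(String[] args) {
        List<String> parts = splitByCodePoint("a\uD83D\uDE05b");
        System.out.println(parts.size());           // 3
        System.out.println(parts.get(1).length());  // 2 (the emoji keeps both chars)
    }
}
```

Note that this only keeps surrogate pairs together. Emojis that are built from several code points (for example with skin tone modifiers or zero-width joiners) would still be split into multiple elements.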
Paavo Pohndorff