Removing supplementary characters from a Java string

Question

I have a Java string that contains supplementary characters (characters in the Unicode standard whose code points are above U+FFFF). These characters could for example be emojis. I want to remove those characters from the string, i.e. replace them with the empty string "".

How do I remove supplementary characters from a string?
How do I remove characters from an arbitrary code point range (for example, all characters in the range 1F000–1FFFF)?

score 4 · Answer 1 · answered Nov 13 '17 at 15:55

There are a couple of approaches. Since a regex replace is comparatively expensive, a plain loop that skips every surrogate code unit may be preferable:

String basic(String s) {
    StringBuilder sb = new StringBuilder();
    for (char ch : s.toCharArray()) {
        if (!Character.isLowSurrogate(ch) && !Character.isHighSurrogate(ch)) {
            sb.append(ch);
        }
    }
    return sb.length() == s.length() ? s : sb.toString();
}

Joop Eggen

Good solution! [The docs say:](http://www.oracle.com/us/technologies/java/supplementary-142654.html) "Supplementary characters are encoded in two code units, the first from the high-surrogates range (U+D800 to U+DBFF), the second from the low-surrogates range (U+DC00 to U+DFFF)." – matthiash Nov 14 '17 at 08:36
BTW, what would be the regex solution? – matthiash Nov 14 '17 at 08:37
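@matthiash With regex there are two Unicode flags, `(?u)` and `(?U)`, but neither is needed here: Java regexes match by code point, so `replaceAll("[^\\u0000-\\uFFFF]", "")` should do. (Listing the surrogate ranges in one class, as in `[\uD800-\uDBFF\uDC00-\uDFFF]`, is risky because the adjacent `\uDBFF\uDC00` can be read as a single supplementary code point.) And in Java 8, `s.codePoints().filter(cp -> cp < 0x10000).toArray()` can be used to create a new String. – Joop Eggen Nov 14 '17 at 08:50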
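
A minimal sketch of the two variants mentioned in this comment, for comparison. The class and method names are hypothetical, and the regex used is the negated BMP range `[^\u0000-\uFFFF]`, which relies on `java.util.regex` matching by code point; treat it as an illustration under those assumptions rather than the commenter's exact code.

```java
import java.util.regex.Pattern;

final class SupplementaryFilters {

    // Matches any code point outside the Basic Multilingual Plane.
    private static final Pattern SUPPLEMENTARY = Pattern.compile("[^\\u0000-\\uFFFF]");

    // Regex variant: a surrogate pair is matched as one code point
    // and removed as a unit.
    static String stripWithRegex(String s) {
        return SUPPLEMENTARY.matcher(s).replaceAll("");
    }

    // Stream variant (Java 8+): keep code points below U+10000 and
    // rebuild the string from the remaining int[].
    static String stripWithCodePoints(String s) {
        int[] kept = s.codePoints().filter(cp -> cp < 0x10000).toArray();
        return new String(kept, 0, kept.length);
    }

    public static void main(String[] args) {
        String input = "a\uD83D\uDE00b"; // 'a', U+1F600, 'b'
        System.out.println(stripWithRegex(input));      // prints "ab"
        System.out.println(stripWithCodePoints(input)); // prints "ab"
    }
}
```

Precompiling the pattern avoids recompiling the regex on every call, which is most of the cost that `String.replaceAll` adds over a hand-written loop.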

score 0 · Answer 2 · answered Nov 13 '17 at 15:26

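You can get a character's Unicode value by simply converting it to an int. Note that a Java char is a UTF-16 code unit, so for a supplementary character this yields a surrogate value rather than the code point itself.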

Therefore, you'll want to do the following:

Convert your String to a char[], or loop over the String and read each character with String.charAt().
Check whether the Unicode value is one you want to remove.
If so, drop the character (i.e. replace it with "").
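
The steps above can be sketched roughly as follows. Because casting a `char` to `int` only gives a UTF-16 code unit, this sketch reads whole code points with `codePointAt` instead. The class name, the `removeIf` method and its predicate parameter are hypothetical names chosen for illustration.

```java
import java.util.function.IntPredicate;

final class CodePointFilter {

    // Walks the string one code point at a time and drops every code
    // point for which the (hypothetical) predicate returns true.
    static String removeIf(String s, IntPredicate shouldRemove) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);      // full code point, even for a surrogate pair
            if (!shouldRemove.test(cp)) {
                sb.appendCodePoint(cp);
            }
            i += Character.charCount(cp);   // 1 char for BMP, 2 for supplementary
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String input = "x\uD83D\uDE00y";    // 'x', U+1F600, 'y'
        System.out.println((int) input.charAt(1));  // prints 55357 (high surrogate, not the code point)
        System.out.println(input.codePointAt(1));   // prints 128512 (U+1F600)
        System.out.println(removeIf(input, Character::isSupplementaryCodePoint)); // prints "xy"
    }
}
```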

This is just to start you off; if you're still struggling, I can try to type out a whole example.

Good luck!

score 0 · Answer 3 · answered Nov 13 '17 at 15:38

Here is a code snippet that collects characters between code point 60 and 100:

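public class Test {

    public static void main(String[] args) {
        new Test().go();
    }

    private void go() {
        String s = "ABC12三￮";
        StringBuilder ret = new StringBuilder();
        // Advance by whole code points so a surrogate pair is visited only once.
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            System.out.println("code point: " + cp);

            // Keep only code points strictly between 60 and 100.
            if (cp > 60 && cp < 100) {
                ret.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }

        System.out.println("result: " + ret);
    }
}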

The result:

code point: 65
code point: 66
code point: 67
code point: 49
code point: 50
code point: 19977
code point: 65518
result: ABC

Hope this helps.

Remy Lebeau · Answer 4 · 2017-11-15T01:21:52.410

Java strings are UTF-16 encoded. The String type has a codePointAt() method for retrieving a decoded codepoint at a given char (codeunit) index.

So, you can do something like this, for instance:

String removeSupplementaryChars(String s)
{
    int len = s.length();
    if (len == 0)
        return "";

    StringBuilder sb = new StringBuilder(len);
    int i = 0;

    do
    {
        // A BMP code point occupies a single char, so charAt(i) is the whole character.
        if (s.codePointAt(i) <= 0xFFFF)
            sb.append(s.charAt(i));

        i = s.offsetByCodePoints(i, 1);
    }
    while (i < len);

    return sb.toString();
}

Or this:

String removeCodepointsinRange(String s, int lower, int upper)
{
    int len = s.length();
    if (len == 0)
        return "";

    StringBuilder sb = new StringBuilder(len);
    int i = 0;

    do
    {
        int cp = s.codePointAt(i);

        if ((cp < lower) || (cp > upper))
            sb.appendCodePoint(cp);

        i = s.offsetByCodePoints(i, 1);
    }
    while (i < len);

    return sb.toString();
}
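
As a usage sketch for the question's second case (the range 1F000–1FFFF), the range-based idea above can be exercised like this. The wrapper class `RangeDemo`, the stream-based restatement of the method and the sample string are assumptions made for illustration, not part of the original answer.

```java
final class RangeDemo {

    // Compact restatement of the range-based removal: drops every code
    // point in [lower, upper] and keeps everything else.
    static String removeCodePointsInRange(String s, int lower, int upper) {
        StringBuilder sb = new StringBuilder(s.length());
        s.codePoints()
         .filter(cp -> cp < lower || cp > upper)
         .forEach(sb::appendCodePoint);
        return sb.toString();
    }

    public static void main(String[] args) {
        // "Hi" + U+1F600 (inside 1F000..1FFFF) + U+20000 (outside that range)
        String input = "Hi\uD83D\uDE00\uD840\uDC00";

        String out = removeCodePointsInRange(input, 0x1F000, 0x1FFFF);
        System.out.println(out.equals("Hi\uD840\uDC00")); // prints "true"

        // Passing the whole supplementary range removes both characters.
        System.out.println(removeCodePointsInRange(input, 0x10000, 0x10FFFF)); // prints "Hi"
    }
}
```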
