1

I have a Java string that contains supplementary characters (characters in the Unicode standard whose code points are above U+FFFF). These characters could for example be emojis. I want to remove those characters from the string, i.e. replace them with the empty string "".

  1. How do I remove supplementary characters from a string?
  2. How do I remove characters from an arbitrary code point range (for example, all characters in the range U+1F000–U+1FFFF)?
matthiash

4 Answers

4

There are a couple of approaches. As regex replace is expensive, maybe do:

String basic(String s) {
    StringBuilder sb = new StringBuilder();
    for (char ch : s.toCharArray()) {
        if (!Character.isLowSurrogate(ch) && !Character.isHighSurrogate(ch)) {
            sb.append(ch);
        }
    }
    return sb.length() == s.length() ? s : sb.toString();
}
Joop Eggen
  • Good solution! [The docs say:](http://www.oracle.com/us/technologies/java/supplementary-142654.html) "Supplementary characters are encoded in two code units, the first from the high-surrogates range (U+D800 to U+DBFF), the second from the low-surrogates range (U+DC00 to U+DFFF)." – matthiash Nov 14 '17 at 08:36
  • BTW, what would be the regex solution? – matthiash Nov 14 '17 at 08:37
  • 1
    @matthiash With regex there are two Unicode flags, `(?u)` and `(?U)`, and a `replaceAll("[\uD800-\uDBFF\uDC00-\uDFFF]", "")` should do. And in Java 8, `s.codePoints().filter(cp -> cp < 0x10000).toArray()` can be used to create a new String. – Joop Eggen Nov 14 '17 at 08:50
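
The two ideas in the comment above could be sketched roughly as follows. This is an illustration rather than the commenter's exact code: the class name `SupplementaryFilter` and both method names are made up for the example. The regex variant uses a character class written in terms of code points (`\x{...}`, available since Java 7), so it does not depend on how surrogate code units inside a character class are interpreted.

```java
class SupplementaryFilter {

    // Java 8+: keep only code points below U+10000 (the Basic Multilingual Plane)
    // and rebuild a String from the surviving code points.
    static String viaCodePoints(String s) {
        int[] kept = s.codePoints().filter(cp -> cp < 0x10000).toArray();
        return new String(kept, 0, kept.length);
    }

    // Regex variant: match every code point from U+10000 to U+10FFFF.
    static String viaRegex(String s) {
        return s.replaceAll("[\\x{10000}-\\x{10FFFF}]", "");
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00b"; // 'a', U+1F600 (an emoji), 'b'
        System.out.println(viaCodePoints(s)); // prints ab
        System.out.println(viaRegex(s));      // prints ab
    }
}
```

Both variants leave an unpaired surrogate in place, unlike the loop in the answer, which drops every surrogate code unit.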
0

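You can get a char's Unicode value by simply converting it to an int. (For a supplementary character this gives the value of one surrogate code unit, not the full code point.)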

Therefore, you'll want to do the following:

  • Convert your String to a char[], or loop over the String and read each character with String.charAt().
  • Check whether the Unicode value is one you want to remove.
  • If so, skip the character (in effect replacing it with "").
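
The steps above can be sketched roughly as follows. The helper name `removeCharsInRange` and its parameters are hypothetical. One assumption is worth stating loudly: the loop looks at UTF-16 code units one at a time, so it is only meaningful for ranges at or below U+FFFF. A supplementary character shows up here as two separate surrogate values.

```java
class CharRangeFilter {

    // Hypothetical helper: drops every char whose UTF-16 code unit value
    // lies in [lower, upper]. Only meaningful for ranges up to U+FFFF.
    static String removeCharsInRange(String s, int lower, int upper) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char ch = s.charAt(i);
            int value = ch; // widening a char to int gives its code unit value
            if (value < lower || value > upper) {
                sb.append(ch); // outside the range, so keep it
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Remove the ASCII digits '0'..'9' (U+0030 to U+0039).
        System.out.println(removeCharsInRange("ABC12", 0x30, 0x39)); // prints ABC
    }
}
```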

This is just to start you off; if you're still struggling, I can try to type out a whole example.

Good luck!

adickinson
0

Here is a code snippet that keeps only the characters whose code point is strictly between 60 and 100 (invert the condition to remove them instead):

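public class Test {

    public static void main(String[] args) {
        new Test().go();
    }

    private void go() {
        String s = "ABC12三\uFFEE"; // \uFFEE is the halfwidth white circle (65518)
        StringBuilder ret = new StringBuilder();
        // Advance by code point so a surrogate pair is never split.
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            System.out.println("code point: " + cp);

            // Keep only code points strictly between 60 and 100.
            if (cp > 60 && cp < 100) {
                ret.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }

        System.out.println("result: " + ret);
    }
}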

The result:

code point: 65
code point: 66
code point: 67
code point: 49
code point: 50
code point: 19977
code point: 65518
result: ABC

Hope this helps.

chris
0

Java strings are UTF-16 encoded. The String type has a codePointAt() method for retrieving a decoded codepoint at a given char (codeunit) index.

So, you can do something like this, for instance:

String removeSupplementaryChars(String s)
{
    int len = s.length();
    if (len == 0)
        return "";

    StringBuilder sb = new StringBuilder(len);
    int i = 0;

    do
    {
        if (s.codePointAt(i) <= 0xFFFF)
            sb.append(s.charAt(i));

        i = s.offsetByCodePoints(i, 1);
    }
    while (i < len);

    return sb.toString();
}

Or this:

String removeCodepointsinRange(String s, int lower, int upper)
{
    int len = s.length();
    if (len == 0)
        return "";

    StringBuilder sb = new StringBuilder(len);
    int i = 0;

    do
    {
        int cp = s.codePointAt(i);

        if ((cp < lower) || (cp > upper))
            sb.appendCodePoint(cp);

        i = s.offsetByCodePoints(i, 1);
    }
    while (i < len);

    return sb.toString();
}
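
For the range mentioned in the question, a call might look like the sketch below. It repeats a compact variant of the range filter so that it compiles on its own. The class name `RangeDemo` and the sample string are made up for the illustration, and the loop advances with `Character.charCount` instead of `offsetByCodePoints`, which should be equivalent here.

```java
class RangeDemo {

    // Compact variant of the range filter, repeated so the sketch is self-contained.
    static String removeCodePointsInRange(String s, int lower, int upper) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (cp < lower || cp > upper) {
                sb.appendCodePoint(cp); // outside the range, so keep it
            }
            i += Character.charCount(cp); // 2 for a surrogate pair, otherwise 1
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "ok \uD83D\uDE00 \uD835\uDD38"; // contains U+1F600 and U+1D538
        // Only U+1F600 falls inside 0x1F000..0x1FFFF; U+1D538 is kept.
        String cleaned = removeCodePointsInRange(s, 0x1F000, 0x1FFFF);
        System.out.println(cleaned.codePointCount(0, cleaned.length())); // prints 5
    }
}
```

Passing `0x10000` and `0x10FFFF` as the bounds would remove every supplementary character, which covers the first part of the question as well.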
Remy Lebeau