14

I am getting surrogate characters in text that I save to MySQL 5.1. Since these characters are not supported there, I want to remove the surrogate pairs with a Java method before saving the text to the database.

I have written the following method for now and I am curious to know if there is a direct and optimal way to handle this.

Thanks in advance for your help.

public static String removeSurrogates(String query) {
    StringBuffer sb = new StringBuffer();
    for (int i = 0; i < query.length() - 1; i++) {
        char firstChar = query.charAt(i);
        char nextChar = query.charAt(i+1);
        if (Character.isSurrogatePair(firstChar, nextChar) == false) {
            sb.append(firstChar);
        } else {
            i++;
        }
    }
    if (Character.isHighSurrogate(query.charAt(query.length() - 1)) == false
            && Character.isLowSurrogate(query.charAt(query.length() - 1)) == false) {
        sb.append(query.charAt(query.length() - 1));
    }

    return sb.toString();
}
MaVRoSCy
Slowcoder

5 Answers

10

Here are a couple of things:

  • Character.isSurrogate(char c):

    A char value is a surrogate code unit if and only if it is either a low-surrogate code unit or a high-surrogate code unit.

  • Checking for pairs seems pointless. Why not just remove all surrogates?

  • x == false is equivalent to !x

  • StringBuilder is better in cases where you don't need synchronization (like a variable that never leaves local scope).

I suggest this:

public static String removeSurrogates(String query) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < query.length(); i++) {
        char c = query.charAt(i);
        // In Java 7+, this condition is simply !Character.isSurrogate(c)
        if (!(Character.isHighSurrogate(c) || Character.isLowSurrogate(c))) {
            sb.append(c);
        }
    }
    return sb.toString();
}

Breaking down the if statement

You asked about this statement:

if (!(Character.isHighSurrogate(c) || Character.isLowSurrogate(c))) {
    sb.append(c);
}

One way to understand it is to break each operation into its own function, so you can see that the combination does what you'd expect:

static boolean isSurrogate(char c) {
    return Character.isHighSurrogate(c) || Character.isLowSurrogate(c);
}

static boolean isNotSurrogate(char c) {
    return !isSurrogate(c);
}

...

if (isNotSurrogate(c)) {
    sb.append(c);
}
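The two readings discussed in the comments ("not (high or low)" versus "not high and not low") are the same condition by De Morgan's law. As a rough sanity check, the sketch below compares both forms for every possible `char` value. The class and method names are made up for illustration and are not part of the JDK.

```java
class SurrogateCheckForms {
    // Form used in the answer: negate the combined OR.
    static boolean orForm(char c) {
        return !(Character.isHighSurrogate(c) || Character.isLowSurrogate(c));
    }

    // Equivalent form: negate each test and combine with AND.
    static boolean andForm(char c) {
        return !Character.isHighSurrogate(c) && !Character.isLowSurrogate(c);
    }

    // Returns true if both forms give the same answer for all 65,536 char values.
    static boolean formsAgree() {
        for (int i = Character.MIN_VALUE; i <= Character.MAX_VALUE; i++) {
            char c = (char) i;
            if (orForm(c) != andForm(c)) {
                return false;
            }
        }
        return true;
    }
}
```

Both forms reject a low surrogate such as 0xDC00 and accept an ordinary character such as 'a'.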
Brendan Long
  • I am using JDK 1.6.0 and I cannot find a built-in Character.isSurrogate(c) method. Does it actually exist, or did you give it as an example? – Slowcoder Oct 12 '12 at 21:19
  • @Slowcoder Apparently it was added in Java 7. I switched to a version that works in Java 6. You can read the statement as "not a high surrogate or a low surrogate" instead of the (more complicated in my opinion) "not a high surrogate and not a low surrogate". – Brendan Long Oct 12 '12 at 21:23
  • If "c" is a low surrogate character, this code will append the character to "sb" because of the OR condition. Am I right? – Slowcoder Oct 12 '12 at 22:00
  • @Slowcoder No, check the parentheses. If `isLowSurrogate(c)` is true, then `isHighSurrogate(c) || isLowSurrogate(c)` is true (because `x || true` is true), so `!(isHighSurrogate(c) || isLowSurrogate(c))` is false, so it won't be appended. Feel free to use the other version if this is too confusing, but I'd advise learning how to handle complex logic statements since they come up sometimes (I took a Logic class for part of my philosophy credits and it was pretty useful). – Brendan Long Oct 12 '12 at 22:08
  • I added a break down into functions where each step is simple. This is what I recommend doing whenever a logic statement becomes too complicated to understand. – Brendan Long Oct 12 '12 at 22:15
  • My apologies, I didn't notice the missing negation for the low surrogate. It makes perfect sense now. Thanks. – Slowcoder Oct 12 '12 at 22:27
7

Java strings are stored as sequences of 16-bit chars, but what they represent is sequences of Unicode characters. In Unicode terminology, they are stored as code units, but model code points. Thus, it's somewhat meaningless to talk about removing surrogates, which don't exist in the character / code point representation (unless you have rogue single surrogates, in which case you have other problems).

Rather, what you want to do is to remove any characters which will require surrogates when encoded. That means any character which lies beyond the Basic Multilingual Plane (BMP). You can do that with a simple regular expression:

return query.replaceAll("[^\u0000-\uffff]", "");
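To make the code unit / code point distinction concrete, here is a minimal sketch that wraps the regular expression above in a method and counts the string both ways. The class and method names are assumptions chosen for illustration.

```java
class BmpFilter {
    // Removes every code point outside the Basic Multilingual Plane,
    // using the same regular expression as in the answer.
    static String removeSupplementary(String query) {
        return query.replaceAll("[^\u0000-\uffff]", "");
    }

    // Number of Unicode code points, as opposed to s.length(),
    // which counts 16-bit code units.
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }
}
```

For the string "a\uD83D\uDE00b" (the letter a, U+1F600, the letter b), length() is 4 while codePointCount is 3, and removeSupplementary returns "ab".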
Tom Anderson
2

Why not simply:

for (int i = 0; i < query.length(); i++) {
    char c = query.charAt(i);
    if (!Character.isHighSurrogate(c) && !Character.isLowSurrogate(c)) {
        sb.append(c);
    }
}

You should probably replace them with "?" instead of erasing them outright.
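A minimal sketch of the replacement idea, assuming you want exactly one placeholder per supplementary character rather than one per surrogate code unit. The class name, method name, and `placeholder` parameter are hypothetical. It uses only `codePointAt`, `Character.isSupplementaryCodePoint`, and `Character.charCount`, which are available in Java 5 and later.

```java
class SupplementaryReplacer {
    // Replaces each code point outside the BMP with a single placeholder char.
    // Unpaired surrogates are left untouched, because codePointAt returns
    // them as ordinary (non-supplementary) values.
    static String replaceSupplementary(String s, char placeholder) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (Character.isSupplementaryCodePoint(cp)) {
                sb.append(placeholder);
            } else {
                sb.append((char) cp);
            }
            // Advance by 2 code units for a surrogate pair, otherwise by 1.
            i += Character.charCount(cp);
        }
        return sb.toString();
    }
}
```

With this sketch, `replaceSupplementary("a\uD83D\uDE00b", '?')` yields "a?b".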

irreputable
  • Really helpful, Thanks. So I assume iterating by each character is the only way to remove them and there is no direct method that gets a string as a parameter and returns the string with surrogates removed. Am I right? – Slowcoder Oct 12 '12 at 21:23
  • No such method exists in the JDK. – irreputable Oct 12 '12 at 21:29
1

Just curious: if a char is a high surrogate, is there any need to check the next one? It is supposed to be a low surrogate. The modified version would be:

public static String removeSurrogates(String query) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < query.length(); i++) {
        char ch = query.charAt(i);
        if (Character.isHighSurrogate(ch))
            i++; // skip the next char, as it's supposed to be a low surrogate
        else
            sb.append(ch);
    }
    return sb.toString();
}
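One case where the check could matter is malformed input containing an unpaired high surrogate: skipping unconditionally also drops the ordinary character that follows it. The sketch below repeats the same loop under a hypothetical class and method name so the behavior can be tried on such input.

```java
class SkipAfterHighDemo {
    // Same logic as the method above: drop a high surrogate and
    // unconditionally skip whatever char comes after it.
    static String skipAfterHigh(String query) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < query.length(); i++) {
            char ch = query.charAt(i);
            if (Character.isHighSurrogate(ch)) {
                i++; // assumes the next char is the matching low surrogate
            } else {
                sb.append(ch);
            }
        }
        return sb.toString();
    }
}
```

For well-formed input such as "a\uD83D\uDE00b" the result is "ab". For the malformed string made of 'a', a lone 0xD83D, 'b', 'c', the result is "ac", so the 'b' is lost. A lone low surrogate is kept as-is, since only high surrogates trigger the skip.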
Fedor
0

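If you want to remove the surrogates, all of these solutions are useful. If you want to replace them, the following is better:

public static String replaceSurrogates(String s) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);
        if (Character.isHighSurrogate(c)) {
            // One '*' per pair; the low surrogate that follows is dropped below
            sb.append('*');
        } else if (!Character.isLowSurrogate(c)) {
            sb.append(c);
        }
    }
    return sb.toString();
}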