3

I am trying to remove symbols and special characters from raw text in Java and cannot find a way to do it. The text comes from a free-text field on a website and may contain literally anything. It comes from an external source whose settings I cannot change, so I have to handle it on my end. Some examples:

1) belem should be--> belem

2) Ariana should be--> Ariana

3) Harlem should be--> Harlem

4) Yz ️‍ should be--> Yz

5) ここさけは7回は見に行くぞ should be--> ここさけは7回は見に行くぞ

6) دمي ازرق وطني ازرق should be--> دمي ازرق وطني ازرق

Any help please?

user3212493

3 Answers

2

You can try this regex, which matches many (though not all) emoji. It covers the supplementary ranges U+1F000 to U+1F7FF and the BMP range U+2600 to U+27FF:

regex = "[\\ud83c\\udc00-\\ud83c\\udfff]|[\\ud83d\\udc00-\\ud83d\\udfff]|[\\u2600-\\u27ff]"

Then remove every match with the replaceAll() method:

String text = "ここさけは7回は見に行くぞ ";
String regex = "[\\ud83c\\udc00-\\ud83c\\udfff]|[\\ud83d\\udc00-\\ud83d\\udfff]|[\\u2600-\\u27ff]";
System.out.println(text.replaceAll(regex, ""));

Output:

ここさけは7回は見に行くぞ 
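
To see what these ranges do and do not cover, here is a minimal self-contained sketch. The class name `EmojiRangeDemo` and the sample strings are made up for illustration; the pattern expresses the same ranges with `\x{...}` code point escapes instead of surrogate pairs. Anything outside the ranges, for example U+1F914 or the variation selector U+FE0F, is left untouched, so treat this as a starting point rather than a complete emoji filter.

```java
public class EmojiRangeDemo {
    // Same ranges as the regex above: U+1F000-U+1F7FF and U+2600-U+27FF.
    static final String EMOJI_RANGES = "[\\x{1F000}-\\x{1F7FF}\\x{2600}-\\x{27FF}]";

    static String stripEmoji(String s) {
        return s.replaceAll(EMOJI_RANGES, "");
    }

    public static void main(String[] args) {
        // U+1F600 (grinning face) and U+2764 (heavy black heart) are in range.
        System.out.println("[" + stripEmoji("Harlem \uD83D\uDE00\u2764") + "]");
        // U+1F914 (thinking face) lies outside the ranges and is not removed.
        System.out.println("[" + stripEmoji("Yz \uD83E\uDD14") + "]");
    }
}
```

The brackets in the printed output only make trailing whitespace visible.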
Oghli
  • @user3212493 if you find the answer helpful, mark it as accepted so it can serve as a reference in the future. – Oghli Jun 19 '17 at 20:05
1

If by "special characters" you mean characters encoded as surrogate pairs, try this.

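static String removeSpecial(String s) {
    // Keep only code points below U+D800. This drops every supplementary
    // character (those stored as surrogate pairs) and also BMP code points
    // from U+E000 upward, such as the variation selector U+FE0F.
    int[] r = s.codePoints()
        .filter(c -> c < Character.MIN_SURROGATE)
        .toArray();
    return new String(r, 0, r.length);
}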

Test it with:

String[] testStrs = {
    "belem ",
    "Ariana ",
    "Harlem ",
    "Yz ️‍",
    "ここさけは7回は見に行くぞ",
    "دمي ازرق وطني ازرق "
};

for (String s : testStrs)
    System.out.println(removeSpecial(s));

Output:

belem 
Ariana 
Harlem 
Yz ‍
ここさけは7回は見に行くぞ
دمي ازرق وطني ازرق 
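
The filter above keeps the zero-width joiner (U+200D), so example 4 does not come out as plain "Yz". One possible variant, offered as a sketch rather than a definitive fix: drop supplementary code points with `Character.isSupplementaryCodePoint`, also drop U+200D and the variation selectors U+FE00 to U+FE0F, then trim. The class name `SupplementaryFilterDemo` and the sample strings are hypothetical. Two caveats: BMP symbols such as U+2764 still pass through, and removing U+200D can change how text in some scripts renders.

```java
public class SupplementaryFilterDemo {
    static String removeSpecial(String s) {
        int[] kept = s.codePoints()
            // Drop everything that needs a surrogate pair in UTF-16.
            .filter(c -> !Character.isSupplementaryCodePoint(c))
            // Drop ZERO WIDTH JOINER, used to glue emoji sequences together.
            .filter(c -> c != 0x200D)
            // Drop variation selectors U+FE00-U+FE0F.
            .filter(c -> c < 0xFE00 || c > 0xFE0F)
            .toArray();
        // trim() removes the whitespace left behind at either end.
        return new String(kept, 0, kept.length).trim();
    }

    public static void main(String[] args) {
        String[] samples = { "belem \uD83D\uDE00", "Yz \uFE0F\u200D" };
        for (String s : samples) {
            System.out.println("[" + removeSpecial(s) + "]");
        }
    }
}
```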
0

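Use a negated character class that keeps whitespace plus any letter or number from any language. In Java, \p{Alnum} matches only ASCII letters and digits unless the (?U) flag (UNICODE_CHARACTER_CLASS) is set, so use the Unicode categories \p{L} (letter) and \p{N} (number) instead:

str = str.replaceAll("[^\\s\\p{L}\\p{N}]", "");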
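
How `\p{Alnum}` behaves in Java depends on the `UNICODE_CHARACTER_CLASS` flag, and the small sketch below contrasts the two modes. The class name `AlnumFlagDemo`, its method names, and the sample string (four hiragana letters, a digit, and U+1F600) are assumptions made for illustration only.

```java
public class AlnumFlagDemo {
    // Default mode: \p{Alnum} is ASCII-only, so non-Latin letters are removed too.
    static String asciiOnly(String s) {
        return s.replaceAll("[^\\s\\p{Alnum}]", "");
    }

    // (?U) enables UNICODE_CHARACTER_CLASS, making \p{Alnum} and \s Unicode-aware.
    static String unicodeAware(String s) {
        return s.replaceAll("(?U)[^\\s\\p{Alnum}]", "");
    }

    public static void main(String[] args) {
        // Four hiragana letters, a space, the digit 7, a space, then U+1F600.
        String sample = "\u3053\u3053\u3055\u3051 7 \uD83D\uDE00";
        System.out.println("[" + asciiOnly(sample) + "]");
        System.out.println("[" + unicodeAware(sample) + "]");
    }
}
```

With the flag set, letters and digits from any script are kept while the emoji is removed; without it, only the ASCII digit and the whitespace remain.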
Bohemian