24

I get user input including non-ASCII characters and non-printable characters, such as

\xc2d
\xa0
\xe7
\xc3\ufffdd
\xc3\ufffdd
\xc2\xa0
\xc3\xa7
\xa0\xa0

for example:

email : abc@gmail.com\xa0\xa0
street : 123 Main St.\xc2\xa0

desired output:

  email : abc@gmail.com
  street : 123 Main St.

What is the best way to remove them in Java?
I tried the following, but it doesn't seem to work:

public static void main(String args[]) throws UnsupportedEncodingException {
        String s = "abc@gmail\\xe9.com";
        String email = "abc@gmail.com\\xa0\\xa0";

        System.out.println(s.replaceAll("\\P{Print}", ""));
        System.out.println(email.replaceAll("\\P{Print}", ""));
    }

Output

abc@gmail\xe9.com
abc@gmail.com\xa0\xa0
Raedwald
daydreamer
  • why do you want to remove them? – jtahlborn Jun 13 '12 at 18:17
  • 1
    @jtahlborn, Mongo fails to serialize these values – daydreamer Jun 13 '12 at 18:26
  • 1
    @daydreamer [citation needed] [\xc2d](http://www.codetable.net/hex/c2d) is a valid Unicode character. If MongoDB uses UTF-8, it should be able to serialize them. Perhaps you have an XY Problem here? How are you serializing your text? – Raedwald Nov 26 '18 at 11:11

7 Answers

58

Your requirements are not clear. All characters in a Java String are Unicode characters, so if you remove them, you'll be left with an empty string. I assume what you mean is that you want to remove any non-ASCII, non-printable characters.

String clean = str.replaceAll("\\P{Print}", "");

Here, \p{Print} represents a POSIX character class for printable ASCII characters, while \P{Print} is the complement of that class. With this expression, all characters that are not printable ASCII are replaced with the empty string. (The extra backslash is because \ starts an escape sequence in string literals.)
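To make the behaviour concrete, here is a minimal sketch, assuming the strings really do contain the characters U+00A0 and U+00E9 rather than a textual escape for them. The class and method names are made up for illustration.

```java
public class PrintableDemo {
    // Removes every character outside the POSIX \p{Print} class (printable ASCII).
    static String stripNonPrintable(String s) {
        return s.replaceAll("\\P{Print}", "");
    }

    public static void main(String[] args) {
        // These literals hold a real no-break space and a real e-acute,
        // written with Java's Unicode escapes.
        String email = "abc@gmail.com\u00a0\u00a0";
        String accented = "abc@gmail\u00e9.com";

        System.out.println(stripNonPrintable(email));    // abc@gmail.com
        System.out.println(stripNonPrintable(accented)); // abc@gmail.com
    }
}
```

Ordinary spaces (0x20) are part of \p{Print}, so "123 Main St." keeps its spaces.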


Apparently, all the input characters are actually ASCII characters that represent a printable encoding of non-printable or non-ASCII characters. Mongo shouldn't have any trouble with these strings, because they contain only plain printable ASCII characters.

This all sounds a little fishy to me. What I believe is happening is that the data really do contain non-printable and non-ASCII characters, and another component (like a logging framework) is replacing these with a printable representation. In your simple tests, you are failing to translate the printable representation back to the original string, so you mistakenly believe the first regular expression is not working.

That's my guess, but if I've misread the situation and you really do need to strip out literal \xHH escapes, you can do it with the following regular expression.

String clean = str.replaceAll("\\\\x\\p{XDigit}{2}", "");
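As a rough illustration of that second case, the sketch below applies the expression to the sample values from the question, assuming they arrive as plain ASCII text containing a backslash, an x, and two hex digits. The class and method names are hypothetical.

```java
public class EscapeStripDemo {
    // Removes literal four-character sequences: backslash, 'x', two hex digits.
    static String stripHexEscapes(String s) {
        return s.replaceAll("\\\\x\\p{XDigit}{2}", "");
    }

    public static void main(String[] args) {
        // In Java source, "\\x" is just a backslash followed by 'x'.
        String email = "abc@gmail.com\\xa0\\xa0";
        String street = "123 Main St.\\xc2\\xa0";

        System.out.println(stripHexEscapes(email));  // abc@gmail.com
        System.out.println(stripHexEscapes(street)); // 123 Main St.
    }
}
```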

The API documentation for the Pattern class does a good job of listing all of the syntax supported by Java's regex library. For more elaboration on what all of the syntax means, I have found the Regular-Expressions.info site very helpful.

erickson
  • this doesn't work. may be I am doing something incorrect, but not working – daydreamer Jun 18 '12 at 18:15
  • 1
    @daydreamer Can you provide an [SSCCE](http://sscce.org/) that shows what is not working? – erickson Jun 18 '12 at 18:19
  • public static void main(String args[]) throws UnsupportedEncodingException { String s = "abc@gmail\\xe9.com"; String email = "abc@gmail.com\\xa0\\xa0"; System.out.println(s.replaceAll("\\P{Print}", "")); System.out.println(email.replaceAll("\\P{Print}", "")); } out put - abc@gmail\xe9.com abc@gmail.com\xa0\xa0 – daydreamer Jun 18 '12 at 18:21
  • @daydreamer `\\x` doesn't mean anything special in Java source code. \\ in a `String` or `char` literal is an escape sequence that is replaced with \. If you want a Unicode escape, use `\uXXXX`, where XXXX is the Unicode point, in hexadecimal. – erickson Jun 18 '12 at 18:25
  • @daydreamer E.g. `String s = "abc@gmail\u00e9.com";` – erickson Jun 18 '12 at 18:27
  • ah I see, but the input I get is what I shared with you, does it mean it is not possible to strip it away? – daydreamer Jun 18 '12 at 18:28
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/12711/discussion-between-daydreamer-and-erickson) – daydreamer Jun 18 '12 at 18:29
  • No, you just need a different regular expression; your input is, apparently, all ASCII. Please see the update to my answer. – erickson Jun 18 '12 at 18:42
  • that seems to work, where I can learn more about creating such patterns like you created? please advice and thank you very much for your help, I appreciate it – daydreamer Jun 18 '12 at 18:46
  • @daydreamer I added a couple of links to great learning resources to the end of my answer. – erickson Jun 18 '12 at 18:52
  • how to remove this one � – Attaullah Oct 14 '19 at 00:03
  • "All characters in a Java String are Unicode characters, so if you remove them, you'll be left with an empty string." xD – paradocslover Mar 16 '21 at 09:37
16

With Google Guava's CharMatcher, you can remove any non-printable characters and then retain all ASCII characters (dropping any accents) like this:

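// Guava 19 and later expose these matchers as static methods; the older
// CharMatcher.INVISIBLE and CharMatcher.ASCII constants were deprecated and later removed.
String printable = CharMatcher.invisible().removeFrom(input);
String clean = CharMatcher.ascii().retainFrom(printable);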

Not sure if that's what you really want, but it removes anything expressed as escape sequences in your question's sample data.

Vsevolod Golovanov
Philipp Reichart
16

I know this may be late, but for future reference:

String clean = str.replaceAll("\\P{Print}", "");

This removes all non-printable characters, but that includes \n (line feed), \t (tab) and \r (carriage return), and sometimes you want to keep those.

For that problem use inverted logic:

String clean = str.replaceAll("[^\\n\\r\\t\\p{Print}]", "");
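A small sketch of the difference, with made-up class and method names: the no-break space and the zero-width space are dropped, while the line feed and tab survive.

```java
public class KeepWhitespaceDemo {
    // Keeps printable ASCII plus line feed, carriage return and tab; drops the rest.
    static String stripKeepingWhitespace(String s) {
        return s.replaceAll("[^\\n\\r\\t\\p{Print}]", "");
    }

    public static void main(String[] args) {
        // Contains a no-break space (U+00A0) and a zero-width space (U+200B).
        String input = "line one\u00a0\n\tline two\u200b";
        String clean = stripKeepingWhitespace(input);

        System.out.println(clean.equals("line one\n\tline two")); // true
    }
}
```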
Ivan Pavić
  • Upvoted for its particular usefulness in mongo-land, to keep the shell from spewing ridiculous amounts of encoded non-ascii stuff (mongo really really prefers utf-8 if you want things to be easy) – Mark Mullin Feb 01 '16 at 02:15
  • 3
    Got error: illegal escape character String clean = str.replaceAll("[^\n\r\t\p{Print}]", ""); . \p should be \P – Well Smith Apr 17 '18 at 15:35
  • Really helped me a lot Thanks @Ivan – Prinkal Kumar Jun 12 '18 at 05:01
4

You can try this code:

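public String cleanInvalidCharacters(String in) {
    if (in == null || in.isEmpty()) {
        return "";
    }
    StringBuilder out = new StringBuilder(in.length());
    // Iterate by code point rather than by char: a char is only 16 bits,
    // so it can never fall in the supplementary range 0x10000-0x10FFFF.
    for (int i = 0; i < in.length(); ) {
        int current = in.codePointAt(i);
        i += Character.charCount(current);
        if ((current == 0x9)
                || (current == 0xA)
                || (current == 0xD)
                || ((current >= 0x20) && (current <= 0xD7FF))
                || ((current >= 0xE000) && (current <= 0xFFFD))
                || ((current >= 0x10000) && (current <= 0x10FFFF))) {
            out.appendCodePoint(current);
        }
    }
    // Replace each remaining whitespace character (tab, CR, LF) with a plain space.
    return out.toString().replaceAll("\\s", " ");
}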

This works for me for removing invalid characters from a String.

Paulius Matulionis
2

You can use java.text.Normalizer.
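The answer gives no details, so the following is only one possible reading: decompose accented letters with Normalizer (NFD form), then drop everything outside ASCII, so that "ç" becomes "c" instead of disappearing. The class and method names are hypothetical, and ASCII control characters are not removed by this step.

```java
import java.text.Normalizer;

public class NormalizerDemo {
    // Splits accented letters into base letter + combining mark, then keeps only ASCII.
    static String toAscii(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("[^\\p{ASCII}]", "");
    }

    public static void main(String[] args) {
        // c-cedilla decomposes to 'c' plus a combining cedilla; the no-break space
        // has no canonical decomposition and is simply dropped.
        System.out.println(toAscii("fran\u00e7ais\u00a0")); // francais
    }
}
```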

exception
1

Input => "This \u7279text \u7279is what I need"
Output => "This text is what I need"

If you are trying to remove literal Unicode escape sequences from a string, as above, this code will work:

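// Matches a literal backslash, 'u' and four hex digits left in the text.
Pattern unicodeCharsPattern = Pattern.compile("\\\\u(\\p{XDigit}{4})");
Matcher unicodeMatcher = unicodeCharsPattern.matcher(data);
// replaceAll returns the input unchanged when nothing matches, so no find() check is needed.
String cleanData = unicodeMatcher.replaceAll("");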
0

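This simple JavaScript function (not Java) worked better for me:

function remove_non_ascii(str) {
    // Null or empty input: nothing to clean.
    if ((str === null) || (str === '')) {
        return false;
    }
    str = str.toString();

    // Drop everything outside printable ASCII (0x20-0x7E).
    return str.replace(/[^\x20-\x7E]/g, '');
}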