How can I replace non-printable Unicode characters in Java?

Question

The following will replace ASCII control characters (shorthand for [\x00-\x1F\x7F]):

my_string.replaceAll("\\p{Cntrl}", "?");

The following will replace everything outside the ASCII printable range (\p{Print} is shorthand for [\p{Graph}\x20]), which also replaces accented characters:

my_string.replaceAll("[^\\p{Print}]", "?");

However, neither works for Unicode strings. Does anyone have a good way to remove non-printable characters from a Unicode string?

Just as an addendum: the list of Unicode General Categories can be found in [UAX #44](http://unicode.org/reports/tr44/#GC_Values_Table) — McDowell, Jun 01 '11 at 10:32
Possible duplicate of [Fastest way to strip all non-printable characters from a Java String](http://stackoverflow.com/questions/7161534/fastest-way-to-strip-all-non-printable-characters-from-a-java-string) — Stewart, Oct 14 '16 at 17:34
@Stewart: hi, have you looked at the question/answers besides the title?!? — dagnelies, Oct 14 '16 at 18:09
@Stewart: that other question covers only the ascii subset of non-printable characters!!! — dagnelies, Oct 14 '16 at 18:12

score 167 · Accepted Answer · edited Jun 15 '15 at 00:10

my_string.replaceAll("\\p{C}", "?");

See more about Unicode regex. java.util.regex.Pattern and String.replaceAll support them.

edited Jun 15 '15 at 00:10

David Foerster

answered Jun 01 '11 at 09:56

Op De Cirkel

In Java 1.6 at least, there is no support for them. http://download.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html ...I also tried your line out, and besides missing a backslash, it plainly doesn't work. – dagnelies Jun 01 '11 at 10:00
This works: `char c = 0xFFFA; String.valueOf(c).replaceAll("\\p{C}", "?");` Also, the **Unicode support** section of the Pattern javadoc says it supports the categories. – Op De Cirkel Jun 01 '11 at 10:18
You are right! I apologize. I didn't notice it because I had to add the Zl and Zp categories, since those were mostly the source of issues. It works perfectly. Could you please make a mini edit to your post so I can vote it up again? – dagnelies Jun 01 '11 at 10:29
There are also invisible whitespace characters (like U+200B), which are part of the \p{Zs} group. Unfortunately, this group also includes normal spaces. For those who are trying to filter an input string that shouldn't contain any spaces, `s.replaceAll("[\\p{C}\\p{Z}]", "")` will do the trick. – Andrey L Aug 30 '13 at 14:35
This is what I was looking for, I was trying `replaceAll("[^\\u0000-\\uFFFF]", "")` but had no success – Bibaswann Bandyopadhyay Mar 31 '16 at 19:53
Thank you for your answer. We are using this filter, but then emoji characters got stripped. How do you strip non-printable characters without touching emoji? – quzhi65222714 Oct 23 '17 at 21:22
Why are you replacing them with "?"? – Roland May 21 '21 at 11:35
Attention: the solution presented here (with 150 upvotes) also replaces line breaks, which you might not want. – basZero Aug 20 '21 at 09:12
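
Regarding the line-break concern in the last comment: \r, \n and \t all belong to \p{Cc}, so \p{C} matches them. One possible workaround, sketched below, is a character-class intersection that excludes them. The class name `KeepLineBreaks` and the helper `replaceInvisible` are made up for this illustration and are not part of the answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class KeepLineBreaks {
    // Every character in category C ("Other"), minus CR, LF and TAB
    private static final Pattern OTHER_EXCEPT_BREAKS =
            Pattern.compile("[\\p{C}&&[^\\r\\n\\t]]");

    // Hypothetical helper: replaces each match with the given replacement text
    static String replaceInvisible(String input, String replacement) {
        // quoteReplacement keeps '$' and '\' in the replacement literal
        return OTHER_EXCEPT_BREAKS.matcher(input)
                .replaceAll(Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        String sample = "line1\r\nline2\tend\u0000\u0007";
        // The NUL and BEL characters become '?', the line break and tab stay
        System.out.println(replaceInvisible(sample, "?"));
    }
}
```

Passing an empty replacement removes the characters instead of marking them.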

score 71 · Answer 2 · answered Sep 03 '13 at 23:24

Op De Cirkel is mostly right. His suggestion will work in most cases:

myString.replaceAll("\\p{C}", "?");

But if myString might contain non-BMP codepoints then it's more complicated. \p{C} contains the surrogate codepoints of \p{Cs}. The replacement method above will corrupt non-BMP codepoints by sometimes replacing only half of the surrogate pair. It's possible this is a Java bug rather than intended behavior.

Using the other constituent categories is an option:

myString.replaceAll("[\\p{Cc}\\p{Cf}\\p{Co}\\p{Cn}]", "?");

However, solitary surrogate characters not part of a pair (each surrogate character has an assigned codepoint) will not be removed. A non-regex approach is the only way I know to properly handle \p{C}:

StringBuilder newString = new StringBuilder(myString.length());
for (int offset = 0; offset < myString.length();)
{
    int codePoint = myString.codePointAt(offset);
    offset += Character.charCount(codePoint);

    // Replace invisible control characters and unused code points
    switch (Character.getType(codePoint))
    {
        case Character.CONTROL:     // \p{Cc}
        case Character.FORMAT:      // \p{Cf}
        case Character.PRIVATE_USE: // \p{Co}
        case Character.SURROGATE:   // \p{Cs}
        case Character.UNASSIGNED:  // \p{Cn}
            newString.append('?');
            break;
        default:
            newString.append(Character.toChars(codePoint));
            break;
    }
}
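
For anyone who wants to try the loop above, here is a small self-contained sketch that wraps it in a method and feeds it a string containing a supplementary character and an unpaired surrogate. The class name `CodePointFilter` and the method `replaceOther` are invented for this example.

```java
class CodePointFilter {
    // Hypothetical wrapper around the code-point loop shown above
    static String replaceOther(String input, char replacement) {
        StringBuilder out = new StringBuilder(input.length());
        for (int offset = 0; offset < input.length(); ) {
            int codePoint = input.codePointAt(offset);
            offset += Character.charCount(codePoint);
            switch (Character.getType(codePoint)) {
                case Character.CONTROL:     // \p{Cc}
                case Character.FORMAT:      // \p{Cf}
                case Character.PRIVATE_USE: // \p{Co}
                case Character.SURROGATE:   // \p{Cs}, only reached for unpaired surrogates
                case Character.UNASSIGNED:  // \p{Cn}
                    out.append(replacement);
                    break;
                default:
                    out.appendCodePoint(codePoint);
                    break;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // 'a', U+1D11E (a valid surrogate pair), 'b', a lone high surrogate, 'c', NUL
        String sample = "a\uD834\uDD1Eb\uD834c\u0000";
        System.out.println(replaceOther(sample, '?'));
    }
}
```

Because `codePointAt` returns the full code point for a valid pair, the pair is kept intact and only the lone surrogate and the NUL are replaced.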

score 11 · Answer 3 · edited Jul 16 '21 at 02:15

The methods below may help, depending on your goal:

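// Removes everything outside the 7-bit ASCII range
public static String removeNonAscii(String str)
{
    return str.replaceAll("[^\\x00-\\x7F]", "");
}

// Removes every character in Unicode category C ("Other"):
// control, format, private-use, surrogate and unassigned
public static String removeNonPrintable(String str)
{
    return str.replaceAll("[\\p{C}]", "");
}

// Same as above, except that surrogates (\p{Cs}) are left alone
public static String removeSomeControlChar(String str)
{
    return str.replaceAll("[\\p{Cntrl}\\p{Cc}\\p{Cf}\\p{Co}\\p{Cn}]", "");
}

// \r, \n and \t are already in \p{Cc}, so the second replaceAll finds nothing left to remove
public static String removeFullControlChar(String str)
{
    return removeNonPrintable(str).replaceAll("[\\r\\n\\t]", "");
}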

score 8 · Answer 4 · edited Jun 01 '11 at 10:48

You may be interested in the Unicode categories "Other, Control" and possibly "Other, Format" (unfortunately the latter seems to contain both unprintable and printable characters).

In Java regular expressions you can check for them using \p{Cc} and \p{Cf} respectively.
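
To see which category a given character falls into, a quick check along these lines might help. It is only a sketch; the class name `CategoryCheck` and both helper methods are hypothetical.

```java
class CategoryCheck {
    // True if the string is exactly one "Other, Control" character
    static boolean isControl(String s) {
        return s.matches("\\p{Cc}");
    }

    // True if the string is exactly one "Other, Format" character
    static boolean isFormat(String s) {
        return s.matches("\\p{Cf}");
    }

    public static void main(String[] args) {
        // BEL (U+0007), SOFT HYPHEN (U+00AD), LEFT-TO-RIGHT MARK (U+200E), plain 'A'
        String[] samples = { "\u0007", "\u00AD", "\u200E", "A" };
        for (String s : samples) {
            System.out.printf("U+%04X Cc=%b Cf=%b%n",
                    s.codePointAt(0), isControl(s), isFormat(s));
        }
    }
}
```

The soft hyphen is an example of a "Format" character that some renderers do display, which is why \p{Cf} needs a bit of care.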

edited Jun 01 '11 at 10:48

answered Jun 01 '11 at 09:44

Joachim Sauer

Well, too bad java expressions don't have them, but at least I got the list right now... better than nothing. thanks – dagnelies Jun 01 '11 at 09:56

score 0 · Answer 5 · answered Sep 27 '18 at 11:13

I have used this simple function for this:

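// Requires: import java.util.regex.Pattern;
// Matches any character outside the printable ASCII range ' ' (0x20) to '~' (0x7E)
private static final Pattern pattern = Pattern.compile("[^ -~]");

private static String cleanTheText(String text) {
    // Remove every match; find() plus replace() would only handle the first distinct character found
    return pattern.matcher(text).replaceAll("");
}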

Hope this is useful.

answered Sep 27 '18 at 11:13

user1300830

What does that regular expression do? – ziggy Feb 24 '23 at 12:38

score 0 · Answer 6 · answered Oct 25 '18 at 21:04

Based on the answers by Op De Cirkel and noackjr, the following is what I do for general string cleaning: 1. trimming leading or trailing whitespaces, 2. dos2unix, 3. mac2unix, 4. removing all "invisible Unicode characters" except whitespaces:

myString.trim.replaceAll("\r\n", "\n").replaceAll("\r", "\n").replaceAll("[\\p{Cc}\\p{Cf}\\p{Co}\\p{Cn}&&[^\\s]]", "")

Tested with Scala REPL.
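
Since the question is about Java, here is roughly how the same chain might look there. This is a sketch, not something from the answer itself; the class name `CleanString` and the method `clean` are made up.

```java
class CleanString {
    // Hypothetical Java version of the chain above
    static String clean(String s) {
        return s.trim()                     // 1. strip leading/trailing whitespace
                .replaceAll("\r\n", "\n")   // 2. dos2unix
                .replaceAll("\r", "\n")     // 3. mac2unix
                // 4. drop control, format, private-use and unassigned characters,
                //    but keep anything \s matches (space, tab, newline, ...)
                .replaceAll("[\\p{Cc}\\p{Cf}\\p{Co}\\p{Cn}&&[^\\s]]", "");
    }

    public static void main(String[] args) {
        // Contains NUL, a Windows line break, U+FEFF (a format character), a tab and a bare CR
        String sample = "  a\u0000b\r\nc\uFEFFd\te\rf  ";
        System.out.println(clean(sample));
    }
}
```

Note that `trim()` only strips characters up to U+0020 at the ends, so other Unicode spaces at the edges are left in place.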

score 0 · Answer 7 · edited Jan 24 '19 at 09:23

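I propose handling non-BMP (supplementary) characters as shown below. As written, the method substitutes "?" for each supplementary code point; leave out the append in the first branch to remove them instead of replacing them.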

private String removeNonBMPCharacters(final String input) {
    StringBuilder strBuilder = new StringBuilder();
    input.codePoints().forEach((i) -> {
        if (Character.isSupplementaryCodePoint(i)) {
            strBuilder.append("?");
        } else {
            strBuilder.append(Character.toChars(i));
        }
    });
    return strBuilder.toString();
}

score 0 · Answer 8 · answered Apr 13 '21 at 09:45

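A variant with optional multi-language support (keeps Latin-1 characters when the flag is set):

public static String cleanUnprintableChars(String text, boolean multilanguage)
{
    // strips everything outside Latin-1 (multilanguage) or outside ASCII
    String regex = multilanguage ? "[^\\x00-\\xFF]" : "[^\\x00-\\x7F]";
    text = text.replaceAll(regex, "");

    // erases the ASCII control characters, except \r, \n and \t
    text = text.replaceAll("[\\p{Cntrl}&&[^\r\n\t]]", "");

    // removes the remaining Unicode "Other" characters; \r, \n and \t are
    // excluded here as well, otherwise this step would undo the exception above
    text = text.replaceAll("[\\p{C}&&[^\r\n\t]]", "");

    return text.trim();
}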

score -4 · Answer 9 · edited May 23 '17 at 11:55

I have adapted the code from "Extract digits from a string in Java" for phone numbers such as +9 (987) 124124:

// Requires: import java.nio.CharBuffer;
public static String stripNonDigitsV2( CharSequence input ) {
    if (input == null)
        return null;
    if ( input.length() == 0 )
        return "";

    char[] result = new char[input.length()];
    int cursor = 0;
    CharBuffer buffer = CharBuffer.wrap( input );
    int i = 0;
    while ( i < buffer.length() ) {
        char chr = buffer.get(i);
        if (chr == 'u') {
            // Skip ahead past what is presumably a literal "uXXXX" escape
            i = i + 5;
            if (i >= buffer.length())
                break; // nothing left after the skipped sequence
            chr = buffer.get(i);
        }

        // Keeps code units 40..57: ( ) * + , - . / and the digits 0-9
        if ( chr > 39 && chr < 58 )
            result[cursor++] = chr;
        i = i + 1;
    }

    return new String( result, 0, cursor );
}
