40

I have a string coming from the UI that may contain control characters, and I want to remove all control characters except carriage returns, line feeds, and tabs.

Right now I can find two ways to remove all control characters:

1. Using Guava:

return CharMatcher.JAVA_ISO_CONTROL.removeFrom(string);

2. Using a regex:

return string.replaceAll("\\p{Cntrl}", "");
james.garriss
Mahmoud Saleh

8 Answers

31

You can do something like this if you want to delete all characters in the Unicode "Other, Control" (Cc) category:

System.out.println(
    "a\u0000b\u0007c\u008fd".replaceAll("\\p{Cc}", "")
); // abcd

Note: This removes the actual Unicode character '\u008f' (among others) from the string, not the escaped string form "%8F".

Courtesy: polygenelubricants (Replace Unicode Control Characters)

Nidhish Krishnan
  • 2
    This doesn't do what the author wanted: he also wanted to preserve carriage returns, line feeds and tabs. The above code removes those too. – Krzysztof Krasoń Jun 19 '14 at 08:44
  • Thanks a lot! I spent all my day to find this bug in my code. Soap http request from java was returning http status 400 but soap-ui like test environments or curl were working properly with the "same" request xml. At last I found those "invisible" chars. :) – csonuryilmaz Oct 09 '14 at 16:53
  • 1
    At least put the answer in your own words..... http://stackoverflow.com/a/3439206/2347824 – ethanbustad Dec 22 '14 at 19:11
20

One option is to use a combination of CharMatchers:

CharMatcher charsToPreserve = CharMatcher.anyOf("\r\n\t");
CharMatcher allButPreserved = charsToPreserve.negate();
CharMatcher controlCharactersToRemove = CharMatcher.JAVA_ISO_CONTROL.and(allButPreserved);

Then use removeFrom as before. I don't know how efficient it is, but it's at least simple.


As noted in edits, JAVA_ISO_CONTROL is now deprecated in Guava; the javaIsoControl() method is preferred.
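
For readers without Guava on the classpath, the same idea can be sketched in plain Java. This is only an illustration, not part of the answer above: the class and method names are hypothetical, and it relies on Character.isISOControl, which is the JDK check that Guava documents javaIsoControl() as following.

```java
final class ControlCharStripper {

    /** Removes ISO control characters, keeping tab, line feed and carriage return. */
    static String stripControlChars(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            // The three characters the question asks to preserve.
            boolean preserved = c == '\t' || c == '\n' || c == '\r';
            if (preserved || !Character.isISOControl(c)) {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String cleaned = stripControlChars("a\u0000b\tc\u0007\r\n");
        System.out.println(cleaned.equals("ab\tc\r\n")); // prints true
    }
}
```

A single pass over the characters avoids both the matcher composition and any regex work, which may matter if the method is called on large inputs.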

Jon Skeet
  • `CharMatcher.javaIsoControl()` is now the one to use, as `JAVA_ISO_CONTROL` is deprecated. – Zon Feb 10 '20 at 11:16
14

This seems to be an option:

    String s = "\u0001\t\r\n".replaceAll("[\\p{Cntrl}&&[^\r\n\t]]", "");
    for (char c : s.toCharArray()) {
        System.out.print((int) c + " ");
    }

This prints 9 13 10, so the tab, carriage return and line feed survive, just as you asked: "except carriage returns, line feeds, and tabs".
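
If the replacement runs often, the same character class can be compiled once and reused. The sketch below is a hypothetical helper (the class, field and method names are not from the answer). Note that, by default, \p{Cntrl} in java.util.regex covers only the ASCII controls (\x00-\x1F and \x7F); swap in \p{Cc} if the C1 range U+0080 to U+009F should go as well.

```java
import java.util.regex.Pattern;

final class ControlCharRegex {

    // ASCII control characters, minus the three we want to keep (CR, LF, tab).
    private static final Pattern UNWANTED_CONTROLS =
            Pattern.compile("[\\p{Cntrl}&&[^\\r\\n\\t]]");

    static String clean(String input) {
        return UNWANTED_CONTROLS.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        String cleaned = clean("line1\u0001\r\nline2\u0007\tend");
        System.out.println(cleaned.equals("line1\r\nline2\tend")); // prints true
    }
}
```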

Evgeniy Dorofeev
9

Use these:

public static String removeNonAscii(String str) {
    // Drops everything outside the 7-bit ASCII range.
    return str.replaceAll("[^\\x00-\\x7F]", "");
}

public static String removeNonPrintable(String str) { // all control chars (Unicode category C, "Other")
    return str.replaceAll("\\p{C}", "");
}

public static String removeOthersControlChar(String str) { // some control chars (selected subcategories)
    return str.replaceAll("[\\p{Cntrl}\\p{Cc}\\p{Cf}\\p{Co}\\p{Cn}]", "");
}

public static String removeAllControlChars(String str) {
    // \p{C} already covers CR, LF and tab; the second pass is only a safeguard.
    return removeNonPrintable(str).replaceAll("[\\r\\n\\t]", "");
}
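
None of the helpers above keeps carriage returns, line feeds or tabs, which the question asks for. A minimal sketch of how the broader \p{C} class could be combined with a character-class intersection to preserve them follows; the class and method names are hypothetical, and exactly which characters fall under \p{C} depends on the Unicode tables of the JDK in use.

```java
import java.util.regex.Pattern;

final class InvisibleCharFilter {

    // Everything in the "Other" category except CR, LF and tab.
    private static final Pattern UNWANTED =
            Pattern.compile("[\\p{C}&&[^\\r\\n\\t]]");

    static String clean(String input) {
        return UNWANTED.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // U+200B (zero width space) is a format character (Cf), U+0000 a control (Cc).
        String cleaned = clean("a\u200Bb\u0000\tc\r\n");
        System.out.println(cleaned.equals("ab\tc\r\n")); // prints true
    }
}
```

Compared with \p{Cc} alone, this also drops invisible format characters such as zero width spaces, which may or may not be desirable for text coming from a UI.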
Ali Bagheri
1

In a Java regular expression, you can exclude characters from a character class by intersecting it with a negated class (&&[^...]). Here is a sample program that demonstrates the idea:

class Test {
    public static void main(String[] args) {
        String testStr = "abcdefABCDEF";
        System.out.println(testStr);
        // Remove every lowercase letter except 'c' and 'd'.
        System.out.println(testStr.replaceAll("[\\p{Lower}&&[^cd]]", ""));
    }
}

It will produce this output:

abcdefABCDEF
cdABCDEF
Raymond Tau
1

I'm using Selenium to test web screens. I use Hamcrest asserts and matchers to search the page source for different strings based on various conditions.

String pageSource = browser.getPageSource();
assertThat("Text not found!", pageSource, containsString(text));

This works fine with an IE or Firefox driver, but it fails with the HtmlUnitDriver, which formats the page source with tabs, carriage returns, and other control characters. I am using a riff on Nidhish Krishnan's ingenious answer above. If I use Nidhish's solution "out of the box," I am left with extra spaces, so I added a private method named filterTextForComparison:

String pageSource = filterTextForComparison(browser.getPageSource());
assertThat("Text not found!", pageSource, 
        containsString(filterTextForComparison(text)));

And the function:

/**
 * Filter out any characters embedded in the text that will interfere with
 * comparing Strings.
 * 
 * @param text
 *            the text to filter.
 * @return the text with any extraneous character removed.
 */
private String filterTextForComparison(String text) {

    String filteredText = text;

    if (filteredText != null) {
        filteredText = filteredText.replaceAll("\\p{Cc}", " ").replaceAll("\\s{2,}", " ");
    }

    return filteredText;
}

First, the method replaces each control character with a space; then it collapses runs of whitespace into a single space. I tried doing everything at once with "\p{Cc}+?", but that did not reduce "\t " to a single " ".
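
One likely reason the single-regex attempt fell short is that the space character is not in \p{Cc}, and the reluctant +? matches only one control character at a time, so "\t " turns into two spaces. A possible one-pass alternative, offered only as a sketch with hypothetical names, is to match runs of control characters and whitespace together:

```java
final class WhitespaceNormalizer {

    static String normalize(String text) {
        if (text == null) {
            return null;
        }
        // One pass: any run of control characters and/or whitespace becomes a single space.
        return text.replaceAll("[\\p{Cc}\\s]+", " ");
    }

    public static void main(String[] args) {
        String result = normalize("foo\t bar\r\n\r\nbaz");
        System.out.println(result.equals("foo bar baz")); // prints true
    }
}
```

Because the default \s in Java consists of the space plus characters that are themselves in \p{Cc}, this should give the same result as the two-step version for ordinary input.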

Steve Gelman
-1

Use StringUtils.deleteWhitespace(text) from Apache Commons Lang.

Martin Schröder
-1

You can use StringUtils from Spring:

String str = "\n\t\t\tsome text\t\t\n";
String result = StringUtils.trimAllWhitespace(str); // "sometext" (inner whitespace is removed too)
ivvasch
    The question asks for ways to remove control characters EXCEPT whitespace control characters; this solution removes ONLY whitespace control characters. Also this solution removes space characters, which are not usually considered control characters. – user9712582 May 12 '22 at 02:56