
I want to remove all non-printable ASCII characters from a string while retaining invisible ones. I thought this would work because whitespace, \n and \r are invisible characters but not non-printable? Basically, I am getting a byte array with � characters in it, and I don't want them there. So I am trying to convert it to a string and remove the � characters before using it as a byte array again.

Space works fine in my code now, but \r and \n do not. What would be the correct regex to retain these as well? Or is there a better way than what I am doing?

public void write(byte[] bytes, int offset, int count) {
    try {
        // Note: offset and count are not used here; the whole array is decoded.
        String str = new String(bytes, "ASCII");
        // Keep printable ASCII plus tab and newline; everything else is removed.
        String str2 = str.replaceAll("[^\\p{Print}\\t\\n]", "");
        GraphicsTerminalActivity.sendOverSerial(str2.getBytes("ASCII"));
    } catch (UnsupportedEncodingException e) {
        e.printStackTrace();
    }
}

EDIT: I tried [^\x00-\x7F] (\x00-\x7F being the range of ASCII characters), but the � symbols still get through. Weird.

Paul
  • Don't use \\t and \\n inside the regex. Use them normally, as \t and \n, since they are not regex character classes like \w, \d or \s. – Pshemo Jan 28 '13 at 16:01
  • These characters are probably not non-printable characters, but (Unicode) characters which your font does not support. Please provide us with an example string, possibly also piped through `od -t u1`. – Jens Erat Jan 28 '13 at 16:03
  • OK, I've stopped using \\t\\n; the behaviour is the same. – Paul Jan 28 '13 at 16:10
  • @Ranon Yes, I believe those characters are Unicode characters. This is the character I'm receiving: http://www.fileformat.info/info/unicode/char/fffd/index.htm When I type any character in a terminal emulator, such as g, I get the string "g���\r\n". So I want to remove the occurrences of �. I think the code is \uFFFD. These are correctly removed by my statement, but so are \r, \n and \b, which I need to retain. – Paul Jan 28 '13 at 16:13
  • I have found that java.lang.Character provides all the required functionality for character filtering. Maybe you do not need a regular expression after all. I have implemented a character filter for various junk characters that get pasted into text areas by Word users and did not need anything other than this class. – dkateros Jan 28 '13 at 17:13
  • [FFFD is a special unicode character](http://www.fileformat.info/info/unicode/char/fffd/index.htm) representing characters that cannot be encoded in Unicode. You had better find out where these are coming from; something could be going wrong somewhere else... – Jens Erat Jan 28 '13 at 17:46
  • dkateros, how would you use it in this case? Do you specify the characters you want or the ones you don't want? Ranon, they are coming from a library I use, so I have to filter them out, as it is not my code. – Paul Jan 29 '13 at 09:55
  • Possible duplicate of [Fastest way to strip all non-printable characters from a Java String](http://stackoverflow.com/questions/7161534/fastest-way-to-strip-all-non-printable-characters-from-a-java-string) – Stewart Oct 14 '16 at 17:34
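
For context on the comments above, here is a minimal sketch of one plausible source of the � characters: decoding bytes outside the 7-bit range as ASCII. In a standard JDK, `new String(bytes, US_ASCII)` substitutes U+FFFD for every byte it cannot decode. The class name, method name, and sample bytes are hypothetical; whether this is what happens inside the asker's library is an assumption.

```java
import java.nio.charset.StandardCharsets;

public class ReplacementCharDemo {
    // Decode raw bytes as US-ASCII, the same way the question's code does.
    static String decodeAscii(byte[] raw) {
        return new String(raw, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // Hypothetical sample: 'g', three bytes above 0x7F, then CR LF.
        byte[] raw = { 'g', (byte) 0xE2, (byte) 0x80, (byte) 0x99, '\r', '\n' };
        String decoded = decodeAscii(raw);
        // Print each decoded char as a code point; the three high bytes
        // each show up as U+FFFD.
        for (char c : decoded.toCharArray()) {
            System.out.printf("U+%04X%n", (int) c);
        }
    }
}
```

If this is the cause, the � is created by the decoding step itself, so filtering the raw bytes before any conversion to a String may be the simpler route.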

2 Answers


The following regex matches runs of characters that are not in the listed set of control characters (so \t, \n and \r are allowed):

[^\x00\x08\x0B\x0C\x0E-\x1F]*

The following regex finds the control characters in that set:

[\x00\x08\x0B\x0C\x0E-\x1F]

Java code:

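import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// subjectString is the String to be checked
boolean foundMatch = false;
try {
    Pattern regex = Pattern.compile("[\\x00\\x08\\x0B\\x0C\\x0E-\\x1F]");
    Matcher regexMatcher = regex.matcher(subjectString);
    foundMatch = regexMatcher.find();
    // Replace the found text with whatever you want, e.g. regexMatcher.replaceAll("")
} catch (PatternSyntaxException ex) {
    // Syntax error in the regular expression
}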
abc123
  • I need certain non-printable ones to get through, such as \r, \n and \b, but I need to remove the other non-printable characters that are causing � to appear. For instance, [^\x00-\x7F] allows everything through, but \p{Print} stops \n, \r and \b as well as the incorrect characters. – Paul Jan 29 '13 at 11:34
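
The requirement in the comment above (keep \r, \n and \b, drop everything else that is not printable ASCII) could be expressed as a single negated character class. This is only a sketch: the class and method names are hypothetical, and keeping \t as well is an assumption carried over from the question's original pattern.

```java
import java.util.regex.Pattern;

public class ControlCharFilter {
    // Matches any char that is neither printable ASCII (0x20-0x7E)
    // nor one of backspace (0x08), tab, line feed, carriage return.
    private static final Pattern UNWANTED =
            Pattern.compile("[^\\x20-\\x7E\\x08\\t\\n\\r]");

    // Returns the input with every unwanted char removed.
    static String strip(String input) {
        return UNWANTED.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        String cleaned = strip("g\uFFFD\uFFFD\uFFFD\r\n");
        System.out.println(cleaned.length()); // g, \r and \n remain
    }
}
```

Because U+FFFD lies outside every range listed in the class, it is removed along with the remaining control characters.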

I would prefer a simpler solution here that filters the bytes directly. Note that your code ignores offset and count. The solution below overwrites the original array.

public void write(byte[] bytes, int offset, int count) {
    int writtenI = offset;
    for (int readI = offset; readI < offset + count; ++readI) {
        byte b = bytes[readI];
        if (32 <= b && b < 127) {
            // ASCII printable:
            bytes[writtenI] = bytes[readI]; // writtenI <= readI
            ++writtenI;
        }
    }
    byte[] bytes2 = new byte[writtenI - offset];
    System.arraycopy(bytes, offset, bytes2, 0, writtenI - offset);
    //String str = new String(bytes, offset, writtenI - offset, "ASCII");
    //bytes2 = str.getBytes("ASCII");
    GraphicsTerminalActivity.sendOverSerial(bytes2);
}
Joop Eggen
  • Thanks, I'll give this a go. Unfortunately my test cable broke, so I can't try it for a week. When you say // ASCII printable:, do you mean only printable ASCII characters get through? I need certain non-printable ones to get through, such as \r, \n and \b. For instance, [^\x00-\x7F] allows everything through, but \p{Print} stops \n, \r and \b as well as the incorrect characters. So for me it is not a case of dropping all non-printable characters. – Paul Jan 29 '13 at 11:30
  • You might change it to `0 <= b && b <= 127`. Or as byte is signed: `b >= 0`, with comment `// ASCII 7 bits range`. – Joop Eggen Jan 29 '13 at 12:13
  • Yeah, that is better, but for some reason the � characters still get through. I have no idea why. I'll have to do some more testing to see which range gets rid of them and which does not... thanks. – Paul Jan 29 '13 at 12:57
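
A variant of the byte-level approach discussed above, sketched under the assumption that the bytes to keep are printable ASCII plus \b, \t, \n and \r. The class and method names are hypothetical. Unlike the answer's code, this version leaves the input array untouched and returns a new one.

```java
import java.util.Arrays;

public class SerialByteFilter {
    // True for printable ASCII (32..126) and the control bytes to retain.
    // Bytes above 0x7F are negative in Java, so they fail every test here.
    static boolean keep(byte b) {
        return (b >= 32 && b < 127) || b == '\b' || b == '\t' || b == '\n' || b == '\r';
    }

    // Copies the kept bytes from bytes[offset .. offset + count) into a new array.
    static byte[] filter(byte[] bytes, int offset, int count) {
        byte[] out = new byte[count];
        int n = 0;
        for (int i = offset; i < offset + count; i++) {
            if (keep(bytes[i])) {
                out[n++] = bytes[i];
            }
        }
        return Arrays.copyOf(out, n); // trim to the number of bytes kept
    }
}
```

The result of `filter(bytes, offset, count)` could then be passed to `GraphicsTerminalActivity.sendOverSerial` in place of the re-encoded string, which avoids the String round trip entirely.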