According to the documentation, those classes match US-ASCII characters only by default.
Specifically:
\p{Lower} A lower-case alphabetic character: [a-z]
\p{Upper} An upper-case alphabetic character: [A-Z]
\p{Alpha} An alphabetic character: [\p{Lower}\p{Upper}]
\p{Digit} A decimal digit: [0-9]
\p{Alnum} An alphanumeric character: [\p{Alpha}\p{Digit}]
\p{Graph} A visible character: [\p{Alnum}\p{Punct}]
\p{Print} A printable character: [\p{Graph}\x20]
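Putting it all together, \p{Print} matches only the printable US-ASCII characters: letters, digits, punctuation and the space, i.e. [\x20-\x7E]. Accented letters such as é fall outside that range.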
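A quick way to check this default behaviour is to test a few single characters against \p{Print} with no flags. This is only a minimal sketch; the class name PrintDefaultDemo and the helper matchesPrint are made up for the illustration.

```java
import java.util.regex.Pattern;

class PrintDefaultDemo {
    // No flags: the POSIX classes cover US-ASCII only.
    static final Pattern PRINT = Pattern.compile("\\p{Print}");

    // Hypothetical helper: true if the whole input is one \p{Print} character.
    static boolean matchesPrint(String s) {
        return PRINT.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesPrint("a"));      // true
        System.out.println(matchesPrint("!"));      // true, punctuation is part of Graph
        System.out.println(matchesPrint(" "));      // true, the space \x20
        System.out.println(matchesPrint("\u00e9")); // false, e-acute is not US-ASCII
        System.out.println(matchesPrint("\u0000")); // false, control character
    }
}
```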
However, further down the documentation, we can find:
The following Predefined Character classes and POSIX character classes are in conformance with the recommendation of Annex C: Compatibility Properties of Unicode Regular Expression, when UNICODE_CHARACTER_CLASS flag is specified.
\p{Cntrl} A control character: \p{gc=Cc}
So, make sure to pass Pattern.UNICODE_CHARACTER_CLASS as the second argument, i.e. use Pattern.compile("\\p{Cntrl}", Pattern.UNICODE_CHARACTER_CLASS).matcher(value).replaceAll("")
It also seems that \p{Print} would work with that flag, but if you specifically want to remove control characters, Cntrl is the right class.
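To illustrate why the two are not interchangeable, here is a small sketch comparing removal of \p{Cntrl} with removal of everything outside \p{Print} (written \P{Print}), both with the Unicode flag. The class and method names are hypothetical, and the sample string is invented for the comparison.

```java
import java.util.regex.Pattern;

class StripDemo {
    static final Pattern CNTRL =
            Pattern.compile("\\p{Cntrl}", Pattern.UNICODE_CHARACTER_CLASS);
    // \P{...} (capital P) is the negation of \p{...}
    static final Pattern NOT_PRINT =
            Pattern.compile("\\P{Print}", Pattern.UNICODE_CHARACTER_CLASS);

    // Hypothetical helper: removes control characters (gc=Cc) only.
    static String stripControl(String s) {
        return CNTRL.matcher(s).replaceAll("");
    }

    // Hypothetical helper: removes everything that is not printable.
    static String stripNonPrintable(String s) {
        return NOT_PRINT.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        // "cafe" with an e-acute, a NUL, " ok", then U+2028 LINE SEPARATOR,
        // which is a separator (gc=Zl), not a control character.
        String value = "caf\u00e9\u0000 ok\u2028";
        System.out.println(stripControl(value).length());      // 8, U+2028 is kept
        System.out.println(stripNonPrintable(value).length()); // 7, U+2028 is removed too
    }
}
```

Both keep the accented letter; the difference is that \P{Print} also drops characters such as line separators that are not control characters.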
Demonstration
import java.util.regex.Pattern;

class Main {
    public static void main(String[] args) {
        // C0 control characters are U+0000 to U+001F inclusive; this string ends with 3 of them
        String value = "Áéü œ, Š, Ÿ œ\u0000\u0010\u001f";
        String needle = "\\p{Cntrl}";
        // With UNICODE_CHARACTER_CLASS, \p{Cntrl} means \p{gc=Cc}
        Pattern pattern = Pattern.compile(needle, Pattern.UNICODE_CHARACTER_CLASS);
        String result = pattern.matcher(value).replaceAll("");
        System.out.println(value.length()); // Prints 16
        System.out.println(result.length()); // Prints 13
    }
}
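If you would rather not compile a Pattern explicitly, the same flag can be enabled inside the regex with the embedded flag expression (?U), which also works with String.replaceAll. A minimal sketch; the class name EmbeddedFlagDemo and the helper stripControl are hypothetical.

```java
class EmbeddedFlagDemo {
    // (?U) turns on UNICODE_CHARACTER_CLASS for the rest of the expression.
    static String stripControl(String s) {
        return s.replaceAll("(?U)\\p{Cntrl}", "");
    }

    public static void main(String[] args) {
        // Three accented letters followed by three control characters
        String value = "\u00c1\u00e9\u00fc\u0000\u0010\u001f";
        System.out.println(value.length());               // 6
        System.out.println(stripControl(value).length()); // 3
    }
}
```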