The regexp "\\P{Print}" used in Java deletes extended-Latin characters

Question

I see that the following line in Java deletes extended-Latin characters, which it should not do:

String finalStr = value.replaceAll("\\P{Print}", " ");

The \\P{Print} regexp is meant to delete non-printable characters below code 32. But the extended-Latin characters below are printable, so they should be kept.

Áéü œ, Š, Ÿ œ => , ,

In my case, I need to delete all non-printable characters below code 32, but keep all other characters, including extended Latin.

Please look at the documentation of [`java.util.regex.Pattern`](https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/util/regex/Pattern.html) to see the definition of `\p{Print}`. By default it only includes characters in US-ASCII. The characters you mention aren't in ASCII, so `\P{Print}` matches all non-printable ASCII characters and **all** non-ASCII characters. If you want `\P{Print}` to use the Unicode definition, you must add the UNICODE_CHARACTER_CLASS flag or add `(?U)` at the start of your pattern. That is: `"(?U)\\P{Print}"`. – Mark Rotteveel, Apr 08 '23 at 07:43
Mark, I tried `\P{Print}` with `UNICODE_CHARACTER_CLASS` and it didn't work. What did work was `\p{Cntrl}`, i.e. anything outside of Control chars (0-32) – gene b., Apr 08 '23 at 13:12
Never mind, I see that `"(?U)\\P{Print}"` works, you're right. It's not exactly equivalent to `\\p{Cntrl}`, however: e.g. it will remove `65528` and `65534` from this list: `(65528,65529,65530,65531,65532,65533,65534)` whereas `\\p{Cntrl}` will keep them. So there's still a slight difference between `"(?U)\\P{Print}"` and `\\p{Cntrl}`. – gene b., Apr 09 '23 at 16:23
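
The difference described in the last comment can be checked with a small sketch. The class and method names (`SpecialsBlockDemo`, `surviving`) are made up for illustration and are not from the thread. It relies on U+FFF8 and U+FFFE being unassigned code points, which the Unicode-aware `\p{Print}` excludes, while `\p{Cntrl}` only covers control characters.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class SpecialsBlockDemo {
  // Unicode-aware "not printable" and "control character" classes.
  static final Pattern NOT_PRINT = Pattern.compile("(?U)\\P{Print}");
  static final Pattern CNTRL = Pattern.compile("(?U)\\p{Cntrl}");

  // Code points of value that remain after deleting every match of pattern.
  static int[] surviving(Pattern pattern, String value) {
    return pattern.matcher(value).replaceAll("").codePoints().toArray();
  }

  public static void main(String[] args) {
    // Build the string U+FFF8 .. U+FFFE, i.e. 65528 .. 65534.
    StringBuilder sb = new StringBuilder();
    for (int cp = 0xFFF8; cp <= 0xFFFE; cp++) {
      sb.append((char) cp);
    }
    String value = sb.toString();

    // [65529, 65530, 65531, 65532, 65533]
    System.out.println(Arrays.toString(surviving(NOT_PRINT, value)));
    // [65528, 65529, 65530, 65531, 65532, 65533, 65534]
    System.out.println(Arrays.toString(surviving(CNTRL, value)));
  }
}
```

Note that 65533 (U+FFFD) survives both patterns, because it is an assigned symbol rather than a control character.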

njzk2 · Answer 1 · 2023-04-07T18:48:39.367

According to the documentation, by default those classes apply to US-ASCII only.

Specifically:

\p{Lower}   A lower-case alphabetic character: [a-z]
\p{Upper}   An upper-case alphabetic character: [A-Z]
\p{Alpha}   An alphabetic character: [\p{Lower}\p{Upper}]
\p{Digit}   A decimal digit: [0-9]
\p{Alnum}   An alphanumeric character: [\p{Alpha}\p{Digit}]
\p{Punct}   Punctuation: One of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
\p{Graph}   A visible character: [\p{Alnum}\p{Punct}]
\p{Print}   A printable character: [\p{Graph}\x20]

Putting it all together, \p{Print} matches only ASCII letters, digits, punctuation, and the space character, i.e. [\x20-\x7E].
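
As a quick sanity check of that range, here is a minimal sketch that tests every code point from 0 to 255 against the default (flag-less) `\p{Print}`. The class and method names (`AsciiPrintRange`, `isPrint`) are made up for illustration.

```java
import java.util.regex.Pattern;

class AsciiPrintRange {
  // Default \p{Print}: no flags, so the US-ASCII definition applies.
  static final Pattern PRINT = Pattern.compile("\\p{Print}");

  // True when the given code point is matched by the default \p{Print}.
  static boolean isPrint(int codePoint) {
    return PRINT.matcher(new String(Character.toChars(codePoint))).matches();
  }

  public static void main(String[] args) {
    int count = 0;
    for (int cp = 0; cp <= 0xFF; cp++) {
      if (isPrint(cp)) {
        count++;
      }
    }
    System.out.println(count);         // 95, i.e. 0x20 through 0x7E
    System.out.println(isPrint('!'));  // true: punctuation is included
    System.out.println(isPrint(0xE9)); // false: e-acute is outside ASCII
    System.out.println(isPrint(0x1F)); // false: control character
  }
}
```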

However, further down the documentation, we can find:

The following Predefined Character classes and POSIX character classes are in conformance with the recommendation of Annex C: Compatibility Properties of Unicode Regular Expression, when UNICODE_CHARACTER_CLASS flag is specified.

\p{Cntrl}   A control character: \p{gc=Cc}

So, make sure to pass Pattern.UNICODE_CHARACTER_CLASS, and use Pattern.compile("\\p{Cntrl}", Pattern.UNICODE_CHARACTER_CLASS).matcher(value).replaceAll("").

It also seems that \p{Print} would work with that flag, but if you specifically want to remove control characters, Cntrl is the right class.
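
To illustrate the \p{Print} variant, here is a rough sketch comparing the original ASCII-only pattern with the Unicode-aware one, enabled either by the flag or by the inline `(?U)` prefix. The class and method names (`UnicodePrintDemo`, `blankOut`) are made up for illustration.

```java
import java.util.regex.Pattern;

class UnicodePrintDemo {
  // ASCII-only default, as used in the question.
  static final Pattern ASCII_ONLY = Pattern.compile("\\P{Print}");
  // Unicode definition, enabled through the compile flag.
  static final Pattern BY_FLAG =
      Pattern.compile("\\P{Print}", Pattern.UNICODE_CHARACTER_CLASS);
  // Unicode definition, enabled through the inline (?U) flag.
  static final Pattern BY_INLINE = Pattern.compile("(?U)\\P{Print}");

  // Replaces every match of pattern with a single space.
  static String blankOut(Pattern pattern, String value) {
    return pattern.matcher(value).replaceAll(" ");
  }

  public static void main(String[] args) {
    // A-acute, e-acute, u-diaeresis, then U+0007 (BEL, a control character).
    String value = "\u00C1\u00E9\u00FC\u0007";

    // ASCII-only: all four characters are replaced by spaces.
    System.out.println("[" + blankOut(ASCII_ONLY, value) + "]");
    // Unicode-aware: the three letters are kept, only BEL becomes a space.
    System.out.println("[" + blankOut(BY_FLAG, value) + "]");
    // Both ways of enabling the flag give the same result.
    System.out.println(
        blankOut(BY_FLAG, value).equals(blankOut(BY_INLINE, value))); // true
  }
}
```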

Demonstration

import java.util.regex.Pattern;

class Main {
  public static void main(String[] args) {
    // Control characters below code 32 are 00 to 1F inclusive; this string has 3 of them
    String value = "Áéü œ, Š, Ÿ œ\u0000\u0010\u001f";
    String needle = "\\p{Cntrl}";
    // With the flag, \p{Cntrl} uses the Unicode definition \p{gc=Cc}
    Pattern pattern = Pattern.compile(needle, Pattern.UNICODE_CHARACTER_CLASS);
    // Delete every control character
    String result = pattern.matcher(value).replaceAll("");
    System.out.println(value.length()); // Prints 16
    System.out.println(result.length()); // Prints 13
  }
}

edited Apr 07 '23 at 18:48

answered Apr 07 '23 at 17:41

So what would be a good way to filter out non-printable Chars < Code 32 only? – gene b. Apr 07 '23 at 17:42
Actually, the documentation has a bit more to say about it. I'm editing my answer – njzk2 Apr 07 '23 at 17:45
This doesn't remove the non-printable <32 characters in the string: `String finalStr = Pattern.compile("\\p{Cntrl}", Pattern.UNICODE_CHARACTER_CLASS).matcher(value).replaceAll("");` – gene b. Apr 07 '23 at 18:29
As far as I can tell, it does. I'm adding a short example to demonstrate that – njzk2 Apr 07 '23 at 18:45
Yes, my mistake, sorry. I have a special char with code 65533 which is non-printable but technically falls outside 0-32, and this wasn't catching it. Is there a way to exclude both Control Chars <32 and anything beyond the end of Latin-Extended or beyond the end of current languages, so that something like 65533 is excluded? – gene b. Apr 07 '23 at 18:54
Also, why doesn't UNICODE_CHARACTER_CLASS work with the prior `\\P{Print}`? `String finalStr = Pattern.compile("\\P{Print}", Pattern.UNICODE_CHARACTER_CLASS).matcher(value).replaceAll("");` should work the same way, with an extended character class, right? But this one doesn't delete the special char 65533. – gene b. Apr 07 '23 at 19:04
I can only point you at the documentation, which shows that Print and Cntrl are defined differently. – njzk2 Apr 07 '23 at 19:06
OK. Please re-open the question, since no other thread is available that deals with this issue specifically (the other thread that was suggested doesn't mention this issue). – gene b. Apr 07 '23 at 19:06
You may want to look at the `IsLatin` class, although I don't know if it includes the extended range. I can only vote to reopen the question; the person who closed it, or other users, may choose to do so as well. I recommend you edit your question to include the information you've provided in comments here, so that it has a better chance of being re-opened. – njzk2 Apr 07 '23 at 19:11
Alternatively, the UNICODE_CHARACTER_CLASS can also be enabled by prefixing the pattern with `(?U)`, e.g. `"(?U)\\P{Print}"`. – Mark Rotteveel Apr 08 '23 at 07:50
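
For the follow-up about character 65533 (U+FFFD), one possible approach, which nobody in the thread proposed, is to list that code point explicitly next to `\p{Cntrl}` in a character class. A minimal sketch, with made-up names (`StripControlsAndReplacementChar`, `strip`):

```java
import java.util.regex.Pattern;

class StripControlsAndReplacementChar {
  // Assumption for illustration: "unwanted" means control characters plus U+FFFD.
  static final Pattern UNWANTED = Pattern.compile("(?U)[\\p{Cntrl}\\uFFFD]");

  // Deletes every control character and every U+FFFD from value.
  static String strip(String value) {
    return UNWANTED.matcher(value).replaceAll("");
  }

  public static void main(String[] args) {
    // Three accented letters, a space, the oe ligature,
    // then U+0000, U+001F and U+FFFD.
    String value = "\u00C1\u00E9\u00FC \u0153\u0000\u001F\uFFFD";
    String result = strip(value);
    System.out.println(value.length());  // 8
    System.out.println(result.length()); // 5
  }
}
```

Other unwanted code points could be added to the same character class in the same way; which ones count as unwanted is a decision for the application.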
