I have files which contain non-printing characters such as `\u2066-\u2069` (directional formatting) and `\u2000-\u2009` (spaces of various widths). Is it possible to remove (or replace) them using a (Java) regex? (`\\s+` does not work with the above.) I don't want to build the character list myself, as I don't know what characters I might get.

- Did you try `s.replaceAll("(?U)\\s+", "")`? – Wiktor Stribiżew Nov 01 '19 at 09:31
- Do you need to remove all Unicode whitespace? – Wiktor Stribiżew Nov 01 '19 at 09:34
- In Java you can also try [`\P{Graph}+`](https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html) for one or more *non-visible* characters, the same as `[^\p{Alnum}\p{Punct}]`. There is also `\P{Print}` for *non-printable* characters. – bobble bubble Nov 01 '19 at 10:52
- Thanks @bobble bubble. I suggest you post this as an answer. – peter.murray.rust Nov 01 '19 at 15:16
2 Answers
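The space characters you listed (`\u2000-\u2009`) belong to the Separator, space Unicode category, so you may use
s = s.replaceAll("\\p{Zs}+", " ");
The `Zs` Unicode category stands for space separators of any kind (see more category names in the documentation). The directional formatting characters `\u2066-\u2069` are in the Format (`Cf`) category instead, so match those with `\p{Cf}`.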
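As a minimal sketch of the category-based approach, the snippet below removes format characters (`\p{Cf}`, which covers `\u2066-\u2069`) and collapses runs of space separators (`\p{Zs}`) into one ASCII space. The class and method names are hypothetical, and it assumes Java 9 or later, since older JDKs ship Unicode data that, as far as I know, predates `\u2066-\u2069`.

```java
class InvisibleCharCleaner {
    // Hypothetical helper: drops format characters (Cf) and
    // collapses each run of space separators (Zs) into one ASCII space.
    static String clean(String s) {
        return s.replaceAll("\\p{Cf}+", "")
                .replaceAll("\\p{Zs}+", " ");
    }

    public static void main(String[] args) {
        // EM SPACE + THIN SPACE between "a" and "b"; isolate controls around "c".
        String input = "a\u2003\u2009b\u2066c\u2069";
        System.out.println(clean(input)); // prints "a bc"
    }
}
```

Keeping the two categories in separate calls makes it easy to treat them differently: format characters are deleted outright, while spaces are replaced so that words do not run together.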
To replace each run of horizontal whitespace with a single regular ASCII space, you may use
s = s.replaceAll("\\h+", " ");
As per the Java regex documentation, `\h` is a horizontal whitespace character: `[ \t\xA0\u1680\u180e\u2000-\u200a\u202f\u205f\u3000]`
If you want to shrink each run of Unicode whitespace to a single space, use
s = s.replaceAll("(?U)\\s+", " ");
The `(?U)` is an embedded flag option equal to the `Pattern.UNICODE_CHARACTER_CLASS` option passed to the `Pattern.compile` method. Without it, `\s` matches what `\p{Space}` matches, i.e. `[ \t\n\x0B\f\r]`. Once you pass `(?U)`, it will start matching all whitespace chars in the Unicode table.
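To make the effect of the flag concrete, here is a small sketch that compiles `\s+` once without and once with `Pattern.UNICODE_CHARACTER_CLASS` and applies both to the same input. The class, field, and method names are made up for illustration.

```java
import java.util.regex.Pattern;

class UnicodeWhitespaceDemo {
    // Default \s: only [ \t\n\x0B\f\r].
    static final Pattern ASCII_WS = Pattern.compile("\\s+");
    // Same pattern with the flag that (?U) switches on inline.
    static final Pattern UNICODE_WS =
            Pattern.compile("\\s+", Pattern.UNICODE_CHARACTER_CLASS);

    // Hypothetical helper: replaces every match of p with one ASCII space.
    static String collapse(Pattern p, String s) {
        return p.matcher(s).replaceAll(" ");
    }

    public static void main(String[] args) {
        String input = "one\u2003\u2009two\t three";
        // Without the flag, the EM SPACE and THIN SPACE are left untouched.
        System.out.println(collapse(ASCII_WS, input).equals("one\u2003\u2009two three")); // true
        // With the flag, every whitespace run becomes a single space.
        System.out.println(collapse(UNICODE_WS, input).equals("one two three")); // true
    }
}
```

Precompiling the pattern is a design choice that pays off when many strings are cleaned; for a one-off call, `s.replaceAll("(?U)\\s+", " ")` does the same thing.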
To tokenize a string, you may split directly with any one of the following (they are alternatives, not a sequence):
String[] tokens = s.split("\\p{Zs}+");
String[] tokens = s.split("\\h+");
String[] tokens = s.split("(?U)\\s+");
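One detail worth keeping in mind when splitting: `String.split` drops trailing empty strings but keeps a leading one, so input that starts with whitespace yields an empty first token. The sketch below works around this by stripping leading whitespace first; `UnicodeTokenizer` and `tokenize` are hypothetical names, not part of any library.

```java
import java.util.Arrays;

class UnicodeTokenizer {
    // Hypothetical helper: splits on runs of Unicode whitespace,
    // ignoring leading and trailing whitespace.
    static String[] tokenize(String s) {
        // Drop leading whitespace; otherwise split() returns an empty first token.
        String trimmed = s.replaceFirst("(?U)^\\s+", "");
        // "".split(...) returns one empty token, so handle blank input explicitly.
        return trimmed.isEmpty() ? new String[0] : trimmed.split("(?U)\\s+");
    }

    public static void main(String[] args) {
        // EM SPACE, THIN SPACE, IDEOGRAPHIC SPACE, then a trailing ASCII space.
        String input = "\u2003alpha\u2009beta\u3000gamma ";
        System.out.println(Arrays.toString(tokenize(input))); // [alpha, beta, gamma]
    }
}
```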

There is also a POSIX-like class, the counterpart of `[^[:graph:]]`. For one or more non-visible characters, try
\P{Graph}+
The uppercase `P` indicates a negation of `\p{Graph}`, so this matches one or more `[^\p{Alnum}\p{Punct}]` or `[\p{Z}\p{C}]`. The downside is that it is US-ASCII only according to the manual. If you are working with Unicode text, consider using the inline flag `(?U)` or `UNICODE_CHARACTER_CLASS`.
Just to mention, there is also `\P{Print}` available for non-printable characters.
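The following sketch illustrates why the US-ASCII limitation matters: without `(?U)`, `\P{Graph}` also matches non-ASCII letters such as `é`, so they are replaced along with the invisible characters. The class and method names are hypothetical.

```java
class GraphClassDemo {
    // ASCII-only \p{Graph}: anything outside ASCII letters, digits,
    // and punctuation counts as "non-visible".
    static String asciiOnly(String s) {
        return s.replaceAll("\\P{Graph}+", " ");
    }

    // With (?U), \p{Graph} follows the Unicode definition, so accented letters survive.
    static String unicodeAware(String s) {
        return s.replaceAll("(?U)\\P{Graph}+", " ");
    }

    public static void main(String[] args) {
        // "café", an EM SPACE, then "au lait".
        String input = "caf\u00e9\u2003au lait";
        System.out.println(asciiOnly(input).equals("caf au lait"));          // true
        System.out.println(unicodeAware(input).equals("caf\u00e9 au lait")); // true
    }
}
```

Note that, per the `Pattern` documentation, the Unicode-aware `\p{Graph}` excludes only whitespace, control, surrogate, and unassigned characters, so format characters such as `\u2066-\u2069` would likely need a separate `\p{Cf}` pass.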
