35

I'm looking for a regular expression in Java which matches all whitespace characters in a String. "\s" matches only some of them; it does not match &nbsp; and similar non-ASCII whitespace. I'm looking for a regular expression which matches all (common) whitespace characters which can occur in a Java String.

[Edit]

To clarify: I do not mean the string sequence "&nbsp;"; I mean the single Unicode character U+00A0 that is often represented by "&nbsp;", e.g. in HTML, and all other Unicode characters with a similar whitespace meaning, e.g. "NARROW NO-BREAK SPACE" (U+202F), the word joiner (encoded in Unicode 3.2 and above as U+2060), "ZERO WIDTH NO-BREAK SPACE" (U+FEFF), and any other character that can be regarded as whitespace.

[Answer]

For my purpose, i.e. catching all whitespace characters, Unicode plus traditional, the following expression does the job:

[\p{Z}\s]

The answer is in the comments below but since it is a bit hidden I repeat it here.
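
A minimal sketch of how the expression above could be used from Java code. The class name and the `normalizeSpaces` helper are made up for illustration; only `Pattern` and the character class itself come from the discussion. The two halves are both needed: `\p{Z}` covers the Unicode separator categories, while `\s` (in its default mode) covers tab, line feed and the other ASCII whitespace.

```java
import java.util.regex.Pattern;

class AllWhitespaceDemo {
    // \p{Z} = Unicode separators (Zs, Zl, Zp); \s = [ \t\n\x0B\f\r] by default.
    static final Pattern ANY_SPACE = Pattern.compile("[\\p{Z}\\s]");

    // Hypothetical helper: replaces every matched character with a plain space.
    static String normalizeSpaces(String input) {
        return ANY_SPACE.matcher(input).replaceAll(" ");
    }

    public static void main(String[] args) {
        // NO-BREAK SPACE, NARROW NO-BREAK SPACE, tab and line feed in one string.
        String sample = "a\u00A0b\u202Fc\td\ne";
        System.out.println(normalizeSpaces(sample)); // prints "a b c d e"
    }
}
```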

Carsten

7 Answers

41

&nbsp; is not a whitespace character as far as regexes are concerned. You need to either modify the regex to match those strings in addition to \s, like /(\s|&nbsp;|%20)/, or first parse the string contents to get the ASCII or Unicode representation of the data.

You are mixing abstraction levels here.

If, as a careful reread of the question suggests, you are after a way to match all whitespace characters, meaning standard ASCII whitespace plus the Unicode whitespace code points, \p{Z} or \p{Zs} will do the job.

You should really clarify your question, because it has misled a lot of people (and even earned the correct answer some downvotes).
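
To see how the two properties differ, here is a small sketch (the class and the `matches` helper are invented for this example). It assumes the default `\s` without `UNICODE_CHARACTER_CLASS`. `\p{Zs}` is only the "space separator" category, while `\p{Z}` also includes the line and paragraph separators; neither matches a tab, which is a control character.

```java
import java.util.regex.Pattern;

class SeparatorCategories {
    // Hypothetical helper: does the regex match this single character?
    static boolean matches(String regex, char c) {
        return Pattern.matches(regex, String.valueOf(c));
    }

    public static void main(String[] args) {
        char nbsp = '\u00A0';    // NO-BREAK SPACE, category Zs
        char lineSep = '\u2028'; // LINE SEPARATOR, category Zl
        char tab = '\t';         // control character, category Cc

        System.out.println(matches("\\p{Zs}", nbsp));    // true
        System.out.println(matches("\\p{Zs}", lineSep)); // false
        System.out.println(matches("\\p{Z}", lineSep));  // true
        System.out.println(matches("\\p{Z}", tab));      // false
        System.out.println(matches("\\s", tab));         // true
        System.out.println(matches("\\s", nbsp));        // false
    }
}
```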

Vinko Vrsalovic
  • `\p{javaWhitespace}` does not seem to match `&nbsp` (U+00A0). – Carsten Nov 30 '09 at 23:58
  • 12
    Use `\p{Z}` or `\p{Zs}` instead. I've tested it in Java, and they do match U+00A0. – Alan Moore Dec 01 '09 at 00:02
  • But.. That's undocumented? http://java.sun.com/javase/6/docs/api/java/util/regex/Pattern.html – BalusC Dec 01 '09 at 00:14
  • Yes and no, it's not on the javadoc, but it is in O'Reilly "Mastering regular expressions": http://books.google.com.au/books?id=ucwR4KIvExMC&pg=PA119&lpg=PA119&dq=regular+expression+unicode+whitespace+%22\p{Z}%22&source=bl&ots=QLyHsY8SOl&sig=PaWtJRDUkGIfNh7AALy6OdOwMA0&hl=en&ei=TGIUS6vlCcGIkAXvyOyuAw&sa=X&oi=book_result&ct=result&resnum=1&ved=0CAgQ6AEwAA#v=onepage&q=regular%20expression%20unicode%20whitespace%20%22\p{Z}%22&f=false – Carsten Dec 01 '09 at 00:25
  • No, best would be if Alan reposts his comment as answer. – BalusC Dec 01 '09 at 00:57
  • Andomar mentioned `\p{Z}` first, in a comment under his own answer. – Alan Moore Dec 01 '09 at 02:29
  • 1
    A good reference on `\p{Z}` and similar stuff is here: http://www.regular-expressions.info/unicode.html – Mike Aug 05 '15 at 16:26
  • @BalusC Actually it is almost documented. `The supported categories are those of The Unicode Standard in the version specified by the Character class.` which for 1.6 was version 4.0 , and section 2.4 has table 2-2 listing character class designations. Zs is listed, Z is not listed, but I suspect Z probably is supported for back compatibility with prior unicode versions, but I'm not going to bother looking up the prior versions of the unicode spec to check that... :) – Gus Aug 20 '16 at 16:14
12

You clarified the question the way I expected: you're actually not looking for the String literal &nbsp;, as many here seem to think, for which the solution is too obvious.

Well, unfortunately, there's no way to match them using regex. Best is to include the particular codepoints in the pattern, for example: "[\\s\\xA0]".
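
A rough sketch of the explicit-code-point approach, splitting a string on runs of whitespace. The class name and the `tokens` helper are placeholders, and the choice of which code points to list is up to you; more could be appended to the class in the same way (e.g. `\\u202F`).

```java
import java.util.Arrays;

class ExplicitCodePoints {
    // \s plus the explicitly listed NO-BREAK SPACE (U+00A0).
    static final String SPACE_CLASS = "[\\s\\xA0]";

    // Hypothetical helper: split on one or more of the listed characters.
    static String[] tokens(String input) {
        return input.split(SPACE_CLASS + "+");
    }

    public static void main(String[] args) {
        String sample = "foo\u00A0bar baz\tqux";
        System.out.println(Arrays.toString(tokens(sample))); // [foo, bar, baz, qux]
    }
}
```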

Edit: as it turned out in one of the comments, you could use the undocumented "\\p{Z}" for this. Alan, can you please leave a comment on how you found that out? This one is quite useful.

BalusC
  • 3
    It's one of the (many) standard Unicode property shorthands. They're mentioned in the Pattern API docs, though this one isn't among the examples. Here's a good overview: http://www.regular-expressions.info/unicode.html#prop But it's not as useful as it could be: it doesn't match linefeeds, tabs or (apparently) any other ASCII whitespace except the space (U+0020). Maybe that's why you never heard of it. :) – Alan Moore Dec 01 '09 at 02:46
  • Thanks for the overview. I really didn't expect that the undocumented ones would also work in Java's regex engine. That would mean that the API doc is incomplete (which I really wouldn't expect from the Sun guys, you know). – BalusC Dec 01 '09 at 21:35
  • Annoying that `\s` doesn't match `\xA0` -______________________- – ThorSummoner Aug 26 '14 at 20:28
11

&nbsp; is only whitespace in HTML. Use an HTML parser to extract the plain text, and \s should work just fine.

Andomar
  • The `&nbsp;` generates `\u00A0` in the end. – BalusC Nov 30 '09 at 22:18
  • @BalusC: yes, but it's important that any sane definition of "whitespace character" in the context of regex can only include the U+00A0 that is produced "in the end", and can never include the literal "`&nbsp;`". That's what the "You are mixing abstraction levels here" of Vinko's answer is about (if I understood it correctly). – Joachim Sauer Nov 30 '09 at 22:21
  • 16
    @BalusC: Didn't know the HTML parser did that. You could use `\p{Z}` instead of `\s` to match whitespace; it will match `\u00A0` – Andomar Nov 30 '09 at 22:25
  • @Joachim: Yes, and the "in the end" part can also produce something different from a Unicode code point, depending on various factors (mostly on who's interpreting the literal &nbsp;). – Vinko Vrsalovic Nov 30 '09 at 22:29
4

In case anyone runs into this question again looking for help, I suggest pursuing the following answer: https://stackoverflow.com/a/6255512/1678392

The short version: \\p{javaSpaceChar}

Why: per the Pattern class documentation, this maps to the Character.isSpaceChar method:

Categories that behave like the java.lang.Character boolean ismethodname methods (except for the deprecated ones) are available through the same \p{prop} syntax where the specified property has the name javamethodname.
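
As a quick illustration of that mapping (the class name and `matches` helper are made up here), the sketch below compares `\p{javaSpaceChar}` with `\p{javaWhitespace}`. They mirror `Character.isSpaceChar` and `Character.isWhitespace`, which disagree on exactly the characters discussed in this thread: the former accepts U+00A0 but not tab, the latter the other way round.

```java
import java.util.regex.Pattern;

class JavaPropertyClasses {
    // Hypothetical helper: does the regex match this single character?
    static boolean matches(String regex, char c) {
        return Pattern.matches(regex, String.valueOf(c));
    }

    public static void main(String[] args) {
        char nbsp = '\u00A0';

        // \p{javaSpaceChar} follows Character.isSpaceChar (Zs, Zl, Zp).
        System.out.println(matches("\\p{javaSpaceChar}", nbsp));  // true
        System.out.println(matches("\\p{javaSpaceChar}", '\t'));  // false

        // \p{javaWhitespace} follows Character.isWhitespace,
        // which deliberately excludes the non-breaking spaces.
        System.out.println(matches("\\p{javaWhitespace}", nbsp)); // false
        System.out.println(matches("\\p{javaWhitespace}", '\t')); // true

        // Combining both in one character class covers either definition.
        String both = "[\\p{javaSpaceChar}\\p{javaWhitespace}]";
        System.out.println(matches(both, nbsp) && matches(both, '\t')); // true
    }
}
```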

👍

nikodaemus
3

Click here for a summary I made of several competing definitions of "whitespace".

You might end up having to explicitly list the additional ones you care about that aren't matched by one of the prefab ones.

Kevin Bourrillion
  • The Guava library references this list as a "comparison of several definitions of 'whitespace'" ([source](http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/base/CharMatcher.html#WHITESPACE)). However, Kevin, you should cite your sources. Also, I wonder what the asterisk on the column "StreamTokenizer; String.trim()" is for. And what is the first char listed, something "(00-08)"? – Martin Andersson Apr 04 '13 at 19:06
2

&nbsp; is not whitespace. It is a character entity that represents whitespace in HTML. You most likely want to convert HTML-encoded text into plain text before running your string match against it. If that is the case, go look up javax.swing.text.html

Alan Moore
Zak
0

The regex characters are the only ones independent of encoding. Here is a list of some characters which - in Unicode - are non-printing:

How many non-printing characters are in common use?

Community
peter.murray.rust