Why is non-breaking space not a whitespace character in Java?

Question

While searching for a proper way to trim non-breaking spaces from parsed HTML, I first stumbled on Java's spartan definition of String.trim(), which is at least properly documented. I wanted to avoid explicitly listing the characters eligible for trimming, so I assumed that the Unicode-backed methods on the Character class would do the job for me.

That's when I discovered that Character.isWhitespace(char) explicitly excludes non-breaking spaces:

It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', '\u2007', '\u202F').

Why is that?

The corresponding .NET implementation is less discriminating.

score 23 · Accepted Answer · answered Jun 29 '09 at 21:50

Character.isWhitespace(char) is old. Really old. Many things done in the early days of Java followed conventions and implementations from C.

Now, more than a decade later, these things seem erroneous. Consider it evidence of how far things have come, even between the first days of Java and the first days of .NET.

Java strives to be 100% backward compatible. So even if the Java team thought it would be good to fix their initial mistake and add non-breaking spaces to the set of characters that returns true from Character.isWhitespace(char), they can't, because there almost certainly exists software that relies on the current implementation working exactly the way it does.

answered Jun 29 '09 at 21:50

Steve McLeod


Regarding backward compatibility: I agree, but there is no reason not to add, say, Character.isWhitespaceNew(char) to capture the current situation. – Jirka Oct 15 '12 at 18:42
And down the other road lies, well, Java. A language that blazed the trail for those that followed (who learned from its mistakes), but why anyone would voluntarily use it if they had other options is beyond my comprehension. – Eloff May 10 '13 at 14:17
It is still in the language because of backward compatibility, but that doesn't explain why it was originally like that. – Verneri Åberg Dec 20 '13 at 10:25
@Jirka well, they did add it, except it's called Character.isSpaceChar(char); it doesn't include line breaks though – aditsu quit because SE is EVIL Sep 04 '17 at 13:01

score 16 · Answer 2 · answered Sep 17 '09 at 10:58

Since Java 5 there is also an isSpaceChar(int) method. Does that not do what you want?

Determines if the specified character (Unicode code point) is a Unicode space character. A character is considered to be a space character if and only if it is specified to be a space character by the Unicode standard. This method returns true if the character's general category type is any of the following: ...

answered Sep 17 '09 at 10:58

Jesper


It's not so much the existence of such a method that the OP was looking for; but rather a `trim`-type function that *uses* that method to determine what to strip. – Andrzej Doyle Sep 17 '09 at 11:00
Note that there is also a `isSpaceChar(char)` method – Mmmh mmh Aug 13 '14 at 12:18
The isSpaceChar() method does not include latin white space (tab, for example). – zendu Feb 15 '18 at 09:39
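
The comments above point at two gaps: the thread still lacks a `trim`-style function, and isSpaceChar on its own misses tab and line feed. A minimal sketch, assuming you want to strip every character accepted by either Character.isWhitespace(int) or Character.isSpaceChar(int). The names UnicodeTrim, isAnySpace, and trim are hypothetical and not part of the JDK:

```java
final class UnicodeTrim {
    // Hypothetical predicate: true for anything either Character method accepts,
    // so tab and line feed count as well as U+00A0, U+2007 and U+202F.
    static boolean isAnySpace(int codePoint) {
        return Character.isWhitespace(codePoint) || Character.isSpaceChar(codePoint);
    }

    // Removes leading and trailing characters matched by isAnySpace.
    // Interior characters are left untouched.
    static String trim(String s) {
        int start = 0;
        int end = s.length();
        while (start < end) {
            int cp = s.codePointAt(start);
            if (!isAnySpace(cp)) {
                break;
            }
            start += Character.charCount(cp);
        }
        while (end > start) {
            int cp = s.codePointBefore(end);
            if (!isAnySpace(cp)) {
                break;
            }
            end -= Character.charCount(cp);
        }
        return s.substring(start, end);
    }

    public static void main(String[] args) {
        String parsed = "\u00A0\t hello\u00A0world \u202F\n";
        System.out.println("[" + trim(parsed) + "]");
    }
}
```

The non-breaking space between "hello" and "world" survives, because only the two ends of the string are examined.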

score 14 · Answer 3 · edited Dec 28 '14 at 18:24

As posted above, isSpaceChar(int) will put the OP on a path to the answer. It seems fairly discreetly documented, but this method is actually usable with regexes. So:

    "X\u00A0X X".replaceAll("\\p{javaSpaceChar}", "_");

will produce a "X_X_X" string. It is left as an exercise for the reader to come up with the regex to trim a string. (Pattern with some flags should do the trick.)
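
One possible solution to that exercise, as a sketch rather than the answer's own code: it anchors a character class at the start (`\A`) and end (`\z`) of the input and needs no flags. Adding `\p{javaWhitespace}` to the class is an assumption on my part, so that tabs and line breaks are trimmed too. The names RegexTrim, SPACE, and EDGES are hypothetical:

```java
import java.util.regex.Pattern;

final class RegexTrim {
    // Anything accepted by Character.isSpaceChar or Character.isWhitespace.
    private static final String SPACE = "[\\p{javaSpaceChar}\\p{javaWhitespace}]";

    // A run of such characters at the very start or the very end of the input.
    private static final Pattern EDGES =
            Pattern.compile("\\A" + SPACE + "+|" + SPACE + "+\\z");

    static String trim(String s) {
        return EDGES.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println("[" + trim("\u00A0 X\u00A0X X\t\u00A0") + "]");
    }
}
```

Compiling the pattern once and reusing it avoids recompiling the regex on every call, which String.replaceAll would do.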

edited Dec 28 '14 at 18:24

törzsmókus


answered Jun 06 '11 at 17:04

Grégory Joseph


Works great, needs extra " -> "X\u00A0XX".replaceAll("\\p{javaSpaceChar}", "_")); – user85155 Aug 24 '11 at 11:23
\p{javaSpaceChar} does not seem to be documented anywhere. – zendu Feb 15 '18 at 09:40
@zendu - it is, albeit not very visibly: 1) https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#jcc : > Categories that behave like the java.lang.Character boolean ismethodname methods (except for the deprecated ones) are available through the same \p{prop} syntax where the specified property has the name javamethodname. 2) https://docs.oracle.com/javase/7/docs/api/java/lang/Character.html#isSpaceChar(char) – Grégory Joseph Feb 18 '18 at 13:06

richardtallent · Answer 4 · 2009-06-29T22:11:35.190

The only time a non-breaking space should be treated specially is with code designed to perform word-wrapping of text.

For all other purposes, including word counts, trimming, and general-purpose splitting along word boundaries, a non-breaking space is still whitespace.

Any argument that a non-breaking space just "looks like" a space but isn't one conflicts with the whole point of Unicode, which represents characters based on their meaning, not how they are displayed.

Thus, IMHO, the Java implementation of String.trim() is not performing as expected, and the underlying Character.isWhitespace() function is at fault.

My guess is that the Java implementors wrote isWhitespace() based on the need to perform text-wrapping within controls. They should have named this function isWordWrappingBoundary() or something clearer, and used a less restrictive whitespace test for trim().

String.trim() is even more broken than that. It just trims ASCII control characters, and no Unicode whitespace at all, breaking or not. — Thilo, Jun 30 '09 at 01:30
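
To make the comment above concrete: String.trim() removes leading and trailing characters whose code is at or below U+0020, which covers the ASCII space and control characters but nothing else. A small sketch, assuming Java 11 or later for String.strip() (which is based on Character.isWhitespace); the class TrimVsStrip and its describe method are hypothetical:

```java
final class TrimVsStrip {
    // Reports the length left after trim() and after strip() for one input.
    static String describe(String s) {
        return "trim=" + s.trim().length() + " strip=" + s.strip().length();
    }

    public static void main(String[] args) {
        // U+2003 EM SPACE on both sides: trim() keeps it, strip() removes it.
        System.out.println(describe("\u2003text\u2003")); // trim=6 strip=4

        // U+00A0 NO-BREAK SPACE on both sides: neither method removes it.
        System.out.println(describe("\u00A0text\u00A0")); // trim=6 strip=6

        // Plain ASCII space and tab: both methods remove them.
        System.out.println(describe(" \ttext\t "));       // trim=4 strip=4
    }
}
```

So even the newer strip() leaves a non-breaking space in place, which is the behaviour the question is about.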

score 6 · Answer 5 · answered Jun 29 '09 at 21:16

I would argue that Java's implementation is more correct than .NET's. The non-breaking space is essentially a non-whitespace character that looks like one. That is, if you have the strings "foo" and "bar", and put any traditional whitespace character in between them, you would get a word break. A non-breaking space, however, does not break the two up.

answered Jun 29 '09 at 21:16

Matt Poush


A non-breaking space is still a word boundary. The "breaking" in "non-breaking space" refers to how it should be interpreted for purposes of **line**-breaking, not word breaks. – richardtallent Jun 29 '09 at 22:20

score 2 · Answer 6 · answered Jun 29 '09 at 21:14

It looks like the method name (isWhitespace) is inconsistent with its function (to detect separators). The "separator" functionality is fairly clear if you look at the full list of characters from the Javadoc page you quoted:

* It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', '\u2007', '\u202F').
* It is '\u0009', HORIZONTAL TABULATION.
* It is '\u000A', LINE FEED.
* It is '\u000B', VERTICAL TABULATION.
* It is '\u000C', FORM FEED.
* It is '\u000D', CARRIAGE RETURN.
* It is '\u001C', FILE SEPARATOR.
* It is '\u001D', GROUP SEPARATOR.
* It is '\u001E', RECORD SEPARATOR.
* It is '\u001F', UNIT SEPARATOR.

A non-breaking space is meant to provide visual space between words without letting line-breaking or hyphenation algorithms separate them.
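
The list above can be checked directly. The sketch below prints, for each code point named in the Javadoc (plus the plain space U+0020 for comparison), what Character.isWhitespace and Character.isSpaceChar return. The class WhitespaceTable and its members are hypothetical names:

```java
final class WhitespaceTable {
    // Control characters from the Javadoc list, then plain space,
    // then the three non-breaking spaces that isWhitespace excludes.
    static final int[] CODE_POINTS = {
        0x0009, 0x000A, 0x000B, 0x000C, 0x000D,
        0x001C, 0x001D, 0x001E, 0x001F,
        0x0020, 0x00A0, 0x2007, 0x202F
    };

    // One line of the table for a single code point.
    static String row(int cp) {
        return String.format("U+%04X isWhitespace=%b isSpaceChar=%b",
                cp, Character.isWhitespace(cp), Character.isSpaceChar(cp));
    }

    public static void main(String[] args) {
        for (int cp : CODE_POINTS) {
            System.out.println(row(cp));
        }
    }
}
```

The control characters come out as whitespace but not as space characters, the three non-breaking spaces come out the other way round, and only the plain space satisfies both.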

score 2 · Answer 7 · edited Aug 18 '21 at 13:42

Also be cautious when using the Apache Commons function StringUtils.isBlank() (and related functions), which has the same strange isWhitespace behavior: a string consisting of a non-breaking space is considered non-blank.
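
If that matters for your input, one workaround is a small helper of your own. The sketch below is plain JDK code (Java 8 or later), not part of Apache Commons; the class LenientBlank is hypothetical, and treating null as blank is an assumption chosen to mirror StringUtils.isBlank(null):

```java
final class LenientBlank {
    // Hypothetical helper: null, empty, or made up only of characters accepted by
    // Character.isWhitespace or Character.isSpaceChar counts as blank.
    static boolean isBlank(CharSequence cs) {
        if (cs == null) {
            return true;
        }
        return cs.codePoints()
                 .allMatch(cp -> Character.isWhitespace(cp) || Character.isSpaceChar(cp));
    }

    public static void main(String[] args) {
        System.out.println(isBlank("\u00A0"));        // true
        System.out.println(isBlank(" \t\u202F\n"));   // true
        System.out.println(isBlank("\u00A0x"));       // false
    }
}
```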

edited Aug 18 '21 at 13:42

Peter Mortensen


answered Jul 20 '11 at 12:13

Maze

