33

While searching for a proper way to trim non-breaking spaces from parsed HTML, I first stumbled on Java's spartan definition of String.trim(), which is at least properly documented. I wanted to avoid explicitly listing the characters eligible for trimming, so I assumed that the Unicode-backed methods on the Character class would do the job for me.

That's when I discovered that Character.isWhitespace(char) explicitly excludes non-breaking spaces:

It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', '\u2007', '\u202F').
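To make the quoted rule concrete, here is a minimal sketch that checks the three non-breaking spaces named in the Javadoc. The class name `NbspWhitespaceDemo` and the sample string are made up for illustration; only `Character.isWhitespace` and `String.trim` come from the JDK.

```java
class NbspWhitespaceDemo {
    // The three non-breaking spaces listed in the Javadoc excerpt.
    static final char[] NON_BREAKING = {'\u00A0', '\u2007', '\u202F'};

    // Illustrative sample: the word "text" wrapped in non-breaking spaces.
    static final String SAMPLE = "\u00A0text\u00A0";

    public static void main(String[] args) {
        for (char c : NON_BREAKING) {
            // Each of these reports isWhitespace=false.
            System.out.printf("U+%04X isWhitespace=%b%n", (int) c, Character.isWhitespace(c));
        }
        // An ordinary space is accepted.
        System.out.println(Character.isWhitespace(' '));
        // trim() only strips leading and trailing chars up to U+0020,
        // so both non-breaking spaces survive and the length stays 6.
        System.out.println(SAMPLE.trim().length());
    }
}
```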

Why is that?

The implementation of the corresponding .NET equivalent is less discriminating.

Peter Mortensen
Palimondo

7 Answers

23

Character.isWhitespace(char) is old. Really old. Many things done in the early days of Java followed conventions and implementations from C.

Now, more than a decade later, these things seem erroneous. Consider it evidence of how far things have come, even between the first days of Java and the first days of .NET.

Java strives to be 100% backward compatible. So even if the Java team thought it would be good to fix their initial mistake and add non-breaking spaces to the set of characters that returns true from Character.isWhitespace(char), they can't, because there almost certainly exists software that relies on the current implementation working exactly the way it does.

Steve McLeod
  • 3
    Regarding backward compatibility: I agree, but there is no reason not to add, say, Character.isWhitespaceNew(char) to capture the current situation. – Jirka Oct 15 '12 at 18:42
  • 7
    And down the other road lies, well, Java. A language that blazed the trail for those that followed (who learned from its mistakes), but why anyone would voluntarily use it if they had other options is beyond my comprehension. – Eloff May 10 '13 at 14:17
  • It is still in the language because of backward compatibility, but that doesn't explain why it was like that originally. – Verneri Åberg Dec 20 '13 at 10:25
  • 2
    @Jirka well, they did add it, except it's called Character.isSpaceChar(char); it doesn't include line breaks though – aditsu quit because SE is EVIL Sep 04 '17 at 13:01
16

Since Java 5 there is also an isSpaceChar(int) method. Does that not do what you want?

Determines if the specified character (Unicode code point) is a Unicode space character. A character is considered to be a space character if and only if it is specified to be a space character by the Unicode standard. This method returns true if the character's general category type is any of the following: ...
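As a rough illustration of how isSpaceChar(int) differs from isWhitespace(int), the sketch below prints both results for a few code points. The helper `isAnySpace` is hypothetical (it is not a JDK method) and simply accepts a code point if either predicate does, which covers non-breaking spaces as well as tabs and line breaks.

```java
class SpaceCharDemo {
    // Hypothetical helper: true if either JDK predicate accepts the code point.
    static boolean isAnySpace(int codePoint) {
        return Character.isWhitespace(codePoint) || Character.isSpaceChar(codePoint);
    }

    public static void main(String[] args) {
        // NBSP, plain space, tab, line feed, and a letter for contrast.
        int[] samples = {0x00A0, ' ', '\t', '\n', 'x'};
        for (int cp : samples) {
            System.out.printf("U+%04X isSpaceChar=%b isWhitespace=%b either=%b%n",
                    cp, Character.isSpaceChar(cp), Character.isWhitespace(cp), isAnySpace(cp));
        }
    }
}
```

Note that isSpaceChar rejects tab and line feed, since they are control characters rather than Unicode separators, so it is not a drop-in replacement for isWhitespace.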

Jesper
14

As posted above, isSpaceChar(int) puts the OP on track to an answer. It is rather discreetly documented, but this method can also be used in regexes. So:

    "X\u00A0X X".replaceAll("\\p{javaSpaceChar}", "_");

will produce a "X_X_X" string. It is left as an exercise for the reader to come up with the regex to trim a string. (Pattern with some flags should do the trick.)
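One possible take on that exercise, as a sketch rather than a definitive solution: anchor a character class at the start and at the end of the input and delete whatever matches. The class name `UnicodeTrim` and its `trim` method are made up for illustration. Combining `\p{javaSpaceChar}` with `\p{javaWhitespace}` is a design choice made here so that tabs and line breaks are trimmed too.

```java
import java.util.regex.Pattern;

class UnicodeTrim {
    // Characters accepted by Character.isSpaceChar or Character.isWhitespace.
    private static final String SPACE = "[\\p{javaSpaceChar}\\p{javaWhitespace}]";

    // A run of such characters at the very start (\A) or very end (\z) of the input.
    private static final Pattern EDGES = Pattern.compile("\\A" + SPACE + "+|" + SPACE + "+\\z");

    // Hypothetical helper: removes leading and trailing spaces, including non-breaking ones.
    static String trim(String s) {
        return EDGES.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        String trimmed = trim("\u00A0 text\t\n\u00A0");
        System.out.println("[" + trimmed + "]");
    }
}
```

Interior spaces are left alone, so `trim("a\u00A0b")` returns its argument unchanged.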

törzsmókus
Grégory Joseph
  • Works great, needs extra " -> "X\u00A0XX".replaceAll("\\p{javaSpaceChar}", "_")); – user85155 Aug 24 '11 at 11:23
  • 1
    \p{javaSpaceChar} does not seem to be documented anywhere. – zendu Feb 15 '18 at 09:40
  • 2
    @zendu - it is, albeit not very visibly: 1) https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#jcc : > Categories that behave like the java.lang.Character boolean ismethodname methods (except for the deprecated ones) are available through the same \p{prop} syntax where the specified property has the name javamethodname. 2) https://docs.oracle.com/javase/7/docs/api/java/lang/Character.html#isSpaceChar(char) – Grégory Joseph Feb 18 '18 at 13:06
7

The only time a non-breaking space should be treated specially is with code designed to perform word-wrapping of text.

For all other purposes, including word counts, trimming, and general-purpose splitting along word boundaries, a non-breaking space is still whitespace.

Any argument that a non-breaking space just "looks like" a space but isn't one conflicts with the whole point of Unicode, which represents characters based on their meaning, not how they are displayed.

Thus, IMHO, the Java implementation of String.trim() is not performing as expected, and the underlying Character.isWhitespace() function is at fault.

My guess is that the Java implementors wrote isWhitespace() based on the need to perform text-wrapping within controls. They should have named this function isWordWrappingBoundary() or something more clear, and used a less-restrictive whitespace test for trim().

richardtallent
  • 6
    String.trim() is even more broken than that. It just trims ASCII control characters, and no Unicode whitespace at all, breaking or not. – Thilo Jun 30 '09 at 01:30
6

I would argue that Java's implementation is more correct than .NET's. The non-breaking space is essentially a non-whitespace character that looks like one. That is, if you have the strings "foo" and "bar", and put any traditional whitespace character in between them, you would get a word break. A non-breaking space, however, does not break the two up.

Matt Poush
  • 5
    A non-breaking space is still a word boundary. The "breaking" in "non-breaking space" refers to how it should be interpreted for purposes of **line**-breaking, not word breaks. – richardtallent Jun 29 '09 at 22:20
2

It looks like the method name (isWhitespace) is inconsistent with its function (to detect separators). The "separator" functionality is fairly clear if you look at the full list of characters from the Javadoc page you quoted:

* It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', '\u2007', '\u202F').
* It is '\u0009', HORIZONTAL TABULATION.
* It is '\u000A', LINE FEED.
* It is '\u000B', VERTICAL TABULATION.
* It is '\u000C', FORM FEED.
* It is '\u000D', CARRIAGE RETURN.
* It is '\u001C', FILE SEPARATOR.
* It is '\u001D', GROUP SEPARATOR.
* It is '\u001E', RECORD SEPARATOR.
* It is '\u001F', UNIT SEPARATOR. 

A non-breaking space is supposed to provide visual space between words without letting line-breaking or hyphenation algorithms separate them.
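The separator reading can be checked with a small sketch that compares the two JDK predicates on the information separators from the list. The class name `SeparatorDemo` and the helper `disagree` are invented for this example.

```java
class SeparatorDemo {
    // Hypothetical helper: true when isWhitespace and isSpaceChar give different answers.
    static boolean disagree(int codePoint) {
        return Character.isWhitespace(codePoint) != Character.isSpaceChar(codePoint);
    }

    public static void main(String[] args) {
        // FILE, GROUP, RECORD and UNIT SEPARATOR: U+001C through U+001F.
        for (int cp = 0x1C; cp <= 0x1F; cp++) {
            System.out.printf("U+%04X isWhitespace=%b isSpaceChar=%b%n",
                    cp, Character.isWhitespace(cp), Character.isSpaceChar(cp));
        }
        // The non-breaking space disagrees in the opposite direction.
        System.out.println(disagree(0x00A0));
    }
}
```

The separators count as whitespace but not as space characters, while the non-breaking space is a space character but not whitespace.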

Jason S
2

Also be cautious when using the Apache Commons function StringUtils.isBlank() (and related functions) which has the same strange isWhitespace behavior, i.e. a non-breaking space is considered to be non-blank.
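If a blank check that also accepts non-breaking spaces is needed, something along these lines might work. This is a sketch using only the JDK; `isBlankUnicode` is a hypothetical helper, not an Apache Commons function, and treating null as blank is an assumption made here.

```java
class BlankCheck {
    // Hypothetical helper: true for null, empty, or strings made up only of
    // code points accepted by Character.isWhitespace or Character.isSpaceChar.
    static boolean isBlankUnicode(CharSequence cs) {
        if (cs == null) {
            return true;
        }
        return cs.codePoints()
                .allMatch(cp -> Character.isWhitespace(cp) || Character.isSpaceChar(cp));
    }

    public static void main(String[] args) {
        System.out.println(isBlankUnicode("\u00A0 \t"));
        System.out.println(isBlankUnicode(" a "));
    }
}
```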

Peter Mortensen
Maze