4

I want to match all expressions with exactly one whitespace. Currently, I'm using [^\\s]*\\s[^\\s]*. That doesn't seem like a very good way, though.

ryyst
  • See the first answer to http://stackoverflow.com/questions/4304928/unicode-equivalents-for-w-and-b-in-java-regular-expressions for a list of problems regarding unicode spaces in Java. – maaartinus Jan 29 '11 at 21:44

6 Answers

6

Why not? It's fine, just a bit overcomplicated:

\\S*\\s\\S*
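
As a rough sketch of how this pattern behaves in Java, the snippet below compares `matches()`, which must consume the whole input, with `find()`, which accepts a match anywhere. The class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

final class MatchesVersusFind {
    // Unanchored pattern from the answer: one \s with optional non-whitespace on each side.
    static final Pattern ONE_WS = Pattern.compile("\\S*\\s\\S*");

    // matches() succeeds only if the entire input fits the pattern.
    static boolean wholeInputMatches(String input) {
        return ONE_WS.matcher(input).matches();
    }

    // find() succeeds if any substring of the input fits the pattern.
    static boolean someSubstringMatches(String input) {
        return ONE_WS.matcher(input).find();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"a b", "a b c", "ab"}) {
            System.out.printf("[%s] matches=%b find=%b%n",
                    s, wholeInputMatches(s), someSubstringMatches(s));
        }
    }
}
```

With `find()`, an input such as "a b c" is still accepted, because "a b" alone satisfies the pattern. That is why the surrounding context matters when this is embedded in a larger regex.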
maaartinus
  • 1
    Won't this also match multiple whitespace characters, with the "\\S*" parts just matching empty strings before and after each whitespace character? – Sergei Tachenov Jan 29 '11 at 12:08
  • Not if you use this as is, e.g., `Pattern.compile("\\S*\\s\\S*").matcher("  ").matches()` returns false. What you mean corresponds with "(\\S*\\s\\S*)+" or whatever. – maaartinus Jan 29 '11 at 12:14
  • 1
    Yes, it works okay with the `matches()` method because `matches()` implicitly anchors the match at both ends. But I would go ahead and add the anchors anyway; they express my intent more clearly. – Alan Moore Jan 29 '11 at 18:37
  • Right, but this way you make it unusable as a subpattern. And the OP said "but the regex I was asking for is part of a much larger regex,". – maaartinus Jan 29 '11 at 21:16
  • @Alan: That’s very good advice to put the anchors in explicitly. I always do the same with Java regexes, no matter what interface I use. That said, I usually use `find()` instead of `matches()`. Because that’s what I first started using, I never got tripped up on the weird behavior of `matches()`. I rather like [Russ Cox’s API](http://code.google.com/p/re2/) where he distinguishes `RE2::FullMatch()` from `RE2::PartialMatch()`. Beginners are still confused by things like `123456789 =~ /\d{5}/` testing true, or `say uc "goodfood" =~ s/o*/e/r` producing `"EGOODFOOD"` (from `` :). – tchrist Jan 29 '11 at 22:59
  • No, with anchors it won't work inside another regex, which was required. What's weird about matches()? It's RE2::FullMatch, isn't it? And find() is RE2::PartialMatch. – maaartinus Jan 30 '11 at 00:05
  • 1
    @tchrist: Yeah, I cut my teeth on Perl regexes too. Whenever I read or hear the name of Java's `matches()` method, it's accompanied by a sort of echo in my mind: *...that automatically anchors the match*. @maaartinus: before Java, virtually every regex-powered tool or language defined the word "match" to mean the regex described some **part** of the target text, not necessarily the whole thing. By specifying the `matches()` method the way it did, and making it the *only* regex matching method in the String class, Java has added quite a bit of mud to the water. – Alan Moore Jan 30 '11 at 01:42
  • @maaartinus: You can't say for sure that an anchored regex won't work as a subpattern. It could make up one of several top-level alternatives in a regex that has to match the whole string, as in `"^\\d+$|^\\S*\\s\\S*$|^foo$"`. – Alan Moore Jan 30 '11 at 02:00
  • *You can't say for sure that an anchored regex* - Sure, you can use the fact, that each pattern is a subpattern of itself, if you really want to prove me wrong. IMHO, matches in Java is just right, at least as I understand the word "matches", but my English is not very good. – maaartinus Jan 30 '11 at 02:12
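
The anchored-alternation idea from the comments above can be sketched as follows. This is only an illustration: the class and method names are hypothetical, and it uses `find()` so that the explicit anchors, not the method, do the anchoring.

```java
import java.util.regex.Pattern;

final class AnchoredAlternatives {
    // Each top-level alternative carries its own anchors, as in Alan Moore's comment.
    static final Pattern WHOLE =
            Pattern.compile("^\\d+$|^\\S*\\s\\S*$|^foo$");

    // find() is enough here because every alternative is already anchored.
    static boolean accepts(String input) {
        return WHOLE.matcher(input).find();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"123", "a b", "foo", "a b c", "bar"}) {
            System.out.printf("[%s] accepted=%b%n", s, accepts(s));
        }
    }
}
```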
2

I want to match all expressions with exactly one whitespace.

The correct pattern for finding out whether exactly one whitespace character occurs in a Java string is:

\A[^\u0009\u000A-\u000D\u0020\u0085\u00A0\u1680\u180E\u2000-\u200A\u2028\u2029\u202F\u205F\u3000]*+[\u0009\u000A-\u000D\u0020\u0085\u00A0\u1680\u180E\u2000-\u200A\u2028\u2029\u202F\u205F\u3000][^\u0009\u000A-\u000D\u0020\u0085\u00A0\u1680\u180E\u2000-\u200A\u2028\u2029\u202F\u205F\u3000]*+\z
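
A minimal sketch of how an explicit enumeration like this might be used from Java to test for exactly one whitespace character. The class, field, and method names are invented for illustration. The backslashes are doubled so that the regex engine, rather than the Java compiler, interprets the escapes.

```java
import java.util.regex.Pattern;

final class UnicodeOneWhitespace {
    // The enumerated whitespace code points, written as regex escapes.
    // Doubled backslashes keep the Java compiler from expanding them early.
    static final String WS =
            "\\u0009\\u000A-\\u000D\\u0020\\u0085\\u00A0\\u1680\\u180E"
          + "\\u2000-\\u200A\\u2028\\u2029\\u202F\\u205F\\u3000";

    // Non-whitespace run, exactly one whitespace character, non-whitespace run.
    static final Pattern EXACTLY_ONE =
            Pattern.compile("\\A[^" + WS + "]*+[" + WS + "][^" + WS + "]*+\\z");

    static boolean hasExactlyOneWhitespace(String input) {
        return EXACTLY_ONE.matcher(input).matches();
    }

    public static void main(String[] args) {
        // The second sample contains a NO-BREAK SPACE between the letters.
        String[] samples = {"a b", "a\u00A0b", "a  b", "ab"};
        for (String s : samples) {
            System.out.printf("length=%d exactlyOne=%b%n",
                    s.length(), hasExactlyOneWhitespace(s));
        }
    }
}
```

Building the character class once and reusing it in both the negated and non-negated positions keeps the two lists from drifting apart.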

The other answers provided here do not correctly answer the question asked.

Here are all the Unicode whitespace characters, along with their ages (meaning, which Unicode release they first appeared in) and their binary properties that are related to spacing issues.

U+0009 CHARACTER TABULATION
    \s \h \pC \p{Cc}
    Age=1.1 HorizSpace Pattern_White_Space Space White_Space
U+000A LINE FEED (LF)
    \s \v \R \pC \p{Cc}
    Age=1.1 Pattern_White_Space Space VertSpace White_Space
U+000B LINE TABULATION
    \v \R \pC \p{Cc}
    Age=1.1 Pattern_White_Space Space VertSpace White_Space
U+000C FORM FEED (FF)
    \s \v \R \pC \p{Cc}
    Age=1.1 Pattern_White_Space Space VertSpace White_Space
U+000D CARRIAGE RETURN (CR)
    \s \v \R \pC \p{Cc}
    Age=1.1 Pattern_White_Space Space VertSpace White_Space
U+0020 SPACE
    \s \h \pZ \p{Zs}
    Age=1.1 HorizSpace Pattern_White_Space Space Space_Separator White_Space
U+0085 NEXT LINE (NEL)
    \s \v \R \pC \p{Cc}
    Age=1.1 Pattern_White_Space Space VertSpace White_Space
U+00A0 NO-BREAK SPACE
    \s \h \pZ \p{Zs}
    Age=1.1 HorizSpace Space Space_Separator White_Space
U+1680 OGHAM SPACE MARK
    \s \h \pZ \p{Zs}
    Age=3.0 HorizSpace Space Space_Separator White_Space
U+180E MONGOLIAN VOWEL SEPARATOR
    \s \h \pZ \p{Zs}
    Age=3.0 HorizSpace Space Space_Separator White_Space
U+2000 EN QUAD
    \s \h \pZ \p{Zs}
    Age=1.1 HorizSpace Space Space_Separator White_Space
U+2001 EM QUAD
    \s \h \pZ \p{Zs}
    Age=1.1 HorizSpace Space Space_Separator White_Space
U+2002 EN SPACE
    \s \h \pZ \p{Zs}
    Age=1.1 HorizSpace Space Space_Separator White_Space
U+2003 EM SPACE
    \s \h \pZ \p{Zs}
    Age=1.1 HorizSpace Space Space_Separator White_Space
U+2004 THREE-PER-EM SPACE
    \s \h \pZ \p{Zs}
    Age=1.1 HorizSpace Space Space_Separator White_Space
U+2005 FOUR-PER-EM SPACE
    \s \h \pZ \p{Zs}
    Age=1.1 HorizSpace Space Space_Separator White_Space
U+2006 SIX-PER-EM SPACE
    \s \h \pZ \p{Zs}
    Age=1.1 HorizSpace Space Space_Separator White_Space
U+2007 FIGURE SPACE
    \s \h \pZ \p{Zs}
    Age=1.1 HorizSpace Space Space_Separator White_Space
U+2008 PUNCTUATION SPACE
    \s \h \pZ \p{Zs}
    Age=1.1 HorizSpace Space Space_Separator White_Space
U+2009 THIN SPACE
    \s \h \pZ \p{Zs}
    Age=1.1 HorizSpace Space Space_Separator White_Space
U+200A HAIR SPACE
    \s \h \pZ \p{Zs}
    Age=1.1 HorizSpace Space Space_Separator White_Space
U+2028 LINE SEPARATOR
    \s \v \R \pZ \p{Zl}
    Age=1.1 Pattern_White_Space Space VertSpace White_Space
U+2029 PARAGRAPH SEPARATOR
    \s \v \R \pZ \p{Zp}
    Age=1.1 Pattern_White_Space Space VertSpace White_Space
U+202F NARROW NO-BREAK SPACE
    \s \h \pZ \p{Zs}
    Age=3.0 HorizSpace Space Space_Separator White_Space
U+205F MEDIUM MATHEMATICAL SPACE
    \s \h \pZ \p{Zs}
    Age=3.2 HorizSpace Space Space_Separator White_Space
U+3000 IDEOGRAPHIC SPACE
    \s \h \pZ \p{Zs}
    Age=1.1 HorizSpace Space Space_Separator White_Space

Note that all but four were present ever since way way way back in Unicode 1.1. U+1680 OGHAM SPACE MARK, U+180E MONGOLIAN VOWEL SEPARATOR, and U+202F NARROW NO-BREAK SPACE entered The Unicode Standard with release 3.0, and U+205F MEDIUM MATHEMATICAL SPACE first appeared with the 3.2 release. There have been no more added since that time.

The \p{Whitespace} property is required for compliance with UTS#18 RL1.2 “Properties”, and the \p{space} alias and the \s shortcut for whitespace are both required for compliance with UTS#18 RL1.2a “Compatibility Properties”.

As explained in The Unicode Standard 6.0.0’s Conformance document, the White_Space property is a normative property, not an informative, contributory, or provisional property. Because it is a normative property, you are strictly required to use these values to correctly process all Unicode character data according to The Unicode Standard.

Nothing in j.u.r.Pattern provides functionality conformant with The Unicode Standard in this regard. In fact, Java’s regexes fail to meet half the mandatory requirements necessary for even the very lowest possible level of compliance set forth in UTS #18: Unicode Regular Expressions. That minimum level is Level 1, about which is written:

Level 1 is the minimally useful level of support for Unicode. All regex implementations dealing with Unicode should be at least at Level 1.

Because Java’s regexes fail to meet even these very barest of minimal requirements indispensable for dealing with Unicode, Java’s regexes are not minimally useful for dealing with Unicode. You must therefore resort to such explicit enumerations as given above if you hope to produce conformant behaviour. You might care to consider using my pattern-rewriting library.
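
For readers on Java 7 or later, which postdates this answer, one possible alternative to the explicit enumeration is the `Pattern.UNICODE_CHARACTER_CLASS` flag, which is documented as making `\s` follow the Unicode White_Space property. The sketch below assumes such a JDK. The class and field names are made up for illustration, and the exact set of characters matched depends on the Unicode version your JDK ships.

```java
import java.util.regex.Pattern;

final class UnicodeFlagDemo {
    // Default behaviour: \s covers only a handful of ASCII whitespace characters.
    static final Pattern DEFAULT_WS = Pattern.compile("\\S*\\s\\S*");

    // With the flag (Java 7+), \s and \S follow the Unicode White_Space property.
    static final Pattern UNICODE_WS =
            Pattern.compile("\\S*\\s\\S*", Pattern.UNICODE_CHARACTER_CLASS);

    static boolean defaultMatches(String input) {
        return DEFAULT_WS.matcher(input).matches();
    }

    static boolean unicodeMatches(String input) {
        return UNICODE_WS.matcher(input).matches();
    }

    public static void main(String[] args) {
        String nbsp = "a\u00A0b"; // letters separated by NO-BREAK SPACE
        System.out.printf("default=%b unicode=%b%n",
                defaultMatches(nbsp), unicodeMatches(nbsp));
    }
}
```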

Community
tchrist
  • They do, just your definition of whitespace differs. Yours may be better, but do you think I care about \u1680? How do you know you missed none? Maybe they added a new one while you were writing this? However, +1 for pointing to this. – maaartinus Jan 29 '11 at 21:24
  • @maaartinus: My definition of whitespace is that of **The Unicode Standard 6.0.0**. I know I could not have missed one because I am well-acquainted with the Unicode Character Database. I am also familiar with the stability policy of identifiers: *All strings that are valid default Unicode identifiers will continue to be valid default Unicode identifiers in all subsequent versions of Unicode. Furthermore, default identifiers never contain characters with the Pattern_Syntax or Pattern_White_Space properties.* I also know that the Pattern_White_Space property is guaranteed never to change. – tchrist Jan 29 '11 at 21:36
  • I see you know it, I'm just reading a very long answer by you. However, I'm not going to use this lengthy workaround for Java's wrong implementation of Unicode classes unless threatened with a gun. Rewriting j.u.r.Pattern would be way easier than using such expressions. – maaartinus Jan 29 '11 at 21:53
  • @maaartinus: Some of us have no choice in whether to conform with the standard or not. We simply must do so. Either we do what the standard requires, or else we fail. – tchrist Jan 29 '11 at 22:31
0

Another way to do it, if you don't want to go the regex way (possible performance increase):

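String s = "one whitespace";

// Returns true only if s contains exactly one ' ' (U+0020) character.
public boolean hasOneWhitespace(String s) {
   int count = 0;
   for (int i = 0; i < s.length(); i++) {
      if (s.charAt(i) == ' ') {
         count++;
         // A second space settles the answer, so stop scanning early.
         if (count > 1) return false;
      }
   }
   return count == 1;
}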

Of course, this will work only if you consider " " to be whitespace. Tabs and newlines won't work.
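
If tabs and newlines should count as well, one hedged variation is to swap the `' '` comparison for `Character.isWhitespace`, as suggested in the comments. Note that the same comments dispute this choice: `Character.isWhitespace` deliberately excludes the non-breaking spaces, so it does not agree with the Unicode White_Space property. The class name below is made up for illustration.

```java
final class WhitespaceCounter {
    // Same loop as above, but counting anything Character.isWhitespace accepts
    // (space, tab, line feed, and so on; NOT no-break spaces such as U+00A0).
    static boolean hasOneWhitespace(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            if (Character.isWhitespace(s.charAt(i))) {
                count++;
                // Stop as soon as a second whitespace character is seen.
                if (count > 1) return false;
            }
        }
        return count == 1;
    }

    public static void main(String[] args) {
        System.out.println(hasOneWhitespace("a\tb"));
        System.out.println(hasOneWhitespace("a b c"));
    }
}
```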

Bart Kiers
darioo
  • Thanks, but the regex I was asking for is part of a much larger regex, so it has to be a regex. I know regex matching is a performance whore, but it's so much easier to program! – ryyst Jan 29 '11 at 11:46
  • 1
    This code is not optimal: for a long string containing two spaces at the beginning, it performs needless work. I don't think that regexes are that bad; there are a lot of optimizations in there. – maaartinus Jan 29 '11 at 11:53
  • 1
    Please change that last line to `return count == 1;` – whiskeysierra Jan 29 '11 at 12:11
  • 1
    And instead of checking for spaces, you could use `Character.isWhitespace(c)`. – whiskeysierra Jan 29 '11 at 12:14
  • @Willi: **NO!** do not use `Character.isWhitespace(int cp)`, because it does not accord with the normative Unicode property, `\p{White_Space}`. Read: it’s broken!! – tchrist Jan 29 '11 at 23:02
0

You could also check it with indexOf:

String s = "some text";
int indexOf = s.indexOf(' ');
boolean isOneWhitespace = (indexOf >= 0 && indexOf == s.lastIndexOf(' '));
morja
0

Use transliterate. It has to be an independent test; the regex you have above cannot be combined with a larger regex and still test for a single whitespace.

Transliterate is 10-20 times faster than a regex for this test.
This is a jtr example:

String aInput = "This is a test, 123.";
CharacterReplacer cReplacer = Perl5Parser.makeReplacer( "tr[ \\t\\r\\n\\f\\x0B][ \\t\\r\\n\\f\\x0B]" );
String aResult = cReplacer.doReplacement( aInput );
int nMatches = cReplacer.getMatches();

if (nMatches == 1) { ... }
0
String[] ss = { " ", "abc", "a bc", "a b c d" };
Matcher m = Pattern.compile("^\\S*\\s\\S*$").matcher("");
for (String s : ss)
{
  if (m.reset(s).matches())
  {
    System.out.printf("%n>>%s<< OK%n", s);
  }
}

output:

>> << OK

>>a bc<< OK
Alan Moore