2

Java regular expressions have "\B", which matches a non-word boundary.

https://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html

If I have a 'char', how can I check whether it is a non-word boundary?

Thank you.

fospathi
michael
  • 1
    The "boundary" in question is an anchor: a position between (or before/after) characters, and not a character in itself (similar to how `^` doesn't refer to a character, it refers to the position before the first character). So the question itself is a bit meaningless, you might need to clarify so we know exactly what you want. – Mark Peters Jun 02 '10 at 21:12

5 Answers

7

A word boundary has a special meaning: it is a zero-length match, so it cannot be matched against a single character. It marks the position between a word character and a non-word character. See also http://regular-expressions.info/wordboundaries.html.

However, I understood the question to be whether the given char could mark the start or end of a word, that is, whether it is a non-word character. From the javadoc which you linked (here is the latest version):

Predefined character classes

. Any character (may or may not match line terminators)
\d A digit: [0-9]
\D A non-digit: [^0-9]
\s A whitespace character: [ \t\n\x0B\f\r]
\S A non-whitespace character: [^\s]
\w A word character: [a-zA-Z_0-9]
\W A non-word character: [^\w]

So, a word character matches \w. A non-word character matches \W. So:

String string = String.valueOf(yourChar);
boolean nonWordCharacter = string.matches("\\W");
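
To make the distinction between a character and a position concrete, here is a minimal sketch of a position-based check. It assumes the ASCII definition of \w quoted from the javadoc above; the class and method names are hypothetical, and Java's own \b and \B may treat non-ASCII letters differently.

```java
public class BoundaryPositions {

    // ASCII definition of \w as quoted from the javadoc: [a-zA-Z_0-9]
    static boolean isWordChar(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
                || (c >= '0' && c <= '9') || c == '_';
    }

    // Returns true when position pos (0..text.length()) is NOT a word boundary,
    // i.e. the characters on both sides are the same kind (word or non-word).
    // The start and end of the text are treated like non-word characters.
    static boolean isNonWordBoundary(String text, int pos) {
        boolean wordBefore = pos > 0 && isWordChar(text.charAt(pos - 1));
        boolean wordAfter = pos < text.length() && isWordChar(text.charAt(pos));
        return wordBefore == wordAfter;
    }

    public static void main(String[] args) {
        System.out.println(isNonWordBoundary("abc", 1)); // true: between 'a' and 'b'
        System.out.println(isNonWordBoundary("abc", 0)); // false: start of a word
        System.out.println(isNonWordBoundary("a@", 1));  // false: word char next to '@'
        System.out.println(isNonWordBoundary("@@", 1));  // true: two non-word chars
    }
}
```

Note that the method takes a string and an index, not a single char: the answer depends on the neighbours, which is exactly the context a lone char lacks.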
Community
BalusC
  • 3
    Note: this doesn't tell you if it's a boundary, just that it's a non-word char. The concept of a boundary is relevant to an ordered collection and can not be reasonably applied to a single char. – jball Jun 02 '10 at 21:09
  • Further clarification, boundary is a context specific term, and examining only a char removes the context used for the `"\B"` regex. – jball Jun 02 '10 at 21:12
  • 1
    Indeed, the boundary has a special meaning. It has actually a zero-length match. Also see http://regular-expressions.info/wordboundaries.html This is actually used to determine the position between a non-word char and a word-char. I however understood that his question was more whether the given char can possibly denote the start or end of a word boundary. – BalusC Jun 02 '10 at 21:25
  • 1
    I'd add that last comment to your original question to emphasize the fact that `\b` and `\B` don't match a character but a position, since that is what michael is confused about. – Bart Kiers Jun 03 '10 at 07:15
2

The question is very peculiar, but it's true that a \w on its own is surrounded by \b. Similarly, a \W on its own is surrounded by \B. So for the purpose of word boundary definitions, the start and end of the input (the positions where ^ and $ match) behave like non-word characters.

    System.out.println("a".matches("^\\b\\w\\b$")); // true
    System.out.println("a".matches("^\\b\\w\\B$")); // false
    System.out.println("a".matches("^\\B\\w\\b$")); // false
    System.out.println("a".matches("^\\B\\w\\B$")); // false

    System.out.println("@".matches("^\\b\\W\\b$")); // false
    System.out.println("@".matches("^\\b\\W\\B$")); // false
    System.out.println("@".matches("^\\B\\W\\b$")); // false
    System.out.println("@".matches("^\\B\\W\\B$")); // true

    System.out.println("".matches("$$$$\\B\\B\\B\\B^^^")); // true

The last line may be surprising, but such is the nature of anchors.


polygenelubricants
  • What you say is not true: all that stuff is really broken in Java. If you compile up a pattern like `\b\w+\b` and use the Matcher#find method against the string `élève`, you will not find **any matches whatsoever**. Java regexes are super-extremely broken. See [this answer](http://stackoverflow.com/questions/4304928/unicode-equivalents-for-w-and-b-in-java-regular-expressions/4307261#4307261) for why, *and* what you can do about it. – tchrist Dec 02 '10 at 03:15
  • Fantastic explanation thanks - that really makes it much clearer how this works. – Penelope The Duck Apr 18 '13 at 21:06
1
((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'))

or, if you want digits to count as part of a word as well:

((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9'))
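
The expressions above test for a word character, so a non-word character is their negation. A minimal sketch wrapping this in methods follows; it additionally treats the underscore as a word character, on the assumption that you want to mirror the ASCII \w class [a-zA-Z_0-9]. The class and method names are hypothetical.

```java
public class WordChars {

    // Mirrors the ASCII \w class: letters, digits, and underscore.
    static boolean isWordChar(char c) {
        return (c >= 'a' && c <= 'z')
                || (c >= 'A' && c <= 'Z')
                || (c >= '0' && c <= '9')
                || c == '_';
    }

    // A non-word character (\W) is simply anything that is not a word character.
    static boolean isNonWordChar(char c) {
        return !isWordChar(c);
    }

    public static void main(String[] args) {
        System.out.println(isWordChar('a'));    // true
        System.out.println(isWordChar('_'));    // true
        System.out.println(isNonWordChar('@')); // true
        System.out.println(isNonWordChar('7')); // false
    }
}
```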
zed_0xff
1

A boundary is a position between two characters, so a character can never be a boundary.

If you want to match a character that is not surrounded by word boundaries, e.g. the character b in abc, then you can use

\B.\B

Remember to escape the backslashes in a Java string, as in

Pattern regex = Pattern.compile("\\B.\\B");
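
As a minimal sketch of how that pattern might be used, the following collects every match in an input string. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class InnerCharacters {

    // Matches any single character with no word boundary on either side.
    private static final Pattern INNER = Pattern.compile("\\B.\\B");

    // Collects every match of the pattern in the input, in order.
    static List<String> findInner(String input) {
        Matcher matcher = INNER.matcher(input);
        List<String> found = new ArrayList<>();
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // 'a' and 'c' each touch a word boundary, so only 'b' is reported.
        System.out.println(findInner("abc")); // [b]
    }
}
```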
Tim Pietzcker
  • In practice, it's fine to define boundaries as something that exists only between two characters. However, it's actually more liberal than that, at least in Java. See my answer. – polygenelubricants Jun 03 '10 at 07:17
0

Check this answer for a discussion of just what exactly a \b boundary is and how to wrestle your regex into behaving more the way you may want it to.
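
One option that may help, available since Java 7, is the Pattern.UNICODE_CHARACTER_CLASS flag, which makes \w and \b use Unicode definitions consistently. The sketch below applies it to the word élève mentioned in the comments; the class and method names are hypothetical, and this is not necessarily the approach the linked answer takes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnicodeWordBoundary {

    // With UNICODE_CHARACTER_CLASS, \w and \b both treat accented letters as word characters.
    private static final Pattern WORD =
            Pattern.compile("\\b\\w+\\b", Pattern.UNICODE_CHARACTER_CLASS);

    // Returns the first whole word found in the input, or null if there is none.
    static String firstWord(String input) {
        Matcher matcher = WORD.matcher(input);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        // "\u00e9l\u00e8ve" is "élève", written with escapes to avoid source-encoding issues.
        System.out.println(firstWord("\u00e9l\u00e8ve"));
    }
}
```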

Community
tchrist