How to check if a character is a non-word boundary

Question

In Java regular expressions, "\B" matches a non-word boundary.

https://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html

If I have a 'char', how can I check whether it is a non-word boundary?

Thank you.

The "boundary" in question is an anchor: a position between (or before/after) characters, and not a character in itself (similar to how `^` doesn't refer to a character, it refers to the position before the first character). So the question itself is a bit meaningless, you might need to clarify so we know exactly what you want. — Mark Peters, Jun 02 '10 at 21:12

score 7 · Answer 1 · edited Jun 20 '20 at 09:12

The boundary has a special meaning. It is a zero-length match, so it cannot be matched against a single character. It marks the position between a word char and a non-word char. See also http://regular-expressions.info/wordboundaries.html.
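To make the "position, not character" point concrete, here is a minimal sketch that lists the indices where `\b` and `\B` match in a short ASCII string. The class name `BoundaryPositions` and the helper `positions` are made up for illustration; only `Pattern` and `Matcher` come from the JDK.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
final class BoundaryPositions {
    // Collects the start index of every match of a zero-width pattern.
    static List<Integer> positions(String regex, String input) {
        List<Integer> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            // For \b and \B, start() == end(): the match consumes no characters.
            result.add(m.start());
        }
        return result;
    }

    public static void main(String[] args) {
        // Indices:  a=0  b=1  ' '=2  c=3  d=4, end of input = 5
        System.out.println(positions("\\b", "ab cd")); // [0, 2, 3, 5]
        System.out.println(positions("\\B", "ab cd")); // [1, 4]
    }
}
```

Every index is either a `\b` position or a `\B` position, and none of them "is" a character: index 1, for example, is the gap between `a` and `b`.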

However, I understood the question to be whether the given char can possibly mark the start or end of a word boundary. From the Javadoc that you linked:

Predefined character classes

    .     Any character (may or may not match line terminators)
    \d    A digit: [0-9]
    \D    A non-digit: [^0-9]
    \s    A whitespace character: [ \t\n\x0B\f\r]
    \S    A non-whitespace character: [^\s]
    \w    A word character: [a-zA-Z_0-9]
    \W    A non-word character: [^\w]

A word character matches \w and a non-word character matches \W, so:

    // Wrap the char in a one-character String so it can be tested against a regex.
    String string = String.valueOf(yourChar);
    boolean nonWordCharacter = string.matches("\\W");
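If the check runs often, one possible variation is to compile the pattern once instead of letting `String.matches` recompile it on every call. This is only a sketch; the class and method names (`NonWordCharCheck`, `isNonWordChar`) are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
final class NonWordCharCheck {
    // Compiled once; \W is the complement of the default \w class [a-zA-Z_0-9].
    private static final Pattern NON_WORD = Pattern.compile("\\W");

    static boolean isNonWordChar(char c) {
        // matches() requires the whole one-character string to match \W.
        return NON_WORD.matcher(String.valueOf(c)).matches();
    }

    public static void main(String[] args) {
        System.out.println(isNonWordChar('@')); // true
        System.out.println(isNonWordChar(' ')); // true
        System.out.println(isNonWordChar('a')); // false
        System.out.println(isNonWordChar('_')); // false (underscore is a word character)
    }
}
```

As the comments below point out, this classifies the character only; it says nothing about boundaries, which need the surrounding text.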

answered Jun 02 '10 at 21:06

BalusC

Note: this doesn't tell you if it's a boundary, just that it's a non-word char. The concept of a boundary is relevant to an ordered collection and can not be reasonably applied to a single char. – jball Jun 02 '10 at 21:09
Further clarification, boundary is a context specific term, and examining only a char removes the context used for the `"\B"` regex. – jball Jun 02 '10 at 21:12
Indeed, the boundary has a special meaning. It has actually a zero-length match. Also see http://regular-expressions.info/wordboundaries.html This is actually used to determine the position between a non-word char and a word-char. I however understood that his question was more whether the given char can possibly denote the start or end of a word boundary. – BalusC Jun 02 '10 at 21:25
I'd add that last comment to your original question to emphasize the fact that `\b` and `\B` don't match a character but a position, since that is what michael is confused about. – Bart Kiers Jun 03 '10 at 07:15

score 2 · Answer 2 · answered Jun 03 '10 at 07:12

The question is very peculiar, but it's true that a \w on its own is surrounded by \b. Similarly, a \W on its own is surrounded by \B. So for the purpose of word boundary definitions, the start and end of the input (the positions where ^ and $ match) behave like non-word characters.

    System.out.println("a".matches("^\\b\\w\\b$")); // true
    System.out.println("a".matches("^\\b\\w\\B$")); // false
    System.out.println("a".matches("^\\B\\w\\b$")); // false
    System.out.println("a".matches("^\\B\\w\\B$")); // false

    System.out.println("@".matches("^\\b\\W\\b$")); // false
    System.out.println("@".matches("^\\b\\W\\B$")); // false
    System.out.println("@".matches("^\\B\\W\\b$")); // false
    System.out.println("@".matches("^\\B\\W\\B$")); // true

    System.out.println("".matches("$$$$\\B\\B\\B\\B^^^")); // true

The last line may be surprising, but such is the nature of anchors.

What you say is not true: all that stuff is really broken in Java. If you compile up a pattern like `\b\w+\b` and use the Matcher#find method against the string `élève`, you will not find **any matches whatsoever**. Java regexes are super-extremely broken. See [this answer](http://stackoverflow.com/questions/4304928/unicode-equivalents-for-w-and-b-in-java-regular-expressions/4307261#4307261) for why, *and* what you can do about it. — tchrist, Dec 02 '10 at 03:15
Fantastic explanation thanks - that really makes it much clearer how this works. — Penelope The Duck, Apr 18 '13 at 21:06
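Regarding the `élève` comment above: since Java 7, `Pattern.UNICODE_CHARACTER_CLASS` makes `\w` (and with it `\b`) follow the Unicode definition of a word character. The sketch below assumes Java 7 or later; the class name `UnicodeWordDemo` and the method `words` are hypothetical, and the accented letters are written as Unicode escapes so that source-file encoding does not matter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
final class UnicodeWordDemo {
    // With UNICODE_CHARACTER_CLASS, \w and \b agree on what a word character is.
    private static final Pattern WORD =
            Pattern.compile("\\b\\w+\\b", Pattern.UNICODE_CHARACTER_CLASS);

    static List<String> words(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = WORD.matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // "\u00e9l\u00e8ve" is the word from the comment, spelled with escapes.
        System.out.println(words("\u00e9l\u00e8ve").size()); // 1
    }
}
```

The flag can also be switched on inside the pattern with the embedded form `(?U)`.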

score 1 · Answer 3 · answered Jun 02 '10 at 21:09

    ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'))

or, if you want digits to count as part of a word as well:

    ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9'))
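Note that these expressions are true for word characters, so a non-word check is their negation. A possible way to wrap this up, adding the underscore so the result lines up with the default `\w` class `[a-zA-Z_0-9]` quoted in the first answer, is sketched below. The names `AsciiWordChars`, `isWordChar` and `isNonWordChar` are made up for this example.

```java
// Hypothetical helper class, not part of any library.
final class AsciiWordChars {
    // Regex-free equivalent of the default \w class: letters, digits, underscore.
    static boolean isWordChar(char c) {
        return (c >= 'a' && c <= 'z')
            || (c >= 'A' && c <= 'Z')
            || (c >= '0' && c <= '9')
            || c == '_';
    }

    // Regex-free equivalent of the default \W class.
    static boolean isNonWordChar(char c) {
        return !isWordChar(c);
    }

    public static void main(String[] args) {
        System.out.println(isWordChar('x'));    // true
        System.out.println(isWordChar('_'));    // true
        System.out.println(isNonWordChar('-')); // true
        System.out.println(isNonWordChar('5')); // false
    }
}
```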

zed_0xff

score 1 · Answer 4 · answered Jun 02 '10 at 21:13

A boundary is a position between two characters, so a character can never be a boundary.

If you want to match a character that has no word boundary on either side of it, e.g. the character b in abc, then you can use

    \B.\B

Remember to escape the backslashes in a Java string, as in

    Pattern regex = Pattern.compile("\\B.\\B");
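A short sketch of how that compiled pattern might be used with `Matcher.find` to collect every such character. The class name `InnerChars` and the method `innerChars` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
final class InnerChars {
    // A single character with a non-boundary position on both sides of it.
    private static final Pattern REGEX = Pattern.compile("\\B.\\B");

    static List<String> innerChars(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = REGEX.matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(innerChars("abc"));   // [b]
        System.out.println(innerChars("hello")); // [e, l, l]
    }
}
```

In `hello`, the `h` and the `o` are skipped because the start and end of the input count as boundaries next to a word character.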

Tim Pietzcker

In practice, it's fine to define boundaries as something that exists only between two characters. However, it's actually more liberal than that, at least in Java. See my answer. – polygenelubricants Jun 03 '10 at 07:17

score 0 · Answer 5 · edited May 23 '17 at 12:07

Check this answer for a discussion of just what exactly a \b boundary is and how to wrestle your regex into behaving more the way you may want it to.

answered Nov 18 '10 at 13:37

tchrist
