What is a word boundary in regex?

Question

I'm trying to use regexes to match space-separated numbers. I can't find a precise definition of \b ("word boundary"). I had assumed that -12 would be an "integer word" (matched by \b\-?\d+\b) but it appears that this does not work. I'd be grateful to know of ways of .

[I am using Java regexes in Java 1.6]

Example:

Pattern pattern = Pattern.compile("\\s*\\b\\-?\\d+\\s*");
String plus = " 12 ";
System.out.println(""+pattern.matcher(plus).matches());

String minus = " -12 ";
System.out.println(""+pattern.matcher(minus).matches());

pattern = Pattern.compile("\\s*\\-?\\d+\\s*");
System.out.println(""+pattern.matcher(minus).matches());

This returns:

true
false
true

Can you post a small example with input and expected output? — Brent Writes Code, Aug 24 '09 at 20:52
Example Pattern pattern = Pattern.compile("\\s*\\b\\-?\\d+\\s*"); String plus = " 12 "; System.out.println(""+pattern.matcher(plus).matches()); String minus = " -12 "; System.out.println(""+pattern.matcher(minus).matches()); pattern = Pattern.compile("\\s*\\-?\\d+\\s*"); System.out.println(""+pattern.matcher(minus).matches()); gives: true false true — peter.murray.rust, Aug 24 '09 at 21:06

score 160 · Accepted Answer · edited Jul 04 '12 at 21:40

A word boundary, in most regex dialects, is a position between \w and \W (non-word char), or at the beginning or end of a string if it begins or ends (respectively) with a word character ([0-9A-Za-z_]).

So, in the string "-12", it would match before the 1 or after the 2. The dash is not a word character.
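To see this in Java, here is a minimal sketch. The class name `MinusBoundaryDemo` and the helper `matchesWhole` are made up for illustration. It runs the pattern from the question next to a variant with the `\b` moved behind the optional minus sign, which is one possible fix rather than the only one.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative, not from the question.
class MinusBoundaryDemo {
    // True if the whole input matches the regex (same as Matcher.matches()).
    static boolean matchesWhole(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        String fromQuestion = "\\s*\\b-?\\d+\\s*";     // \b before the optional minus
        String boundaryMoved = "\\s*-?\\b\\d+\\b\\s*"; // \b after the optional minus

        System.out.println(matchesWhole(fromQuestion, " 12 "));   // true
        System.out.println(matchesWhole(fromQuestion, " -12 "));  // false
        System.out.println(matchesWhole(boundaryMoved, " -12 ")); // true
        System.out.println(matchesWhole(boundaryMoved, " 12 "));  // true
    }
}
```

The second pattern works because the only word boundaries in " -12 " are just before the 1 and just after the 2, so that is where the `\b` has to sit.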

edited Jul 04 '12 at 21:40

Gilles 'SO- stop being evil'

answered Aug 24 '09 at 21:00

brianary


Correctamundo. `\b` is a zero-width assertion that matches if there is `\w` on one side, and either there is `\W` on the other or the position is beginning or end of string. `\w` is arbitrarily defined to be "identifier" characters (alnums and underscore), not as anything especially useful for English. – hobbs Aug 24 '09 at 21:02
100% correct. Apologies for not just commenting on yours. I hit submit before I saw your answer. – Brent Writes Code Aug 24 '09 at 21:05
For the sake of understanding, is it possible to rewrite the regex `\bhello\b` without using `\b` (using `\w`, `\W` and other)? – David Portabella Sep 28 '16 at 09:40
Sort of: `(^|\W)hello($|\W)`, except that it wouldn't capture any non-word characters before and after, so it would be more like `(^|(?<=\W))hello($|(?=\W))` (using lookahead/lookbehind assertions). – brianary Sep 28 '16 at 09:58
@brianary Slightly simpler: `(?<!\w)hello(?!\w)`. – David Knipe Nov 19 '17 at 17:16
Is there such a thing as a character boundary in regex? E.g. `\b` would be at the beginning and end of the word `hello`, but not after each character – Luke T O'Brien Jul 25 '20 at 07:59
@LukeTO'Brien Most implementations of the `split` function will separate a string into characters with an empty string for the regex value. Otherwise you may be able to use `(\b|\B)` (`\B` is the opposite of `\b`). – brianary Jul 25 '20 at 15:34
@hobbs (13 years later...) - _"`\w` is arbitrarily defined to be "identifier" characters (alnums and underscore), not as anything especially useful for English"_ - yes, but only in the original ECMAScript RegExp, whereas in other engines/environments (.NET's `Regex` or Java's `UNICODE_CHARACTER_CLASS`, or ECMAScript/JavaScript with the `u` flag) then `\w` matches `[\p{Alpha}\p{gc=Mn}\p{gc=Me}\p{gc=Mc}\p{Digit}\p{gc=Pc}]` instead of `[a-zA-Z_0-9]`. – Dai Jul 03 '22 at 13:31

score 74 · Answer 2 · answered Jun 01 '18 at 01:19

While learning regular expressions, I was really stuck on the metacharacter \b. I didn't understand what it meant, and I kept asking myself "what is it, what is it?" After some experimenting on the website, I noticed the pink vertical dashes at the beginning and end of every word, and at that point its meaning clicked: it is exactly a word (\w) boundary.

My answer is purely aimed at building intuition. For the logic behind it, see the other answers.

answered Jun 01 '18 at 01:19

Soner from The Ottoman Empire

A very good site to understand what is a word boundary and how matches are happening – vsingh Oct 23 '19 at 14:19
This post deserves credit for showing instead of telling. A picture's worth a thousand words. – M_M Apr 02 '20 at 14:33

score 37 · Answer 3 · edited Jul 19 '16 at 00:15

A word boundary can occur in one of three positions:

Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.

Word characters are alphanumeric (plus the underscore); a minus sign is not. Taken from Regex Tutorial.
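The three cases can be checked by asking Java for every index where the zero-width `\b` matches. This is only a sketch for ASCII input; the class `BoundaryPositions` and its method `boundaries` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
class BoundaryPositions {
    // Returns every index at which the zero-width \b assertion matches.
    static List<Integer> boundaries(String input) {
        List<Integer> positions = new ArrayList<Integer>();
        Matcher m = Pattern.compile("\\b").matcher(input);
        while (m.find()) {
            positions.add(m.start());
        }
        return positions;
    }

    public static void main(String[] args) {
        // Index 0 is before the first character; index length() is after the last.
        System.out.println(boundaries("-12"));   // [1, 3]
        System.out.println(boundaries("ab cd")); // [0, 2, 3, 5]
    }
}
```

In "-12" there is no boundary at index 0, because the first character is the minus sign and not a word character.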

edited Jul 19 '16 at 00:15

SongWithoutWords

answered Aug 24 '09 at 21:05

WolfmanDragon


Quick example: consider the text `this is a bad c+a+t`. The pattern `\ba` matches the two marked letters: this is `a` bad c+`a`+t – maq Jan 26 '22 at 22:21

Daksh Gargas · Answer 4 · 2022-12-06T04:55:35.930

I would like to explain Alan Moore's answer

A word boundary is a position that is either preceded by a word character and not followed by one or followed by a word character and not preceded by one.

Suppose I have a string "This is a cat, and she's awesome", and I want to replace all occurrences of the letter 'a' only if this letter ('a') exists at the "Boundary of a word",

In other words: the letter a inside 'cat' should not be replaced.

So I'll perform regex (in Python) as

re.sub(r"\ba", "e", myString.strip())  # replace a with e

Therefore,

Input: This is a cat and she's awesome

Output: This is e cat end she's ewesome
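Since the question is about Java, here is roughly the same replacement as a Java sketch. The class `ReplaceAtBoundary` and the method `replaceLeadingA` are made-up names; `trim()` stands in for Python's `strip()`.

```java
// Hypothetical Java counterpart of the Python one-liner above.
class ReplaceAtBoundary {
    // Replaces every 'a' that sits directly after a word boundary with 'e'.
    static String replaceLeadingA(String input) {
        return input.trim().replaceAll("\\ba", "e");
    }

    public static void main(String[] args) {
        // Prints: This is e cat end she's ewesome
        System.out.println(replaceLeadingA("This is a cat and she's awesome"));
    }
}
```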

score 17 · Answer 5 · answered Aug 25 '09 at 01:36

A word boundary is a position that is either preceded by a word character and not followed by one, or followed by a word character and not preceded by one.

answered Aug 25 '09 at 01:36

Alan Moore

Am I the only one who feels like I'm solving a puzzle when reading this answer, even after years? – Soner from The Ottoman Empire Oct 17 '20 at 11:24
@snr Please refer to this: https://stackoverflow.com/a/54629773/8164116 :) – Daksh Gargas Mar 09 '21 at 08:58
@DakshGargas It shouldn't have taken a whole new post to straighten out such an intricate one. – Soner from The Ottoman Empire Mar 09 '21 at 09:02
I was going through a minimalist phase when I wrote that. – Alan Moore Mar 23 '21 at 07:09

score 11 · Answer 6 · edited May 23 '17 at 12:34

I talk about what \b-style regex boundaries actually are here.

The short story is that they’re conditional. Their behavior depends on what they’re next to.

# same as using a \b before:
(?(?=\w) (?<!\w)  | (?<!\W) )

# same as using a \b after:
(?(?<=\w) (?!\w)  | (?!\W)  )

Sometimes that isn’t what you want. See my other answer for elaboration.

edited May 23 '17 at 12:34

Community

answered Nov 18 '10 at 13:35

tchrist


score 7 · Answer 7 · edited Aug 10 '16 at 09:24

I ran into an even worse problem when searching text for words like .NET, C++, C#, and C. You would think that computer programmers would know better than to name a language something that is hard to write regular expressions for.

Anyway, this is what I found out (summarized mostly from http://www.regular-expressions.info, which is a great site): In most flavors of regex, characters that are matched by the short-hand character class \w are the characters that are treated as word characters by word boundaries. Java is an exception. Java supports Unicode for \b but not for \w. (I'm sure there was a good reason for it at the time).

The \w stands for "word character". It always matches the ASCII characters [A-Za-z0-9_]. Notice the inclusion of the underscore and digits (but not dash!). In most flavors that support Unicode, \w includes many characters from other scripts. There is a lot of inconsistency about which characters are actually included. Letters and digits from alphabetic scripts and ideographs are generally included. Connector punctuation other than the underscore and numeric symbols that aren't digits may or may not be included. XML Schema and XPath even include all symbols in \w. But Java, JavaScript, and PCRE match only ASCII characters with \w.

Which is why Java-based regex searches for C++, C# or .NET (even when you remember to escape the period and pluses) are screwed by the \b.
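A small sketch of what goes wrong, for ASCII text: `\b` needs a word character on one side, and `#`, `+` and `.` are not word characters, so there is no boundary between `#` and the comma that follows it. The lookaround form below is borrowed from David Knipe's comment on the accepted answer; the class `SymbolNameSearch` and its helpers are hypothetical names, and this is an alternative to the punctuation lists used further down, not a replacement for them.

```java
import java.util.regex.Pattern;

// Hypothetical helper class for illustration only.
class SymbolNameSearch {
    // True if the regex matches anywhere in the text.
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    // Quotes the literal name and requires that no word character touches it.
    static String wholeName(String name) {
        return "(?<!\\w)" + Pattern.quote(name) + "(?!\\w)";
    }

    public static void main(String[] args) {
        String text = "Programming in C, (C++) C#, Java, and .NET.";
        System.out.println(found("\\bC#\\b", text));         // false
        System.out.println(found(wholeName("C#"), text));    // true
        System.out.println(found(wholeName("C++"), text));   // true
        System.out.println(found(wholeName(".NET"), text));  // true
    }
}
```

This does not help with the plain `C` case: `wholeName("C")` still matches the C in "C&O", because `&` is not a word character either.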

Note: I'm not sure what to do about mistakes in text, like when someone doesn't put a space after a period at the end of a sentence. I allowed for it, but I'm not sure that it's necessarily the right thing to do.

Anyway, in Java, if you're searching text for those weird-named languages, you need to replace the \b with before and after whitespace and punctuation designators. For example:

// Needs java.util.regex.Matcher and java.util.regex.Pattern.
// Returns the lines of the input that contain a match for regexp, joined by newlines.
public static String grep(String regexp, String multiLineStringToSearch) {
    String result = "";
    String[] lines = multiLineStringToSearch.split("\\n");
    Pattern pattern = Pattern.compile(regexp);
    for (String line : lines) {
        Matcher matcher = pattern.matcher(line);
        if (matcher.find()) {
            result = result + "\n" + line;
        }
    }
    return result.trim();
}

Then in your test or main function:

    String beforeWord = "(\\s|\\.|\\,|\\!|\\?|\\(|\\)|\\'|\\\"|^)";   
    String afterWord =  "(\\s|\\.|\\,|\\!|\\?|\\(|\\)|\\'|\\\"|$)";
    String text = "Programming in C, (C++) C#, Java, and .NET.";
    System.out.println("text="+text);
    // Here is where Java word boundaries do not work correctly on "cutesy" computer language names.  
    System.out.println("Bad word boundary can't find because of Java: grep with word boundary for .NET="+ grep("\\b\\.NET\\b", text));
    System.out.println("Should find: grep exactly for .NET="+ grep(beforeWord+"\\.NET"+afterWord, text));
    System.out.println("Bad word boundary can't find because of Java: grep with word boundary for C#="+ grep("\\bC#\\b", text));
    System.out.println("Should find: grep exactly for C#="+ grep("C#"+afterWord, text));
    System.out.println("Bad word boundary can't find because of Java:grep with word boundary for C++="+ grep("\\bC\\+\\+\\b", text));
    System.out.println("Should find: grep exactly for C++="+ grep(beforeWord+"C\\+\\+"+afterWord, text));

    System.out.println("Should find: grep with word boundary for Java="+ grep("\\bJava\\b", text));
    System.out.println("Should find: grep for case-insensitive java="+ grep("(?i)\\bjava\\b", text));
    System.out.println("Should find: grep with word boundary for C="+ grep("\\bC\\b", text));  // Works Ok for this example, but see below
    // Because of the stupid too-short cutesy name, searches find stuff they shouldn't.
    text = "Worked on C&O (Chesapeake and Ohio) Canal when I was younger; more recently developed in Lisp.";
    System.out.println("text="+text);
    System.out.println("Bad word boundary because of C name: grep with word boundary for C="+ grep("\\bC\\b", text));
    System.out.println("Should be blank: grep exactly for C="+ grep(beforeWord+"C"+afterWord, text));
    // Make sure the first and last cases work OK.

    text = "C is a language that should have been named differently.";
    System.out.println("text="+text);
    System.out.println("grep exactly for C="+ grep(beforeWord+"C"+afterWord, text));

    text = "One language that should have been named differently is C";
    System.out.println("text="+text);
    System.out.println("grep exactly for C="+ grep(beforeWord+"C"+afterWord, text));

    //Make sure we don't get false positives
    text = "The letter 'c' can be hard as in Cat, or soft as in Cindy. Computer languages should not require disambiguation (e.g. Ruby, Python vs. Fortran, Hadoop)";
    System.out.println("text="+text);
    System.out.println("Should be blank: grep exactly for C="+ grep(beforeWord+"C"+afterWord, text));

P.S. My thanks to http://regexpal.com/ without whom the regex world would be very miserable!

I struggled trying to understand why I couldn't match `C#` but now it's clearer — Mugoma J. Okomba, Dec 06 '16 at 19:48

score 4 · Answer 8 · answered Aug 24 '09 at 21:03

Check out the documentation on boundary conditions:

http://java.sun.com/docs/books/tutorial/essential/regex/bounds.html

Check out this sample:

// Needs java.util.Arrays.
public static void main(final String[] args) {
    String x = "I found the value -12 in my string.";
    System.err.println(Arrays.toString(x.split("\\b-?\\d+\\b")));
}

When you print it out, notice that the output is this:

[I found the value -,  in my string.]

This means that the "-" character is not being picked up as being on the boundary of a word because it's not considered a word character. Looks like @brianary kinda beat me to the punch, so he gets an up-vote.

score 4 · Answer 9 · answered Oct 17 '20 at 15:28

Reference: Mastering Regular Expressions (Jeffrey E.F. Friedl) - O'Reilly

\b is equivalent to (?<!\w)(?=\w)|(?<=\w)(?!\w)
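The equivalence can be checked empirically for ASCII input by comparing match positions. The sketch below is only a spot check, not a proof; the class `BoundaryEquivalence` and its constants are hypothetical names. Splitting the alternation in two also gives a "start of word only" and an "end of word only" assertion.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical spot check for ASCII input.
class BoundaryEquivalence {
    static final String WORD_START = "(?<!\\w)(?=\\w)";
    static final String WORD_END = "(?<=\\w)(?!\\w)";
    static final String LOOKAROUND_BOUNDARY = WORD_START + "|" + WORD_END;

    // Returns every index at which the (zero-width) regex matches.
    static List<Integer> positions(String regex, String input) {
        List<Integer> result = new ArrayList<Integer>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            result.add(m.start());
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "ab -12";
        System.out.println(positions("\\b", input));               // [0, 2, 4, 6]
        System.out.println(positions(LOOKAROUND_BOUNDARY, input)); // [0, 2, 4, 6]
        System.out.println(positions(WORD_START, input));          // [0, 4]
        System.out.println(positions(WORD_END, input));            // [2, 6]
    }
}
```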

answered Oct 17 '20 at 15:28

user4779


This is a great explanation and makes it obvious how to get only the "beginning of word" or "end of word" part of it (but not both). – jlh Jan 17 '21 at 14:34

score 2 · Answer 10 · answered Nov 08 '18 at 10:38

The word boundary \b matches where one character is a word character and the adjacent one is a non-word character. A regular expression for a negative number should be

--?\b\d+\b

check working DEMO

answered Nov 08 '18 at 10:38

AnubhavShakya


score 1 · Answer 11 · answered Aug 24 '09 at 20:59

I believe that your problem is due to the fact that - is not a word character. Thus, the word boundary will match after the -, and so will not capture it. Word boundaries match before the first and after the last word characters in a string, as well as any place where before it is a word character or non-word character, and after it is the opposite. Also note that word boundary is a zero-width match.

One possible alternative is

(?:(?:^|\s)-?)\d+\b

This will match any numbers starting with a space character and an optional dash, and ending at a word boundary. It will also match a number starting at the beginning of the string.
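Here is a rough Java sketch of that pattern in use. The class `NumberFinder` and the method `findNumbers` are made-up names. Note that the match itself can include the leading whitespace character, so the sketch trims each match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
class NumberFinder {
    private static final Pattern NUMBER = Pattern.compile("(?:(?:^|\\s)-?)\\d+\\b");

    // Collects every integer that starts the string or follows whitespace.
    static List<String> findNumbers(String input) {
        List<String> found = new ArrayList<String>();
        Matcher m = NUMBER.matcher(input);
        while (m.find()) {
            // The match may begin with the whitespace character, so trim it.
            found.add(m.group().trim());
        }
        return found;
    }

    public static void main(String[] args) {
        // "x-56" is skipped (minus not preceded by whitespace) and so is
        // "78a" (no word boundary after the digits).
        System.out.println(findNumbers("12 -34 x-56 78a")); // [12, -34]
    }
}
```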

vic · Answer 12 · 2017-11-19T18:53:13.420

0

When you use \\b(\\w+)+\\b, it means an exact match of a word containing only word characters ([a-zA-Z0-9_]).

In your case, for example, putting \\b at the beginning of the regex will accept -12 (with space) but it won't accept -12 (without space).

for reference to support my words: https://docs.oracle.com/javase/tutorial/essential/regex/bounds.html

edited Nov 19 '17 at 18:53

answered Nov 19 '17 at 16:41

vic


score -1 · Answer 13 · answered Aug 24 '09 at 20:55

I think it's the boundary (i.e. character following) of the last match or the beginning or end of the string.

answered Aug 24 '09 at 20:55

You're thinking of `\G`: matches the beginning of the string (like `\A`) on the first match attempt; after that it matches the position where the previous match ended. – Alan Moore Jun 24 '16 at 20:50
