3

I am having an issue with a regular expression. I am testing with case 1:

/\b(water|watering)\b/g

The above expression matches both words in "water watering".

But if I add a hyphen in between, as in case 2:

/\b(water|water-ing)\b/g

it can't match the water-ing in "water water-ing".
It only works if I move the "water-ing" alternative to the front, as in case 3:

/\b(water-ing|water)\b/g

But I would like to know whether there is any solution for case 2 that does not change the order of the alternatives.

Here is the reference: https://regex101.com/r/kR1bL0/2
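
For readers who want to reproduce the three cases outside regex101, here is a minimal sketch in Python. It assumes Python's `re` module treats `\b` and alternation the same way as the flavor used above for these ASCII inputs; the helper name `matches` is made up for illustration.

```python
import re

text_plain = "water watering"
text_hyphen = "water water-ing"


def matches(pattern, text):
    """Return the full text of every match, similar to a global (/g) search."""
    return [m.group(0) for m in re.finditer(pattern, text)]


case1 = matches(r"\b(water|watering)\b", text_plain)
case2 = matches(r"\b(water|water-ing)\b", text_hyphen)
case3 = matches(r"\b(water-ing|water)\b", text_hyphen)

print(case1)  # ['water', 'watering']
print(case2)  # ['water', 'water']  (the "-ing" part is never matched)
print(case3)  # ['water', 'water-ing']
```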

neobie
  • This is happening because of the `-` in `water-ing` and the word boundary `\b`. – Tushar May 05 '16 at 03:21
  • You cannot achieve what you want without either reordering or changing the first branch pattern. What I mean is: 1) [`/\b(water-ing|water)\b/g`](https://regex101.com/r/pM4mV7/1) or [`/\bwater(?:-ing)?\b/g`](https://regex101.com/r/pM4mV7/2), or 2) [`/\b(water(?!-)|water-ing)\b/g`](https://regex101.com/r/pM4mV7/4). – Wiktor Stribiżew May 05 '16 at 06:44

4 Answers

2

You can do this:

/\b(water-ing|water)\b/g

https://regex101.com/r/fC8wO1/1

Because "water" is a prefix of "water-ing", you have to put "water-ing" first; if the regex can't match it, it then tries "water".

Or you can do this:

/\b(water(?:-ing)?)\b/g

The "?:" is important: it makes the inner "()" non-capturing, so it does not create another capture group.

https://regex101.com/r/yC8uM2/3
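
Both patterns can be compared with a short Python sketch. It assumes Python's `re` module behaves like the flavor above for these inputs; the variable names, and the third pattern with a capturing `(-ing)?`, are only there for illustration.

```python
import re

text = "water water-ing"

reordered = re.compile(r"\b(water-ing|water)\b")
optional_suffix = re.compile(r"\b(water(?:-ing)?)\b")
# Same as above but with a capturing inner group, for comparison only.
capturing_suffix = re.compile(r"\b(water(-ing)?)\b")

reordered_hits = [m.group(0) for m in reordered.finditer(text)]
optional_hits = [m.group(0) for m in optional_suffix.finditer(text)]

print(reordered_hits)           # ['water', 'water-ing']
print(optional_hits)            # ['water', 'water-ing']
print(optional_suffix.groups)   # 1 capture group
print(capturing_suffix.groups)  # 2 capture groups
```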

Troncador
  • The outer parentheses are not beneficial when avoiding alternation because that capture group is identical to the fullstring match. – mickmackusa Nov 15 '22 at 10:26
2

Note About Alternation

In an alternation, the alternatives are tried in order at the current position in the string until one of them succeeds or all of them fail.
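
A small Python sketch of this "first alternative that succeeds wins" rule, without any word boundaries. It assumes a backtracking engine such as Python's `re`; engines with leftmost-longest semantics (POSIX, for example) would behave differently.

```python
import re

subject = "water-ing"

# "water" is listed first and succeeds, so "water-ing" is never tried.
short_first = re.match(r"water|water-ing", subject).group(0)

# With the longer alternative first, the whole word is matched.
long_first = re.match(r"water-ing|water", subject).group(0)

print(short_first)  # water
print(long_first)   # water-ing
```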

Case I

Your string is

water watering

Your regex is

/\b(water|watering)\b/g

i) First, the first alternative is tried as \bwater\b. It succeeds and water is matched, because the space after water in water watering serves as the closing word boundary.

ii) Because of the g flag, another match is attempted. \bwater\b is tried against watering, but it fails: water is followed by i, so there is no word boundary after it. Then the second alternative, \bwatering\b, is tried, and it succeeds because the end of the string serves as the closing word boundary.

Case II

Your string is

water water-ing

For regex

/\b(water|water-ing)\b/g

i) Same as step i) of Case I.

Now the string up to water is consumed, and the current position is the blank space before water-ing.

water water-ing
    ^^
    || 

ii) Because of the g flag, another match is attempted. The first alternative, \bwater, is tried and matches the water in water-ing. The position is now at the -, just after r and before i.

water water-ing
          ^^
          || 

Quoting from here about word boundaries:

A word boundary, in most regex dialects, is a position between \w and \W (non-word char), or at the beginning or end of a string if it begins or ends (respectively) with a word character ( [0-9A-Za-z_] ). The dash is not a word character.

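So the position between r and - is a word boundary, and \bwater\b matches inside water-ing. Because the first alternative has succeeded, the second alternative, water-ing, is never tried.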
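
To see where the boundaries fall, the positions at which `\b` matches can be listed. This is a sketch in Python, assuming Python 3.7 or later (for consistent handling of empty matches) and that its `\b` agrees with the definition quoted above for ASCII text.

```python
import re

word = "water-ing"

# Every position where \b matches: start, before "-", after "-", end.
boundaries = [m.start() for m in re.finditer(r"\b", word)]

# \bwater\b therefore matches the first five characters of "water-ing".
span = re.search(r"\bwater\b", word).span()

print(boundaries)  # [0, 5, 6, 9]
print(span)        # (0, 5)
```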

Case III

For regex

/\b(water-ing|water)\b/g

i) The first alternative, \bwater-ing\b, is tried but does not match the first word, water. Then the second alternative, \bwater\b, is tried, and it succeeds because there is a space after water in the string.

ii) On the next attempt, the first alternative, \bwater-ing\b, is tried and matches water-ing. The string ends with this word, so the end of the string ($) acts as the closing word boundary and the match succeeds.

What's the solution?

i) If alternatives overlap, put the longest one first, then the next longest, and so on, as you did in your last pattern.

ii) You can use a negative lookahead, like

\b(water(?!-)|water-ing)\b
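
A quick check of the lookahead version, sketched in Python under the assumption that `re` handles `(?!-)` and `\b` like the flavor above. The second input, "water-ed", is a made-up example that shows a side effect: plain water is no longer matched when any hyphen follows it.

```python
import re

pattern = re.compile(r"\b(water(?!-)|water-ing)\b")

# The alternatives stay in the original order, yet water-ing is matched.
hits = [m.group(0) for m in pattern.finditer("water water-ing")]

# "water" followed by a hyphen is rejected by the lookahead,
# and "water-ing" does not match "water-ed" either.
hyphen_other = [m.group(0) for m in pattern.finditer("water-ed")]

print(hits)          # ['water', 'water-ing']
print(hyphen_other)  # []
```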

It seems Wiktor has already suggested four solutions in the comments. You can use any of them.

Community
rock321987
0

Different regular expression engines define different character sets for a "word character", and therefore for a word boundary. For example, ECMAScript specifies a word character as one of 63 characters (a-z, A-Z, 0-9 and _), and - is not among them. So in ECMAScript the position between a letter and - is a word boundary.

In such engines, \b is also not suitable for Unicode words. So you should use your own set of characters to decide where a word begins and ends.

For example, in PHP you might use the following:

preg_match_all('/[\p{L}\-]+/u', 'water water-ing', $m);
var_dump($m);
/*
array(1) {
  [0]=>
  array(2) {
    [0]=>
    string(5) "water"
    [1]=>
    string(9) "water-ing"
  }
}
*/

where \p{L} stands for a Unicode "letter" category. See PHP Unicode character properties
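
The same idea, choosing your own boundary characters, can also be applied to the pattern from the question by replacing `\b` with lookarounds. The sketch below is in Python rather than PHP and rests on an assumption that is not in the question: that `-` should count as part of a word. Python's `re` does not support `\p{L}`, so `\w` is used instead (for `str` patterns it is Unicode-aware).

```python
import re

# (?<![\w-]) and (?![\w-]) act as custom boundaries that treat "-"
# as a word character. The alternatives keep the order of case 2.
pattern = re.compile(r"(?<![\w-])(water|water-ing)(?![\w-])")

hits = [m.group(0) for m in pattern.finditer("water water-ing")]

# Because "-" now belongs to the word, "water" inside "rain-water"
# is not reported as a separate word.
inside = [m.group(0) for m in pattern.finditer("rain-water")]

print(hits)    # ['water', 'water-ing']
print(inside)  # []
```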

Ruslan Osmanov
0

You can use this: /\b(water(-ing)?)\b/g