
I am using this regex to catch the "e"s at the end of a string.

e\b|e[!?.:;]

It works, but there is one thing I don't understand. When it encounters an input like

"space."

It matches only the "e" and not the ".", even though the regex contains [!?.:;], which suggests it should capture the dot as well.

If I remove the e\b| at the beginning, it captures the dot too. This is not a problem for me, because I only wanted to capture the letter anyway, but I would like this behavior explained.

Maroun
Rockybilly

2 Answers


The regex engine stops searching as soon as it finds a valid match.

The order of the alternatives matters: since e\b is tried first and matches, the engine never tries the right side of the alternation at that position.

In your case, the regex engine starts at the first character of "space.", the "s", which doesn't match. Then it moves to the second one, the "p", which still doesn't match. It keeps advancing one character at a time until it finally reaches the "e", where the left side of the alternation matches. When this happens, the engine doesn't proceed any further, since a match was found.
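The walk-through above can be checked with a short sketch. It assumes Python's re module, which the question does not name; any backtracking engine with ordered alternation should behave the same way on this input.

```python
import re

text = "space."

# Original pattern: at each position the left alternative e\b is tried first.
original = re.search(r"e\b|e[!?.:;]", text)
print(original.group(), original.span())      # e (4, 5)

# With the left alternative removed, only e[!?.:;] is left to match.
punct_only = re.search(r"e[!?.:;]", text)
print(punct_only.group(), punct_only.span())  # e. (4, 6)
```

Both matches start at index 4; only the alternative that gets to match differs.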

I highly advise you to go through this tutorial; it gives a very good explanation of this behavior.

Maroun

If you need to make sure the . is returned in the match, just swap the alternatives:

e[!?.:;]|e\b
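As a quick check of the swapped pattern, here is a minimal sketch assuming Python's re module (the variable name swapped is only illustrative):

```python
import re

swapped = re.compile(r"e[!?.:;]|e\b")

# The punctuation branch is now tried first, so the dot is included.
print(swapped.search("space.").group())  # e.

# Without trailing punctuation the first branch fails and e\b takes over.
print(swapped.search("space").group())   # e
```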

In an NFA regex engine, the first alternative that matches wins. There are other aspects to consider as well, but they are out of scope here.

More details can be found here:

In this case, here is what is going on: \b after e requires that e be followed by a non-word character (or by the end of the string). Since . is a non-word character, it satisfies the condition. Both e\b and e[!?.:;] can match starting at the same position, so e\b, being the first alternative branch, wins, and because \b is zero-width the match contains only the e.
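The word-boundary condition can be tested on its own. This is a small sketch assuming Python's re module; the test strings other than "space." are made up for illustration.

```python
import re

# \b is zero-width: it checks the boundary between "e" and "." but consumes nothing.
print(re.search(r"e\b", "space.").span())  # (4, 5)

# The end of the string after a word character also counts as a boundary.
print(re.search(r"e\b", "space").span())   # (4, 5)

# A following word character means there is no boundary, so nothing matches.
print(re.search(r"e\b", "spaces"))         # None
```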


Community
Wiktor Stribiżew