Alternation usage creates strange behavior

Question

I am using this regex to catch the "e"s at the end of a string.

e\b|e[!?.:;]

It works, but there is one thing I don't understand. When it encounters an input like

"space."

It only matches the "e", without the ".", even though the regex contains [!?.:;], which suggests it should capture the dot as well.

If I remove the e\b| at the beginning, it captures the dot too. This is not a problem for me, because I was only trying to capture the letter anyway, but I would like this behavior explained.

Thank you, actually the problem was me not using global option, so my first assumptions failed. Otherwise, I would have spotted the priority of alternation. — Rockybilly, Mar 14 '16 at 12:35

score 1 · Answer 1 · answered Mar 14 '16 at 12:34

The regex engine stops searching as soon as it finds a valid match.

The order of the alternatives matters: since e\b, the left alternative, matches first, the engine never tries the right side of the alternation at that position.

In your case, the regex engine starts at the first character of "space.", the "s", which doesn't match. Then it moves to the second one, the "p", which still doesn't match. It keeps advancing one character at a time until it reaches the "e", where the left side of the alternation matches. When this happens, it doesn't proceed, since a match was found.
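The walk described above can be reproduced with a short sketch. It assumes Python's re module, which the question does not name, so treat the language choice as an illustration only; the variable names are made up for the example.

```python
import re

# The pattern and input from the question.
pattern = re.compile(r"e\b|e[!?.:;]")
text = "space."

# search() tries each start position from left to right and returns
# the first match it finds; at the "e", the left alternative succeeds.
m = pattern.search(text)
first_match = m.group()
first_span = m.span()

print(first_match, first_span)  # prints: e (4, 5)
```

The span ends right after the "e", so the "." at index 5 is never consumed.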

I highly advise you to go through this tutorial; it gives a very good explanation of that.

Or just read my 2 recent answers (links provided in my answer). — Wiktor Stribiżew, Mar 14 '16 at 12:38
@WiktorStribiżew Your answers are detailed and well explained, well done. — Maroun, Mar 14 '16 at 12:40

score 1 · Answer 2 · edited May 23 '17 at 11:45

If you need to make sure the . is returned in the match, just swap the alternatives:

e[!?.:;]|e\b

In an NFA regex engine, the first alternative that matches wins. There are other aspects to consider as well, but they are out of scope here.
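As a quick check of the swap, here is a minimal sketch, again assuming Python's re module (the question does not say which engine is in use). The variable names are hypothetical.

```python
import re

text = "space."

# Original order: e\b is tried first and wins, so the dot is left out.
original = re.search(r"e\b|e[!?.:;]", text).group()

# Swapped order: e[!?.:;] is tried first and consumes the dot.
swapped = re.search(r"e[!?.:;]|e\b", text).group()

# With no punctuation after the "e", the first branch fails and the
# engine falls back to e\b.
no_punct = re.search(r"e[!?.:;]|e\b", "space").group()

print(original, swapped, no_punct)  # prints: e e. e
```

Putting the longer, more specific branch first is the usual way to make an ordered alternation prefer it.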

More details can be found here:

In this case, here is what is going on: \b after e requires that the e not be followed by a word character, so either a non-word character or the end of the string must come next. Since . is a non-word character, it satisfies the condition. That is why e\b, being the first alternative branch, wins over e[!?.:;], even though both are able to match at that location.
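To see that both branches really can match at the same location, each one can be tried on its own at the position of the "e". This is a sketch under the assumption of Python's re module; left, right and the other names are chosen only for this example.

```python
import re

left = re.compile(r"e\b")        # first branch of the original pattern
right = re.compile(r"e[!?.:;]")  # second branch

text = "space."
pos = text.index("e")  # index of the "e" in "space."

# Pattern.match(string, pos) attempts a match starting exactly at pos.
left_alone = left.match(text, pos).group()
right_alone = right.match(text, pos).group()

# \b also holds at the end of the string, but not between two word characters.
at_end = left.search("space") is not None
inside_word = left.search("spaces") is None

print(left_alone, right_alone, at_end, inside_word)  # prints: e e. True True
```

Both branches succeed at the "e"; the alternation simply reports whichever one is listed first.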
