Finding a substring using regex

Question

Disclaimer: I'm asking mostly out of curiosity and a wish to learn a bit more about regex; I know this can be achieved with other methods.

I have a string that represents a list, like so: "egg,eggplant,orange,egg", and I want to search for all the instances of the item egg in this list.

I can't search for the substring egg, because it would also return eggplant.

So I tried to write a regex to solve this and arrived at ((?:^|\w+,)egg(?:$|,\w+))+ (I used this website to build the regex).

Basically, it searches for the word egg at the beginning of the string, the end of the string and in-between commas (while making sure those aren't trailing commas).

And it works fine, except for this edge case: "egg,eggplant,egg"

Based on this site, I can see that the first egg is matched, but then the regex engine continues until the last comma. For the last egg it is then left with the remaining string ,egg which doesn't match…
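The behaviour described above can be reproduced with a short script. This is only a sketch: the question does not name a language, so Python's built-in `re` module is an assumption, and `matched_texts` is a hypothetical helper name.

```python
import re

# The pattern from the question, unchanged.
PATTERN = re.compile(r"((?:^|\w+,)egg(?:$|,\w+))+")


def matched_texts(text):
    """Return the full text of each non-overlapping match."""
    return [m.group(0) for m in PATTERN.finditer(text)]


# Two matches, one per egg item (each match also swallows a neighbouring item).
print(matched_texts("egg,eggplant,orange,egg"))  # ['egg,eggplant', 'orange,egg']

# Only one match: the first match consumes ",eggplant", so the
# final "egg" has no preceding "\w+," left to pair with.
print(matched_texts("egg,eggplant,egg"))  # ['egg,eggplant']
```

Because the neighbouring items are consumed as part of each match, they are no longer available to anchor the next one.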

So, what can I do to fix the expression and find all the instances of a word in a string that represents a list?

Maybe `(?<![^,])egg(?![^,])`? to make sure you match `egg` inside commas, or start/end of string? That is, word boundaries will find `egg` in `egg-head`. — Wiktor Stribiżew, Oct 20 '22 at 23:18
@Barmar using `\b` works for this specific example, but as Wiktor Stribiżew wrote it won't work for all cases — SagiZiv, Oct 20 '22 at 23:23
@WiktorStribiżew Yes, it seems to work. Can you please explain what it does? — SagiZiv, Oct 20 '22 at 23:24
Those are negative lookarounds that prevent matching if the word is preceded by or followed by something other than `,` — Barmar, Oct 20 '22 at 23:25
Your attempted solution doesn't need `\w+`. But it still has a problem that it won't work if you have `egg,egg`, because matches can't overlap. That's the problem that @WiktorStribiżew's lookarounds solve. — Barmar, Oct 20 '22 at 23:27
@WiktorStribiżew I reopened the question, you can post that as an answer (or find a more appropriate dupe). — Barmar, Oct 20 '22 at 23:28
Interesting… It seems that the websites I mentioned in the question can't compile this expression. Where can I find more details (and preferably a visualizer) to learn more? — SagiZiv, Oct 20 '22 at 23:29

score 2 · Accepted Answer · answered Oct 20 '22 at 23:38

You can use

(?<![^,])egg(?![^,])

Or its less efficient equivalent:

(?<=,|^)egg(?=,|$)

See the regex demo. Details:

(?<![^,]) - a negative lookbehind that requires start of string or comma to appear immediately to the left of the current location
egg - a word
(?![^,]) - a negative lookahead that requires end of string or comma to appear immediately to the right of the current location.
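
As a rough illustration of the pattern above, here is a minimal sketch in Python (an assumption, since the thread does not say which engine is in use). The helper names `find_spans` and `item_pattern` are hypothetical; `item_pattern` shows one way the same idea might be generalized to an arbitrary item.

```python
import re

# The double-negative pattern from the answer.
EGG = re.compile(r"(?<![^,])egg(?![^,])")


def find_spans(text, pattern=EGG):
    """Return (start, end) offsets of every match in text."""
    return [m.span() for m in pattern.finditer(text)]


def item_pattern(item):
    """Build the same kind of pattern for any literal item (hypothetical helper)."""
    return re.compile(rf"(?<![^,]){re.escape(item)}(?![^,])")


print(find_spans("egg,eggplant,orange,egg"))  # [(0, 3), (20, 23)]
print(find_spans("egg,eggplant,egg"))         # [(0, 3), (13, 16)]
print(find_spans("egg,egg"))                  # [(0, 3), (4, 7)]
print(find_spans("egg-head,egg"))             # [(9, 12)]
print(find_spans("a.b,a,axb", item_pattern("a.b")))  # [(0, 3)]
```

Since lookarounds consume no characters, the comma between two adjacent items can serve as the right-hand boundary of one match and the left-hand boundary of the next.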

Wiktor Stribiżew

Thanks! I still don't understand why it is a `negative lookbehind`. From what I understood, it checks that there is neither `start of string` nor `,` behind the word `egg`. Why isn't it a `positive lookbehind`? – SagiZiv Oct 21 '22 at 00:05
Wiktor is employing a double negative, cannot be preceded by a not comma and cannot be followed by a not comma. He knows it is convenient because any negative lookbehind is satisfied at the beginning of the string and negative lookaheads are satisfied at end of string, so you don't need to add crufty logic to handle ^ and $. – Chris Maurer Oct 21 '22 at 01:40
@SagiZiv Positive lookbehinds are like `(?<=,|^)`, with the `?<=` at the start. Double negation is more efficient, and makes the pattern compliant with more regex engines. – Wiktor Stribiżew Oct 21 '22 at 08:53
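
To make the compatibility point concrete, here is a hedged sketch in Python, chosen only as an example engine. As far as I know, Python's built-in `re` module requires fixed-width lookbehinds and so rejects `(?<=,|^)`, while the double-negative form compiles. `POSITIVE_SPLIT` is my own assumed rewrite that moves `^` outside the lookbehind, and `try_compile` and `spans` are hypothetical helper names.

```python
import re

SAMPLES = ["egg,eggplant,orange,egg", "egg,eggplant,egg", "egg,egg", "egg-head,egg"]

DOUBLE_NEGATIVE = r"(?<![^,])egg(?![^,])"
POSITIVE = r"(?<=,|^)egg(?=,|$)"
# Assumed workaround: keep the lookbehind fixed-width by moving ^ out of it.
POSITIVE_SPLIT = r"(?:^|(?<=,))egg(?=,|$)"


def try_compile(pattern):
    """Compile a pattern, returning None (and reporting why) if the engine rejects it."""
    try:
        return re.compile(pattern)
    except re.error as exc:
        print(f"rejected {pattern!r}: {exc}")
        return None


def spans(pattern, text):
    """Return (start, end) offsets of every match of pattern in text."""
    return [m.span() for m in re.finditer(pattern, text)]


try_compile(POSITIVE)  # expected to be rejected by engines with fixed-width lookbehind

for sample in SAMPLES:
    # On these samples, both portable spellings find the same items.
    print(sample, spans(DOUBLE_NEGATIVE, sample), spans(POSITIVE_SPLIT, sample))
```

Engines that allow variable-width lookbehind should accept all three spellings, so the result of `try_compile(POSITIVE)` depends on where the pattern is run.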
