
Disclaimer: This question comes mostly from curiosity and a wish to learn a bit more about regex; I know it can be achieved with other methods.

I have a string that represents a list, like so: "egg,eggplant,orange,egg", and I want to search for all the instances of the item egg in this list.

I can't search for the substring egg, because it would also return eggplant.

So, I tried to write a regex to solve this and arrived at the expression ((?:^|\w+,)egg(?:$|,\w+))+ (I used this website to build the regex)

Basically, it searches for the word egg at the beginning of the string, the end of the string and in-between commas (while making sure those aren't trailing commas).

And it works fine, except for this edge case: "egg,eggplant,egg"

Based on this site, I can see that the first egg is matched, but the regex engine then continues up to the last comma. For the last egg, only the remaining string ,egg is left, which doesn't match…
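The behaviour described above can be reproduced with a minimal sketch. Python's `re` module is an assumption here (the question does not name a language), and `matched_spans` is a hypothetical helper introduced only for illustration:

```python
import re

# The pattern from the question, unchanged.
PATTERN = re.compile(r"((?:^|\w+,)egg(?:$|,\w+))+")

def matched_spans(text):
    """Return the full text of every non-overlapping match."""
    return [m.group(0) for m in PATTERN.finditer(text)]

# Two matches, one per "egg" item.
print(matched_spans("egg,eggplant,orange,egg"))
# Only one match: the first match swallows "eggplant", and what is
# left over (",egg") cannot satisfy the pattern.
print(matched_spans("egg,eggplant,egg"))
```

Each match consumes the neighbouring item as context, so that item is no longer available as context for the next match.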

So, what can I do to fix the expression and find all the instances of a word in a string that represents a list?

SagiZiv
  • Use `\b` to match word boundaries. So search for `\begg\b` – Barmar Oct 20 '22 at 23:17
  • Maybe `(?<![^,])egg(?![^,])`, to make sure you match `egg` only between commas or at the start/end of the string? Word boundaries would also find `egg` in `egg-head`. – Wiktor Stribiżew Oct 20 '22 at 23:18
  • @Barmar using `\b` works for this specific example, but as Wiktor Stribiżew wrote, it won't work for all cases – SagiZiv Oct 20 '22 at 23:23
  • @WiktorStribiżew Yes, it seems to work. Can you please explain what it does? – SagiZiv Oct 20 '22 at 23:24
  • Those are negative lookarounds that prevent matching if the word is preceded by or followed by something other than `,` – Barmar Oct 20 '22 at 23:25
  • Your attempted solution doesn't need `\w+`. But it still has a problem that it won't work if you have `egg,egg`, because matches can't overlap. That's the problem that @WiktorStribiżew's lookarounds solve. – Barmar Oct 20 '22 at 23:27
  • @WiktorStribiżew I reopened the question, you can post that as an answer (or find a more appropriate dupe). – Barmar Oct 20 '22 at 23:28
  • Interesting… It seems that the websites I mentioned in the question can't compile this expression. Where can I find more details (and preferably a visualizer) to learn more? – SagiZiv Oct 20 '22 at 23:29
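The overlap problem raised in the comments can be illustrated with a short sketch, again assuming Python's `re`. `CONSUMING` is one possible simplification of the question's pattern with `\w+` dropped (an assumption, not a pattern anyone posted), and `count_matches` is a hypothetical helper:

```python
import re

# Delimiters are part of the match, so each comma can be used only once.
CONSUMING = re.compile(r"(?:^|,)egg(?:$|,)")
# Delimiters are only inspected by lookarounds and are never consumed.
LOOKAROUND = re.compile(r"(?<![^,])egg(?![^,])")

def count_matches(pattern, text):
    """Count the non-overlapping matches of a compiled pattern in text."""
    return sum(1 for _ in pattern.finditer(text))

print(count_matches(CONSUMING, "egg,egg"))   # the shared comma is used up
print(count_matches(LOOKAROUND, "egg,egg"))  # both items are found
```

With the consuming pattern, the first match is `egg,`; the second `egg` then has neither a comma nor the start of the string in front of it.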

1 Answer


You can use

(?<![^,])egg(?![^,])

Or its less efficient equivalent:

(?<=,|^)egg(?=,|$)

See the regex demo. Details:

  • (?<![^,]) - a negative lookbehind that requires start of string or comma to appear immediately to the left of the current location
  • egg - a word
  • (?![^,]) - a negative lookahead that requires end of string or comma to appear immediately to the right of the current location.
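As a rough sketch of how the first pattern might be used from code, assuming Python's `re` (the answer itself is engine-agnostic): `find_item` is a hypothetical helper, and `re.escape` is added so that items containing regex metacharacters are treated literally.

```python
import re

def find_item(item, text):
    """Return the start offsets of every exact occurrence of item in a
    comma-separated string."""
    # (?<![^,])  no non-comma character immediately to the left
    # (?![^,])   no non-comma character immediately to the right
    pattern = re.compile(r"(?<![^,])" + re.escape(item) + r"(?![^,])")
    return [m.start() for m in pattern.finditer(text)]

print(find_item("egg", "egg,eggplant,egg"))
print(find_item("egg", "egg-head,egg"))
```

Because only `egg` itself is consumed, adjacent items such as `egg,egg` are both found, and `egg-head` is rejected because `-` is a non-comma character to the right.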


Wiktor Stribiżew
  • Thanks! I still don't understand why it is a `negative lookbehind`. From what I understood, it checks that there is neither `start of string` nor `,` behind the word `egg`. Why isn't it a `positive lookbehind`? – SagiZiv Oct 21 '22 at 00:05
  • Wiktor is employing a double negative: the word cannot be preceded by a non-comma and cannot be followed by a non-comma. This is convenient because a negative lookbehind on a single character is always satisfied at the beginning of the string, and a negative lookahead on a single character is always satisfied at the end of the string, so you don't need to add crufty logic to handle ^ and $. – Chris Maurer Oct 21 '22 at 01:40
  • @SagiZiv Positive lookbehinds are like `(?<=,|^)`, with the `?<=` at the start. Double negation is more efficient, and makes the pattern compliant with more regex engines. – Wiktor Stribiżew Oct 21 '22 at 08:53
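The compatibility point can be checked for one specific engine with a small sketch. It assumes CPython's built-in `re`, whose documentation requires the contents of a lookbehind to match strings of a fixed length; the alternation `,|^` can match either one character or none, so it is expected to be rejected there. `compiles` is a hypothetical helper, and other engines may behave differently.

```python
import re

def compiles(pattern):
    """Report whether this engine accepts the pattern."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True

for candidate in (r"(?<![^,])egg(?![^,])", r"(?<=,|^)egg(?=,|$)"):
    print(candidate, compiles(candidate))
```

The double-negative form uses a lookbehind that is always exactly one character wide, which is why it is accepted by engines that restrict lookbehind to fixed-width patterns.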