Say I have a regex r"(([a-zA-Z]+)(&|\|)([a-zA-Z]+))", and a string "groupone|grouptwo|groupthree|groupfour".

If I run

re.findall(r"(([a-zA-Z]+)(&|\|)([a-zA-Z]+))", "groupone|grouptwo|groupthree|groupfour")

it returns:

[('groupone|grouptwo', 'groupone', '|', 'grouptwo'), ('groupthree|groupfour', 'groupthree', '|', 'groupfour')]

This is not my desired result. I would also like the overlapping pair grouptwo|groupthree to be matched, like this:

[('groupone|grouptwo', 'groupone', '|', 'grouptwo'), ('grouptwo|groupthree', 'grouptwo', '|', 'groupthree'), ('groupthree|groupfour', 'groupthree', '|', 'groupfour')]

What do I need to correct about my regex to make this possible?

the_Yam1
    With normal `re` by capturing inside a lookahead, eg: [`(?<![^|])(?=(([^\W_]+)([&|])([^\W_]+)))`](https://regex101.com/r/BdTHba/1) – bobble bubble Jul 09 '22 at 13:40
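The lookahead idea from the comment above can be sketched with the standard library alone. This is a minimal variant, not the commenter's exact pattern: it assumes words consist only of ASCII letters, and the names `text`, `pattern` and `overlapping_pairs` are made up for illustration. Because the whole pair is captured inside a zero-width lookahead, no characters are consumed and the next match can start at the second word of the previous pair.

```python
import re

text = "groupone|grouptwo|groupthree|groupfour"

# (?<![a-zA-Z]) only allows a match to begin at the start of a word.
# (?=...) captures the pair without consuming it, so pairs may overlap.
pattern = re.compile(r"(?<![a-zA-Z])(?=(([a-zA-Z]+)([&|])([a-zA-Z]+)))")

overlapping_pairs = pattern.findall(text)
for pair in overlapping_pairs:
    print(pair)
```

The tuples have the same four-group layout as in the question: the whole pair, the left word, the operator, and the right word.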

1 Answer

You could use the third-party regex module for this. Unlike the standard library re, it supports overlapping matches through the overlapped=True argument.

import regex

regex.findall(r"(\b([a-zA-Z]+\b)(&|\|)(\b[a-zA-Z]+)\b)", "groupone|grouptwo|groupthree|groupfour", overlapped=True)

[('groupone|grouptwo', 'groupone', '|', 'grouptwo'),
 ('grouptwo|groupthree', 'grouptwo', '|', 'groupthree'),
 ('groupthree|groupfour', 'groupthree', '|', 'groupfour')]

N.B. Note the added word boundaries (\b) in the pattern. If you kept your original pattern, this method would also return a bunch of unwanted matches that start partway through a word, such as 'roupone|grouptwo'.
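To see why the anchoring matters, here is a rough illustration using only the standard library re. It is a stand-in for the regex call above, not the same mechanism: it tries a match at every position by capturing inside a lookahead, once without and once with a leading \b. The names `unanchored` and `anchored` are hypothetical.

```python
import re

text = "groupone|grouptwo|groupthree|groupfour"

# No word boundary: a match can begin at any letter of a word.
unanchored = re.findall(r"(?=(([a-zA-Z]+)([&|])([a-zA-Z]+)))", text)

# Leading \b: a match can only begin where a word begins.
anchored = re.findall(r"(?=(\b([a-zA-Z]+)([&|])([a-zA-Z]+)))", text)

print(len(unanchored), unanchored[:3])
print(len(anchored), anchored)
```

Without the boundary, every suffix of a left-hand word produces its own match, which is the kind of noise the \b in the answer's pattern is there to suppress.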

ouroboros1
    Was working on the same, came up with `regex.findall(r'((\b[a-zA-Z]+\b)([&|])((?2)))'` – JvdV Jul 09 '22 at 13:08