1

I am having some difficulty writing a regular expression that finds words in a text that contain 'zz', but not at the start or the end of the word. These are two of my many attempts:

pattern = re.compile(r'(?!(?:z){2})[a-z]*zz[a-z]*(?!(?:z){2})')
pattern = re.compile(r'\b[^z\s\d_]{2}[a-z]*zz[a-y][a-z]*(?!(?:zz))\b')

Thanks

Zoe
shillos
  • you can just slice the start and end off `string[2:-2]` – mama Dec 04 '21 at 12:26
  • Can you please clarify what the input looks like? I think some have (mis)understood that you are matching against individual words instead of whole sentences or word lists. You’ll get different answers if you are matching “dazzle” vs “buzz\ndazzle zap razzle” – pilcrow Dec 04 '21 at 12:37
  • Imagine having a text in a book and trying to find words that meet the criteria I listed. Jan already provided a solution. Thanks for trying to help. – shillos Dec 04 '21 at 12:46
  • What about a word with several occurrences of zz: azzazza, azzzzza ? – Casimir et Hippolyte Dec 04 '21 at 16:20
  • @CasimiretHippolyte they are accepted – shillos Dec 05 '21 at 19:05

5 Answers

3

Well, the direct translation would be

\b(?!zz)(?:(?!zz\b)\w)+zz(?:(?!zz\b)\w)+\b

See a demo on regex101.com.
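
For readers who want to try the pattern from Python, here is a minimal sketch using `re.findall`. The sample sentence is an assumption made up for illustration; only the pattern comes from the answer.

```python
import re

# Pattern taken from the answer: 'zz' must occur inside the word,
# but the word may neither start nor end with 'zz'.
pattern = re.compile(r'\b(?!zz)(?:(?!zz\b)\w)+zz(?:(?!zz\b)\w)+\b')

# Hypothetical sample text, with punctuation attached to some words.
text = "A puzzle, some jazz, zzz and a dazzling pizza."

# The pattern has no capturing groups, so findall returns whole matches.
matches = pattern.findall(text)
print(matches)
```

Because the pattern relies on `\b` rather than on whitespace, trailing commas and full stops do not end up in the matches.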


Programmatically, you could use

text = "lorem ipsum buzz mezzo mix zztop but this is all"

words = [word 
         for word in text.split()
         if not (word.startswith("zz") or word.endswith("zz")) and "zz" in word]

print(words)

Which yields

['mezzo']

See a demo on ideone.com.

Jan
3

Another idea is to use non-word boundaries.

\B matches at any position between two word characters as well as at any position between two non-word characters ...

\w*\Bzz\B\w*

See this demo at regex101


Be aware that the above also matches words where the run contains more than two z. For exactly two:

\w*(?<=[^\Wz])zz(?=[^\Wz])\w*

Another demo at regex101


Use any of those patterns with the (?i) flag for caseless matching if needed.
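
As a rough illustration of how the two patterns differ, the sketch below runs both over the same string. The sample text and variable names are assumptions, and `re.IGNORECASE` is passed as a compile flag instead of writing `(?i)` inline.

```python
import re

# Both patterns are copied from the answer; the flag makes 'ZZ' match as well.
loose = re.compile(r'\w*\Bzz\B\w*', re.IGNORECASE)
exact = re.compile(r'\w*(?<=[^\Wz])zz(?=[^\Wz])\w*', re.IGNORECASE)

# Hypothetical sample text: 'azzzza' contains a run of four z.
text = "Buzz MEZZO zztop pizza azzzza"

loose_matches = loose.findall(text)  # accepts runs of two or more z
exact_matches = exact.findall(text)  # requires a non-z letter on both sides of 'zz'

print(loose_matches)
print(exact_matches)
```

Here `azzzza` is kept by the first pattern and dropped by the second, which is the difference the answer describes.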

bobble bubble
2

You can use lookarounds:

\b(?!zz)\w+?zz\w+\b(?<!zz)

demo

or not:

\bz?[^\Wz]\w*?zz\w*[^\Wz]z?\b

demo

Limited to ASCII letters this last pattern can also be written:

\bz?[a-y][a-z]*?zz[a-z]*[a-y]z?\b
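
A quick way to compare the three patterns is to run them over the same input. This is only a sketch: the sample text and the dictionary labels are assumptions chosen for illustration.

```python
import re

# The three patterns from the answer, keyed by an illustrative label.
patterns = {
    "lookaround": r'\b(?!zz)\w+?zz\w+\b(?<!zz)',
    "no lookaround": r'\bz?[^\Wz]\w*?zz\w*[^\Wz]z?\b',
    "ASCII only": r'\bz?[a-y][a-z]*?zz[a-z]*[a-y]z?\b',
}

# Hypothetical sample: includes a word with two 'zz' runs and one starting with a single z.
text = "buzz mezzo zztop azzazza zazzle"

results = {name: re.findall(p, text) for name, p in patterns.items()}
for name, found in results.items():
    print(f"{name}: {found}")
```

On this lowercase ASCII input all three variants agree; they may diverge on text containing digits, underscores or non-ASCII letters, since `\w` covers more than `[a-z]`.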
Casimir et Hippolyte
0

You can use negative lookahead and negative lookbehind assertions in the regex.

>>> import re
>>> text = 'ggksjdfkljggksldjflksddjgkjgg'
>>> re.findall('(?<!^)g{2}(?!$)', text)
['gg']
ThePyGuy
0

Your criteria just mean that the first and last letters cannot be z. So we simply have to make sure that neither the first nor the last letter is z, and that there is a zz somewhere in the text.

Something like

^[^z].*zz.*[^z]$

should work
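
Since `^` and `$` anchor to the whole string, one possible way to use this pattern on running text is to test each word separately. The sketch below assumes words are separated by whitespace; the sample text and names are made up for illustration.

```python
import re

# Pattern from the answer; it describes one whole string, so it is applied per word.
whole_word = re.compile(r'^[^z].*zz.*[^z]$')

# Hypothetical sample text.
text = "buzz mezzo zztop zazzle"

# Split on whitespace (an assumption) and keep the words the pattern accepts.
hits = [word for word in text.split() if whole_word.search(word)]
print(hits)
```

Note that, applied this way, the pattern is stricter than the question: a word such as `zazzle` is rejected because it starts with a single z.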

SztupY