I am learning the ropes with regular expressions in Python. I have the code below:

import re

test = '"(Z101+Z102+Z1034+Z104)/4"'
regex = re.compile(r"[\(\+]([XYZ]\d\d\d)[\)\+]")
regex.findall(test)

It returns:

['Z101', 'Z104']

However, when I change 'Z101' to 'YZ101':

import re

test = '"(YZ101+Z102+Z1034+Z104)/4"'
regex = re.compile(r"[\(\+]([XYZ]\d\d\d)[\)\+]")
regex.findall(test)

It returns:

['Z102', 'Z104']

The purpose is to extract strings consisting of X, Y or Z followed by exactly three digits. Therefore, the desired output for the first code would be:

['Z101', 'Z102', 'Z104']

How can I fix the pattern to get the correct output?

Wiktor Stribiżew
Ken Masters
  • The problem is very common: the left and right hand boundaries consume the text they match, so consecutive matches are not detected. Use lookarounds, `r"(?<=[(+])([XYZ]\d\d\d)(?=[)+])"` – Wiktor Stribiżew May 08 '21 at 10:36
  • Thank you, @WiktorStribiżew. The second comment is the exact solution and explanation which I am looking for. – Ken Masters May 08 '21 at 11:50

3 Answers

The left and right hand boundary patterns ([\(\+] and [\)\+]) consume the text they match, so consecutive matches that share a boundary character are not detected.
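
To see the consumption concretely, here is a minimal sketch (the variable names are illustrative, not from the question) that prints the whole text of each match instead of only the captured group:

```python
import re

test = '"(Z101+Z102+Z1034+Z104)/4"'
pattern = re.compile(r"[\(\+]([XYZ]\d\d\d)[\)\+]")

# group(0) is the entire match, including the boundary characters it consumed
consumed = [m.group(0) for m in pattern.finditer(test)]
print(consumed)  # ['(Z101+', '+Z104)']
```

The first match takes the "+" after Z101 with it, so the search resumes at "Z102", where no "(" or "+" is left for the opening character class to match.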

You can solve the problem using lookarounds, with or without the capturing group:

r"(?<=[(+])([XYZ]\d\d\d)(?=[)+])"
r"(?<=[(+])[XYZ]\d{3}(?=[)+])"

Details

  • (?<=[(+]) - a positive lookbehind that matches a location that is immediately preceded by ( or +
  • [XYZ] - X, Y or Z
  • \d{3} - three digits
  • (?=[)+]) - a positive lookahead that makes sure there is ) or + immediately to the right of the current location.
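
As a quick check, the lookaround pattern can be run against both strings from the question. This is only a sketch; the names lookaround, first and second are made up for the example:

```python
import re

# Lookarounds assert the boundary characters without consuming them
lookaround = re.compile(r"(?<=[(+])[XYZ]\d{3}(?=[)+])")

first = lookaround.findall('"(Z101+Z102+Z1034+Z104)/4"')
second = lookaround.findall('"(YZ101+Z102+Z1034+Z104)/4"')

print(first)   # ['Z101', 'Z102', 'Z104']
print(second)  # ['Z102', 'Z104']
```

In the second string, Z101 is skipped because it is preceded by Y rather than ( or +, and Z1034 is skipped in both because the three digits are followed by 4.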

Note that a word boundary, \b, can solve the issue in some situations, and it might help here, too.

Wiktor Stribiżew

Use re.findall with the pattern [XYZ]\d{3}\b:

import re

test = '"(YZ101+Z102+Z1034+Z104)/4"'
# \b requires a word boundary after the third digit, which rejects Z1034
matches = re.findall(r'[XYZ]\d{3}\b', test)
print(matches)  # ['Z101', 'Z102', 'Z104']
Tim Biegeleisen

Your pattern is looking for:

  1. Either '(' or '+'
  2. Exactly one of 'X', 'Y', or 'Z'
  3. Exactly three numeric characters
  4. Either ')' or '+'

It's not selecting the 'Z101' because when you add 'Y', that substring isn't immediately preceded by '(' or '+'.

One option would be to leave 1 and 4 out of the pattern. That pattern would be r'[XYZ]\d\d\d'. In this example, it finds every code you want, but without the boundaries it also matches 'Z103' inside 'Z1034', so depending on your data it might create a different problem down the road.

Another option would be to allow an optional prefixed character with '?'. Used as a quantifier, '?' means 'zero or one' (it can also modify other quantifiers, but that's a different topic). To do that, your pattern would be r"[(+][XYZ]?([XYZ]\d\d\d)[)+]"
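
For reference, here is a small sketch that runs both options against the second string from the question (the variable names are chosen for the example only). Neither produces the desired list on its own:

```python
import re

test = '"(YZ101+Z102+Z1034+Z104)/4"'

# Option 1: no boundary characters at all
no_boundaries = re.findall(r"[XYZ]\d\d\d", test)

# Option 2: optional extra letter, boundaries still consumed
optional_prefix = re.findall(r"[(+][XYZ]?([XYZ]\d\d\d)[)+]", test)

print(no_boundaries)    # ['Z101', 'Z102', 'Z103', 'Z104']
print(optional_prefix)  # ['Z101', 'Z104']
```

The second pattern still consumes the "+" after Z101, so Z102 has no boundary character left to its left.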

Kyle Alm
  • I added 'Y' on purpose so that 'Z101' would not be selected. However, it then returns 'Z102', which the first code did not, and that confuses me. I tried the pattern r"[(+][XYZ]?([XYZ]\d\d\d)[)+]" and it yields the same result as above - still missing 'Z102'. – Ken Masters May 08 '21 at 10:59