I am trying to extract some text using a regex in Python. The regex is quite complicated because it is built on the fly depending on the language, so the best way to go about it is to compose it from different parts.

This is the present code:

    import re

    # reproduction of the problem on a small scale
    num = fr"""(\d\d)([A-Z])?"""
    sep = fr"""and |or |, """

    # pattern composition
    pattern = fr"""((({num})({sep}{num})+)|({num}))"""

    text = """biscuits 10 are good
    biscuits 20 and 30 are good
    biscuits 40 and hot dog are good
    but this one 50A and 50B and not ok"""

    refs = re.finditer(pattern, text, re.VERBOSE)
    for ref in refs:
        TEXT = ref.group(0)
        print(TEXT)

That gives all the hits separately.

My desired outcome is the whole match:

    10
    20 and 30
    40
    50A and 50B

Basically, `num` is an expression that can appear alone or in combination with others separated by `sep`.

Of course, if `num` is followed by `sep` but not by another `num`, only `num` should be matched.

Does anyone know how to modify the code to achieve this?

Wiktor Stribiżew
JFerro

1 Answer

You can use:

    import re

    num = fr"""(\d\d)([A-Z])?"""
    sep = fr"""and |or |, """

    # pattern composition: one num, then zero or more "sep num" pairs
    pattern = fr"""{num}(?:\s*(?:{sep})\s*{num})*"""

    text = """biscuits 10 are good
    biscuits 20 and 30 are good
    biscuits 40 and hot dog are good
    but this one 50A and 50B and not ok"""

    refs = re.finditer(pattern, text, re.VERBOSE)
    for ref in refs:
        TEXT = ref.group()  # group() with no argument returns the whole match
        print(TEXT)

See the Python demo. Once `re.VERBOSE` has dropped the literal whitespace, and leaving the capturing groups aside, the regex will look like

\d\d[A-Z]?(?:\s*(?:and|or|,)\s*\d\d[A-Z]?)*

See the regex demo. Details:

  • `\d\d[A-Z]?` - two digits and an optional uppercase ASCII letter
  • `(?:\s*(?:and|or|,)\s*\d\d[A-Z]?)*` - zero or more repetitions of
    • `\s*(?:and|or|,)\s*` - `and`, `or` or `,` surrounded by zero or more whitespace characters
    • `\d\d[A-Z]?` - two digits and an optional uppercase ASCII letter
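
Since the question mentions building the pattern on the fly, the same composition could be wrapped in a small helper. This is only a minimal sketch under assumptions: `build_pattern` and its parameters are hypothetical names that do not appear in the code above, the separators are assumed to be plain literal strings, and `re.VERBOSE` is deliberately not used so that no whitespace is silently dropped.

```python
import re
from typing import Iterable


def build_pattern(num: str, separators: Iterable[str]) -> "re.Pattern[str]":
    """Compose `num (sep num)*` from a number sub-pattern and literal separators.

    Hypothetical helper for illustration only.
    """
    # Escape each separator so that characters such as "." or "+" match literally.
    sep = "|".join(re.escape(s) for s in separators)
    # Non-capturing groups isolate each part; \s* absorbs the surrounding whitespace.
    return re.compile(rf"(?:{num})(?:\s*(?:{sep})\s*(?:{num}))*")


text = """biscuits 10 are good
biscuits 20 and 30 are good
biscuits 40 and hot dog are good
but this one 50A and 50B and not ok"""

pattern = build_pattern(r"\d\d[A-Z]?", ["and", "or", ","])
matches = [m.group() for m in pattern.finditer(text)]
print(matches)  # ['10', '20 and 30', '40', '50A and 50B']
```

Wrapping `num` and the separator alternation in their own non-capturing groups means either part can contain a top-level `|` without changing how the pieces combine.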
Wiktor Stribiżew
  • Cool, it works perfectly. Can you add a line of explanation about the use of the `?:` non-capturing group? – JFerro Jun 28 '22 at 15:07
  • Because if you take the `\s*` out of the equation, the difference between your regex and mine is the use of `?:`. – JFerro Jun 28 '22 at 15:08
  • @JFerro See [What is a non-capturing group in regular expressions?](https://stackoverflow.com/q/3512471/3832970). The `\s*` is crucial since you expect matches with whitespace chars in them, and your pattern did not match whitespace. The non-capturing group is of little importance since you use `re.finditer`. – Wiktor Stribiżew Jun 28 '22 at 16:19
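
The two points in the last comment can be checked with a short experiment. This is an illustrative sketch only: the `sample` string and the variable names are made up here, and the patterns are simplified versions of the ones in the thread. It relies on the documented behaviour that `re.VERBOSE` ignores unescaped whitespace in the pattern outside character classes.

```python
import re

sample = "biscuits 20 and 30 are good"

# Under re.VERBOSE the literal spaces are dropped, so this is really \d\dand\d\d.
no_space = re.findall(r"\d\d and \d\d", sample, re.VERBOSE)

# \s* is an explicit token, so it survives re.VERBOSE and matches the spaces.
with_space = re.findall(r"\d\d \s* and \s* \d\d", sample, re.VERBOSE)

# Capturing vs. non-capturing: group() is identical, only groups() differs.
capturing = re.search(r"(\d\d)(\s*and\s*\d\d)*", sample)
non_capturing = re.search(r"(?:\d\d)(?:\s*and\s*\d\d)*", sample)

print(no_space)                                        # []
print(with_space)                                      # ['20 and 30']
print(capturing.group(), capturing.groups())           # 20 and 30 ('20', ' and 30')
print(non_capturing.group(), non_capturing.groups())   # 20 and 30 ()
```

So the whole match returned by `group()` is the same either way; what makes the difference for the original pattern is matching the whitespace explicitly.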