
I have a regex built from multiple patterns. It works fine, except that each result tuple contains redundant items. When I run the code below:

import re

re1 = r'SENT: (\w+)\_\d{4}(\d+)'
re2 = r'SENT: (\w+)\s\w*\s\w{4}(\d{4})'
re3 = r'SENT: (\w+)\s\w+\s(\d{4})'

sentences = ['SENT: xyz File 20210630.csv', 'SENT: xyz_20210630_Details.csv', 'SENT: xyz File 070121.txt']

for sentence in sentences:
    generic_re = re.compile("(%s|%s|%s)" % (re1, re2, re3)).findall(sentence)
    print(generic_re)

Output:

[('SENT: xyz File 20210630', '', '', 'xyz', '0630', '', '')]
[('SENT: xyz_20210630', 'xyz', '0630', '', '', '', '')]
[('SENT: xyz File 0701', '', '', '', '', 'xyz', '0701')]

The full match 'SENT: xyz File 20210630' and the empty strings '' are redundant. How can I get rid of them and keep only the two groups, (xyz) and (0630), in the output?

Alani
  • Use `generic_re = re.compile("%s|%s|%s" % (re1, re2, re3)).findall(sentence)`, that is, remove the outer capturing group. The issue is that `findall` returns all captured substrings; if you do not want some of them, you need to either remove the capturing parentheses or turn the unwanted group into a non-capturing one. – Wiktor Stribiżew Jul 05 '21 at 16:23
  • Thanks, it worked perfectly. What about the empty '' captures? Is there a way to get rid of them as well? – Alani Jul 05 '21 at 16:28
  • I suggest just post-processing the matches and removing all empty items from the tuples. Otherwise, you will need to rework the three alternatives into a single pattern so that there is only one set of capturing groups. – Wiktor Stribiżew Jul 05 '21 at 16:30
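
For reference, the two suggestions from the comments (no outer capturing group, then filtering out the empty items) could be combined roughly as follows. This is only a sketch: `extract` is a hypothetical helper name, and the filter assumes that no group can legitimately match an empty string, which holds for these three patterns because every group needs at least one character.

```python
import re

# Patterns from the question, written as raw strings.
re1 = r'SENT: (\w+)\_\d{4}(\d+)'
re2 = r'SENT: (\w+)\s\w*\s\w{4}(\d{4})'
re3 = r'SENT: (\w+)\s\w+\s(\d{4})'

# No outer parentheses here, so findall() returns only the six inner groups.
generic_re = re.compile("%s|%s|%s" % (re1, re2, re3))


def extract(sentence):
    """Hypothetical helper: return each match as a tuple without empty groups."""
    return [tuple(g for g in groups if g)
            for groups in generic_re.findall(sentence)]


sentences = ['SENT: xyz File 20210630.csv',
             'SENT: xyz_20210630_Details.csv',
             'SENT: xyz File 070121.txt']

results = [extract(sentence) for sentence in sentences]
for result in results:
    print(result)
```

For the three sample sentences this should print `[('xyz', '0630')]`, `[('xyz', '0630')]` and `[('xyz', '0701')]`.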

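The other option mentioned in the comments, reworking the three alternatives into one pattern, might look like the sketch below. The parts that differ between the formats go into a single non-capturing group `(?:...)`, so only two capturing groups remain and `findall` has nothing empty to return. One assumption to note: the trailing `(\d+)` of the first pattern is replaced by `(\d{4})` so that all three formats can share the same final group. That gives the same result for the sample data but is not strictly equivalent for longer digit runs. The name `combined_re` is made up for this example.

```python
import re

# Single pattern: shared prefix, non-capturing alternation for the part that
# differs between the three formats, shared four-digit suffix.
# Assumption: the original (\d+) in the first pattern is narrowed to (\d{4}).
combined_re = re.compile(
    r'SENT: (\w+)'
    r'(?:_\d{4}|\s\w*\s\w{4}|\s\w+\s)'
    r'(\d{4})'
)

sentences = ['SENT: xyz File 20210630.csv',
             'SENT: xyz_20210630_Details.csv',
             'SENT: xyz File 070121.txt']

combined_results = [combined_re.findall(sentence) for sentence in sentences]
for result in combined_results:
    print(result)
```

With exactly two capturing groups, `findall` returns a list of 2-tuples directly, so no post-processing is needed.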