I have the following content:

ONE
1234234534564   123
34erewrwer323   123
123fsgrt43232   123
TWO
42433412133fr   234
fafafd3234132   342
THREE
sfafdfe345233   3234
FOUR
324ereffdf343   4323
fvdafasf34nhj   4323
fsfnhjdgh342g   4323

Consider ONE, TWO, THREE and FOUR as separate groups. I want to match only ONE and FOUR: a group should match only if it has more than one line and the second value on every line in the group is the same. How can I do that with a regular expression?

I have already tried the following regex, but it does not work as expected:

\w+\n\w+\t(\d+)(\n\w+\t\1){2,}
ᴀʀᴍᴀɴ
pavithran G

1 Answer

You may use

r'(?m)^[A-Z]+\r?\n\S+\s+(\d+)(?:\r?\n\S+\s+\1)+$'

Details

  • (?m) - enable re.MULTILINE mode to make ^ / $ match start and end of lines respectively
  • ^ - start of a line
  • [A-Z]+ - 1+ uppercase ASCII letters (adjust as you see fit)
  • \r?\n - a line break like CRLF or LF
  • \S+ - 1+ non-whitespace chars
  • \s+ - 1+ whitespace chars (or use \t if a tab is the field separator)
  • (\d+) - Capturing group 1, one or more digits
  • (?:\r?\n\S+\s+\1)+ - one or more repetitions of a line break followed by 1+ non-whitespace chars, 1+ whitespace chars and the same value as in Group 1, since \1 is a backreference to the value stored in that group
  • $ - end of line.
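
As a sketch, the breakdown above can also be written as a commented pattern using re.VERBOSE, which may be easier to adjust later. The names VERBOSE_PATTERN, demo and found, and the tiny input, are made up here for illustration and are not part of the question's data.

```python
import re

# Same pattern as above, spelled out with re.VERBOSE so each part can be commented.
# In verbose mode, whitespace in the pattern is ignored and "#" starts a comment.
VERBOSE_PATTERN = re.compile(r"""
    ^[A-Z]+             # header line: 1+ uppercase ASCII letters
    \r?\n               # line break (CRLF or LF)
    \S+\s+(\d+)         # first data line; group 1 captures the second field
    (?:                 # one or more further data lines ...
        \r?\n\S+\s+\1   # ... whose second field repeats group 1
    )+
    $                   # end of line
""", re.MULTILINE | re.VERBOSE)

# Hypothetical two-group input: AB has two lines sharing "1", CD has a single line.
demo = "AB\nx 1\ny 1\nCD\nz 2\n"
found = [m.group() for m in VERBOSE_PATTERN.finditer(demo)]
print(found)
```

Only the AB block is reported, because CD has a single data line and the trailing (?: ... )+ requires at least one more line with the same value.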

In Python, use re.finditer:

import re

# text is assumed to hold the input shown in the question
for m in re.finditer(r'(?m)^[A-Z]+\r?\n\S+\s+(\d+)(?:\r?\n\S+\s+\1)+$', text):
    print(m.group())
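
To check the pattern against the sample from the question, here is a minimal self-contained run. It assumes the two columns are separated by spaces or tabs and that group headers are all uppercase; SAMPLE and find_uniform_groups are names invented for this sketch.

```python
import re

PATTERN = re.compile(r'(?m)^[A-Z]+\r?\n\S+\s+(\d+)(?:\r?\n\S+\s+\1)+$')

# The content from the question, one list item per line.
SAMPLE = "\n".join([
    "ONE",
    "1234234534564   123",
    "34erewrwer323   123",
    "123fsgrt43232   123",
    "TWO",
    "42433412133fr   234",
    "fafafd3234132   342",
    "THREE",
    "sfafdfe345233   3234",
    "FOUR",
    "324ereffdf343   4323",
    "fvdafasf34nhj   4323",
    "fsfnhjdgh342g   4323",
])

def find_uniform_groups(text):
    """Return (header, shared_value) for each group matched by PATTERN."""
    results = []
    for m in PATTERN.finditer(text):
        header = m.group().splitlines()[0]  # first line of the match is the group name
        results.append((header, m.group(1)))
    return results

print(find_uniform_groups(SAMPLE))
```

TWO is skipped because its second values differ, and THREE is skipped because it has only one line.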

Wiktor Stribiżew