I'm having trouble understanding regex behaviour when using lookahead.

I have a string containing two overlapping patterns (each starting with M and ending with p). My expected output is MGMTPRLGLESLLEp and MTPRLGLESLLEp. My Python code below instead produces two empty strings, whose start positions coincide with the starts of the expected matches.

Removing the lookahead (?=) gives only ONE output string, the longer one. Is there a way to modify my regex so that I get both results, rather than empty strings, with a single pattern?

import re

string = 'GYMGMTPRLGLESLLEpApMIRVA'

pattern = re.compile(r'(?=M(.*?)p)')
sequences = pattern.finditer(string)

for results in sequences:
    print(results.group())
    print(results.start())
    print(results.end())
Martijn Pieters
Ringsi

1 Answer

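The look-ahead trick for overlapping matches relies on the fact that a (?=...) assertion is zero-width: it matches at an empty position without consuming any characters, so the overall match, results.group(), is always the empty string. The text you are after is in the capturing group nested inside the look-ahead.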
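To make that concrete, here is a minimal sketch using the string and pattern from the question. The name rows is only illustrative. It collects the whole match, its span, and group 1 side by side, so you can see that the match itself is empty while the group holds the text:

```python
import re

string = 'GYMGMTPRLGLESLLEpApMIRVA'
pattern = re.compile(r'(?=M(.*?)p)')

# For each match keep: the whole match (group 0), its span, and group 1.
rows = [(m.group(0), m.span(), m.group(1)) for m in pattern.finditer(string)]

for row in rows:
    print(row)

# ('', (2, 2), 'GMTPRLGLESLLE')
# ('', (4, 4), 'TPRLGLESLLE')
```

The spans (2, 2) and (4, 4) are zero-width, which is why results.start() and results.end() print the same number for each match.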

You need to print out group 1, explicitly:

for results in sequences:
    print(results.group(1))

This produces:

GMTPRLGLESLLE
TPRLGLESLLE

You probably want to include the M and p characters in the capturing group:

pattern = re.compile(r'(?=(M.*?p))')

at which point your output becomes:

MGMTPRLGLESLLEp
MTPRLGLESLLEp
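
If you also need the positions, note that results.start() and results.end() describe the empty overall match. A sketch of one way to get the positions of the captured text instead, assuming the same input string, is to pass the group number to start() and end(). The names overlapping and texts are only illustrative:

```python
import re

string = 'GYMGMTPRLGLESLLEpApMIRVA'
pattern = re.compile(r'(?=(M.*?p))')

# start(1) / end(1) give the span of group 1, not of the empty match.
overlapping = [(m.group(1), m.start(1), m.end(1)) for m in pattern.finditer(string)]

for text, start, end in overlapping:
    print(text, start, end)

# MGMTPRLGLESLLEp 2 17
# MTPRLGLESLLEp 4 17

# With exactly one capturing group, findall() returns the group text directly.
texts = pattern.findall(string)
```

If only the strings matter, the findall() form is probably the shortest option.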
Martijn Pieters