I'm trying to match all GY or YG combinations in my string QYGQGYGQQG using the re package in Python. I store the start position of each match in a dict for later look-up.

The problem arises when Y is flanked on both sides by G: my regex can't capture both GY and YG in GYG, because the two matches overlap.

This is my code so far:

import re
seq = 'QYGQGYGQQG'
regex = re.compile('(GY|YG)|(?<=Y)G')
iterator = regex.finditer(seq)
dd = {}
for matchedobj in iterator: 
    dd[matchedobj.group()] = dd.get(matchedobj.group(), []) + [matchedobj.start()]

Output:

{'G': [6], 'GY': [4], 'YG': [1]}
wimyang
  • The newer `regex` module supports overlapping matches. – Jan Apr 24 '20 at 14:42
  • Just use `(?=(YG|GY))` with findall – ctwheels Apr 24 '20 at 14:56
  • See [this](https://tio.run/##Dc2xCgIxDIDhvU@RrQlIQZwPx84di9wg2tOAaWsuDoLvXjv8y7f8/WvPVk9jsPSmBlrcXt6wgE85pjhL0TudoCXcmnR@FfR4XjDHX8xEnlxXroYXlLDbVQ3pABIe2j4dj0SwNQUBrqBh43pnK4rzQSuN8Qc) for the `finditer` alternative that also gets you the indices – ctwheels Apr 24 '20 at 15:03
  • Thanks all for the tips! I had no clue that "overlapping matches" was a thing.. – wimyang Apr 24 '20 at 16:49

2 Answers

You could use the newer third-party regex module, which supports overlapping matches (or use lookarounds with re):

import regex as re
seq = 'QYGQGYGQQG'

matches = re.findall(r'GY|YG', seq, overlapped=True)
print(matches)
# ['YG', 'GY', 'YG']

Or, with re.finditer, to get the positions as well:

for m in re.finditer(r'GY|YG', seq, overlapped=True):
    print(m.span())

Which would yield

(1, 3)
(4, 6)
(5, 7)
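
For the lookaround route mentioned at the top, here is a minimal sketch that needs only the standard-library re module. It assumes the goal is the dict of start positions from the question; the names pattern and positions are just illustrative. Wrapping the alternation in a lookahead makes every match zero-width, so no character is consumed and overlapping pairs are all found, while the capture group carries the matched text.

```python
import re

seq = 'QYGQGYGQQG'

# Zero-width lookahead: the pair is captured in group 1 but not consumed,
# so the scan resumes at the next character and overlaps are kept.
pattern = re.compile(r'(?=(GY|YG))')

positions = {}
for m in pattern.finditer(seq):
    positions.setdefault(m.group(1), []).append(m.start())

print(positions)
# {'YG': [1, 5], 'GY': [4]}
```

Note that m.group() would be an empty string here, so the pair has to be read from m.group(1).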
Jan

Here is a solution you may use which does not depend on overlapping matches:

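import re

seq = 'QYGQGYGQQG'
# Match a single G or Y, asserting (without consuming) the letter that completes the pair
matches = re.findall(r'G(?=Y)|Y(?=G)', seq)
# Expand each single letter back into its full pair: G -> GY, Y -> YG
print([re.sub(r'^Y', 'YG', x.replace('G', 'GY')) for x in matches])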

This prints:

['YG', 'GY', 'YG']

The trick here is to match only a single G or Y, using a lookahead to assert that what follows is the Y or G needed to complete the pair. This avoids consuming a second letter that might also be the first letter of a subsequent match. We then take those single-letter matches, each of which stands for a full pair, and use a list comprehension to rebuild the original overlapping matches.
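
The same idea can also record where each pair starts, which is what the question's dict needs. A rough sketch, assuming a small lookup table (pairs, a name introduced here for illustration) in place of the replace/re.sub step:

```python
import re

seq = 'QYGQGYGQQG'

# Each single-letter match stands for a full pair; this table is an
# illustrative stand-in for the replace/re.sub reconstruction.
pairs = {'G': 'GY', 'Y': 'YG'}

singles = []
positions = {}
for m in re.finditer(r'G(?=Y)|Y(?=G)', seq):
    singles.append(m.group())
    positions.setdefault(pairs[m.group()], []).append(m.start())

print(singles)
# ['Y', 'G', 'Y']
print(positions)
# {'YG': [1, 5], 'GY': [4]}
```

Because only one letter is consumed per match, m.start() is already the start index of the full pair.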

Tim Biegeleisen