
My requirement is simple, but I cannot figure out how to meet it.

This is the original string: ACCCTNGGATGTGGGGGGATGTCCCCCATGTGCTCG. I want to find all the substrings that consist only of [ACGT], end with ATGT, and have a length of at least 8. What I expect is:

GGATGTGGGGGGATGT
GGATGTGGGGGGATGTCCCCCATGT

With the following code:

import re

seq = 'ACCCTNGGATGTGGGGGGATGTCCCCCATGTGCTCG'

matches = re.findall("[ACGT]{4,}ATGT", seq)

if matches:
    for match in matches:
        print(match)

I got only

GGATGTGGGGGGATGTCCCCCATGT

The shorter one is missing. Then I realized that re.findall does not return overlapping matches. I found a solution in "How to use regex to find all overlapping matches" and modified the code to:

matches = re.findall("(?=([ACGT]{4,}ATGT))", seq)

Then I got:

GGATGTGGGGGGATGTCCCCCATGT
GATGTGGGGGGATGTCCCCCATGT
ATGTGGGGGGATGTCCCCCATGT
TGTGGGGGGATGTCCCCCATGT
GTGGGGGGATGTCCCCCATGT
TGGGGGGATGTCCCCCATGT
GGGGGGATGTCCCCCATGT
GGGGGATGTCCCCCATGT
GGGGATGTCCCCCATGT
GGGATGTCCCCCATGT
GGATGTCCCCCATGT
GATGTCCCCCATGT
ATGTCCCCCATGT
TGTCCCCCATGT
GTCCCCCATGT
TCCCCCATGT
CCCCCATGT
CCCCATGT

Then I realized that this searching starts from right to left. So how can I ask re.findall to search from left to right and also allow for overlapping?

Xiaokang
  • See https://ideone.com/RgsP2Z – Wiktor Stribiżew Aug 29 '22 at 14:05
  • "Then I realized that this searching starts from right to left." - What? No, it doesn't. You see `GGATGTGGGGGGATGTCCCCCATGT` first because that is the first thing that matches the pattern, scanning from left to right. It starts immediately after the `N` in the input (the matches may not contain `N`, so that is as far to the left as it can start). – Karl Knechtel Apr 16 '23 at 15:02
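
The point made in the comment above can be checked with a small sketch that uses only the standard `re` module. It runs the same lookahead pattern as the question through `re.finditer`, which exposes positions. The names `hits`, `starts`, and `ends` are introduced here for illustration only.

```python
import re

seq = 'ACCCTNGGATGTGGGGGGATGTCCCCCATGTGCTCG'

# Same lookahead pattern as in the question; finditer exposes the positions
# of the captured group instead of only its text.
hits = [(m.start(1), m.end(1)) for m in re.finditer(r"(?=([ACGT]{4,}ATGT))", seq)]

starts = [s for s, _ in hits]   # where each captured match begins
ends = {e for _, e in hits}     # where each captured match stops

print(starts)  # increasing start positions: the scan runs left to right
print(ends)    # a single end position: the greedy quantifier always reaches the last ATGT
```

The start positions increase one by one, so the scan is left to right. Every match stops at the same place because `[ACGT]{4,}` is greedy, which is why the shorter expected substring never appears.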

1 Answer


You can use the regex module from PyPI, which supports reversed and overlapped matching, with only a small addition to your initial pattern:

(?r)[ACGT]{4,}ATGT

For example:

import regex as re
seq = 'ACCCTNGGATGTGGGGGGATGTCCCCCATGTGCTCG'
matches = re.findall(r'(?r)[ACGT]{4,}ATGT', seq, overlapped=True)
print(matches)

Prints:

['GGATGTGGGGGGATGTCCCCCATGT', 'GGATGTGGGGGGATGT']
JvdV
  • ++ learnt this new `(?r)` feature – anubhava Aug 29 '22 at 18:02
  • This is a great answer and very well explained. Could you please explain more about this overlapped flag? Thank you. – RavinderSingh13 Aug 29 '22 at 21:49
  • @RavinderSingh13, thanks. The logic is rather simple: Reading right to left makes it possible to anchor on the fixed substring rather than on all the positions that match arbitrary substrings. Therefore, matching right to left first finds the latest occurrence of `ATGT` preceded by 4+ characters in `[ACGT]`. Then the machine keeps looking for more matches further to the left, that is, more fixed substrings matching `ATGT`, allowing a starting position within the previous match. There are only two occurrences in this use case. – JvdV Aug 30 '22 at 06:32
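
For readers who cannot install the third-party module, the "anchor on the fixed suffix" idea from the last comment can be approximated with the standard `re` module. This is a minimal sketch, not the answer's method: the helper name `left_extended_matches` and its parameters are hypothetical, and it assumes the suffix is made up of characters from the allowed alphabet. It splits the input into maximal runs of allowed characters, locates every occurrence of the suffix inside each run, and keeps the prefix of the run that ends there if it is long enough.

```python
import re

def left_extended_matches(seq, suffix="ATGT", alphabet="ACGT", min_len=8):
    """Hypothetical helper: for each occurrence of `suffix`, return the longest
    substring over `alphabet` that ends there, if it has at least `min_len` characters."""
    results = []
    # Maximal runs of allowed characters; anything else (such as N) breaks a run.
    for run in re.finditer(f"[{alphabet}]+", seq):
        text = run.group()
        # A zero-width lookahead finds every suffix occurrence, overlapping or not.
        for hit in re.finditer(f"(?={re.escape(suffix)})", text):
            end = hit.start() + len(suffix)
            if end >= min_len:
                # Extend as far left as the run allows.
                results.append(text[:end])
    return results

seq = 'ACCCTNGGATGTGGGGGGATGTCCCCCATGTGCTCG'
print(left_extended_matches(seq))
# ['GGATGTGGGGGGATGT', 'GGATGTGGGGGGATGTCCCCCATGT']
```

On the string from the question this yields the two expected substrings, in left-to-right order. It is not claimed to reproduce the `(?r)` and `overlapped=True` behaviour of the regex module for every pattern.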