
I'd like to match all occurrences of a substring in Python.

I found this, but I'd like to match occurrences of a substring that are separated by at most some distance (for example, a maximum of 6). So if I have the string

AATCGTGCCGTGTGCCCCAAAATGAACGCGCCGCTGTG

I want to get the position of every TG that has another TG at most 6 characters away. So in the example above I'd like to get [5, 10, 12, 34, 36]. I don't want the position of the middle TG, because it is too far away from either "group" (about 10 characters).

I tried with this:

(?=TG(?:.+){1,6}?)

but it doesn't work.

EDIT

I created a regex that returns all the positions I want, except the last one in each group.

(?=TG.{0,6}TG)

Using the example above, the returned positions are marked with |

AATCG|TGCCG|TGTGCCCCAAAATGAACGCGCCGC|TGTG

but I'd also like to get the positions marked with \

AATCG|TGCCG|TG\TGCCCCAAAATGAACGCGCCGC|TG\TG

I know why it doesn't work: it matches every TG that is followed by 0-6 arbitrary characters and one more TG. But I can't figure out what I should add to make it work.
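
For reference, here is a minimal sketch that reproduces this behaviour with Python's built-in re module. The function name lookahead_positions is only illustrative; the pattern and the string are the ones from above.

```python
import re

text = "AATCGTGCCGTGTGCCCCAAAATGAACGCGCCGCTGTG"

def lookahead_positions(s):
    # The zero-width lookahead succeeds only at a TG that is followed by
    # another TG within 0-6 characters, so the last TG of each group,
    # which has no such follower, is never reported.
    return [m.start() for m in re.finditer(r"(?=TG.{0,6}TG)", s)]

print(lookahead_positions(text))  # [5, 10, 34]
```

Positions 12 and 36 are missing because each is the final TG of its group.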

thecoparyew

2 Answers


If using a regex, I think this should work.
The position has to be obtained from the position of capture group 1.

Note - I think your position 34 is not valid.

(?:^.*?|(?<=TG).{0,4})(TG)
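
As a rough sketch of how the position could be read from capture group 1 in Python, assuming the built-in re module and the string from the question (group1_positions is just an illustrative name):

```python
import re

text = "AATCGTGCCGTGTGCCCCAAAATGAACGCGCCGCTGTG"
pattern = re.compile(r"(?:^.*?|(?<=TG).{0,4})(TG)")

def group1_positions(s):
    # m.start(1) is the start index of capture group 1, i.e. of the TG
    # itself, rather than the start of the whole match.
    return [m.start(1) for m in pattern.finditer(s)]

print(group1_positions(text))  # [5, 10, 12, 36]
```

With this pattern, position 34 is not reported: after the first match, a TG is only accepted when a previous match ended on a TG at most 4 characters before it.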

Edit:

You can also do it this way to get the start of a new group of TG's.

Using \K - no capture groups: (?:(?<=TG).{0,4}\KTG|.*?\KTG(?=.{0,4}TG))
Formatted:

 (?:
      (?<= TG )
      .{0,4} 
      \K 
      TG
   |  
      .*? 
      \K 
      TG
      (?= .{0,4} TG )
 )

Without Branch Reset - 2 capture groups: (?:(?<=TG).{0,4}(TG)|.*?(TG)(?=.{0,4}TG))
Formatted:

 (?:
      (?<= TG )
      .{0,4} 
      ( TG )                        # (1)
   |  
      .*? 
      ( TG )                        # (2)
      (?= .{0,4} TG )
 )

With Branch Reset - 1 capture group: (?|(?<=TG).{0,4}(TG)|.*?(TG)(?=.{0,4}TG))

Formatted:

 (?|
      (?<= TG )
      .{0,4} 
      ( TG )                        # (1)
   |  
      .*? 
      ( TG )                        # (1)
      (?= .{0,4} TG )
 )
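
A note for Python users, as far as I know: the built-in re module supports neither \K nor the branch reset group (?|...), although the third-party regex module is documented to support both. The two-capture-group variant does run under re. The sketch below assumes the string from the question and uses an illustrative function name; it picks whichever group took part in each match via m.lastindex.

```python
import re

text = "AATCGTGCCGTGTGCCCCAAAATGAACGCGCCGCTGTG"

# Two-capture-group variant: group 1 is a TG that closely follows a
# previous TG, group 2 is a TG that starts a new group of TGs.
pattern = re.compile(r"(?:(?<=TG).{0,4}(TG)|.*?(TG)(?=.{0,4}TG))")

def grouped_tg_positions(s):
    # Exactly one of the two groups participates in each match;
    # m.lastindex is the index of that group.
    return [m.start(m.lastindex) for m in pattern.finditer(s)]

print(grouped_tg_positions(text))  # [5, 10, 12, 34, 36]
```

Here .{0,4} allows up to 4 characters between two TGs, which corresponds to start positions that differ by at most 6.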
1

You could do something like this, where you first find all the occurrences and then check the distance between neighbouring indexes afterwards:

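text = 'AATCGTGCCGTGTGCCCCAAAATGAACGCGCCGCTGTG'
max_dist = 6  # maximum difference between two start positions
positions = []

# Get all positions of 'TG'; find() returns -1 when nothing is left
index = text.find('TG')
while index != -1:
    positions.append(index)
    index = text.find('TG', index + 2)  # +2 because len('TG') == 2

# Keep a position only if a neighbouring match is at most max_dist away
result = []
for i, pos in enumerate(positions):
    # check distance in both directions, guarding the list boundaries
    near_prev = i > 0 and pos - positions[i - 1] <= max_dist
    near_next = i + 1 < len(positions) and positions[i + 1] - pos <= max_dist
    if near_prev or near_next:
        result.append(pos)

# print resulting list
print(result)  # [5, 10, 12, 34, 36]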

I haven't checked it, but it should do the job.

Johan E. T.