How can I get the position of a matched substring (motif) inside a string (FASTA sequence) in Python?

I am reading a FASTA file as a string and searching it for a motif with the regular expression '[AGCT][TG][TC][GT]TG'. Along with the motif itself, I also want to find and save the position at which each motif occurs in the string.

rdict = dict([ (x[1],x[0]) for x in enumerate(Seq) ])
motif = '[AGCT][TG][TC][GT]TG'
#for match in Seq:
matches = re.findall(motif, Seq.upper())
print(matches)
Seq.index(matches)

The above code finds the motif, but it returns the position of only one character. How can I change it to give the start and end positions of each matched motif (the small string)?

DjaouadNM
Kay
  • If you know the position of one character, you also know the length of the match is 6, so what can't you do? –  Aug 01 '19 at 16:51
  • Maybe `matches = [x.span() for x in re.finditer(motif, Seq.upper())]`? – Wiktor Stribiżew Aug 01 '19 at 16:56
  • `iter = re.finditer(motif, Seq.upper()); indices = [m.start(0) for m in iter]` –  Aug 01 '19 at 16:57
  • See https://stackoverflow.com/questions/2674391/python-locating-the-position-of-a-regex-match-in-a-string/16360404 to get some ideas on how you can do this. –  Aug 01 '19 at 16:59
  • Yes, please let us know if https://stackoverflow.com/a/16360404/3832970 answers your question. – Wiktor Stribiżew Aug 01 '19 at 16:59
  • It's basically calling the `start()` method of the match object. You have access to the matched substring and its position. Create your arrays, maybe an array of arrays. –  Aug 01 '19 at 17:02
  • @sln Thanks for the links, but `findall` is the only option that has worked with FASTA sequences so far. I tried `finditer` and `re.search`, but they have issues with a list of strings. – Kay Aug 01 '19 at 17:58
  • @WiktorStribiżew Thanks, but `re.search` isn't good with lists. – Kay Aug 01 '19 at 18:00
  • Also, I tried something as ```binding = [] index = [] #print(matches) for match in Seq: matches = re.findall(motif, Seq.upper()) for char in matches: pos = Seq.index(matches[0]) if len(matches) > 0: dataframe = pd.DataFrame({'index':pos, 'binding':matches }) binding.append(matches) index.append(pos) print(len(matches)) dataframe.head()``` but the second loop with index is stuck at first position, any suggestions? – Kay Aug 01 '19 at 18:00
  • @Kay *re.search isn't good with lists* - I have nowhere advised to use `re.search`. What is your exact input? What is your exact expected output ? – Wiktor Stribiżew Aug 01 '19 at 20:02
  • @WiktorStribiżew The input is FASTA sequences that look like this: ```Seq=GGAGGGAGAAGCAGCCTGAACCGGGCTGGTCTCTCTGGGATTGGAGAGAAAGGTGGCGGAGaGCGGCGGGGGTGGGGGG``` and the expected output is:
    ```
    +------+-------+---------+
    |      | start | binding |
    +------+-------+---------+
    | 0    | 210   | GGCTTG  |
    | 1    | 317   | TTTTTG  |
    | 2    | 389   | GGCGTG  |
    | .... | ..    | ....    |
    | .... | ..    | ....    |
    | 3    | 810   | CGCGTG  |
    | 4    | 810   | CTCTTG  |
    +------+-------+---------+
    ```
    – Kay Aug 01 '19 at 21:47

1 Answer

To get multiple matches along with their start and end indices, use `re.finditer` instead of `re.findall`:

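import re

matches = re.finditer(motif, Seq.upper())

for match in matches:
    string_matched = match[0]      # the matched motif
    start_index = match.start(0)   # index of the first matched character
    end_index = match.end(0)       # index one past the last matched character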
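As a minimal sketch of how the loop might be wrapped up so the positions are actually kept, the function below returns one `(start, end, motif)` tuple per match. The name `find_motif_positions` and the short test sequence are made up for illustration; only `re.finditer`, `Match.start()`, `Match.end()` and `Match.group()` come from the standard library.

```python
import re

# Motif pattern taken from the question.
MOTIF = r"[AGCT][TG][TC][GT]TG"


def find_motif_positions(seq, motif=MOTIF):
    """Return a list of (start, end, matched_text) tuples.

    `start` is the 0-based index of the first matched character and
    `end` is one past the last matched character, so seq[start:end]
    gives the matched text back.
    """
    # Upper-case first so that lower-case bases in the sequence still match.
    return [
        (m.start(), m.end(), m.group(0))
        for m in re.finditer(motif, seq.upper())
    ]


# Hypothetical short sequence, not the asker's real data.
hits = find_motif_positions("aaGGCTTGaa")
print(hits)
```

Note that `re.finditer` only reports non-overlapping matches, scanning left to right.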
DjaouadNM
  • Thanks! But it throws an error: ```ValueError: If using all scalar values, you must pass an index``` – Kay Aug 01 '19 at 17:50
  • @Kay That's a pandas error, you didn't mention how you're using the above in a dataframe. – DjaouadNM Aug 01 '19 at 21:26
  • ```binding.append(string_matched) start.append(start_index) end.append(end_index) dataframe = pd.DataFrame({ 'binding':binding, 'start':start, 'end':end}) dataframe.head()``` – Kay Aug 01 '19 at 21:33
  • Thanks for the above. I'm creating lists of the matches and indices and then putting them together in a dataframe. – Kay Aug 01 '19 at 21:35
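For the list-of-strings case mentioned in the comments, one possible approach is to call `finditer` on each sequence in turn and append to parallel lists, as sketched below. The helper name `collect_columns`, the extra `sequence` column and the two sample sequences are assumptions for illustration only. Because every value in the returned dict is a list rather than a single scalar, passing it to `pd.DataFrame(...)` should not trigger the "If using all scalar values" error quoted above.

```python
import re

# Motif pattern taken from the question, compiled once for reuse.
MOTIF = re.compile(r"[AGCT][TG][TC][GT]TG")


def collect_columns(sequences):
    """Scan each sequence and return parallel lists, one per column."""
    columns = {"sequence": [], "start": [], "end": [], "binding": []}
    for seq_number, seq in enumerate(sequences):
        # finditer works on one string at a time, so loop over the list.
        for m in MOTIF.finditer(seq.upper()):
            columns["sequence"].append(seq_number)  # which input string
            columns["start"].append(m.start())      # first matched index
            columns["end"].append(m.end())          # one past the last index
            columns["binding"].append(m.group(0))   # the matched motif
    return columns


# Hypothetical sample sequences, not the asker's real data.
columns = collect_columns(["aaGGCTTGaa", "TTTTTGCC"])
print(columns)
```

With pandas installed, `pd.DataFrame(columns)` would then give one row per match.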