
I'm working on an assignment in Python and have a question. I want to write a function that returns a list with the locations of the first nucleotide of all occurrences of "ATG" in a sequence. For example, say our DNA sequence is AATGCATGC. ATG starts at index 1, and again at index 5. I tried this to solve the assignment:

dna = "AATGCATGC"
starting_offset = dna.index("ATG")
print(starting_offset)

The result I get is 1, but I want to get [1, 5].

So how should I write this function for all occurrences?

Thanks for helping me :)

  • sounds like a job for itertools - search python & itertools and look through the api – Patrick Artner Nov 14 '17 at 15:50
  • if you want to do it yourself: find the first occurrence [x]. Matches cannot overlap (ATG cannot start inside another ATG), so create a shorter string starting at your find position + len(ATG) and find the next index. Accumulate the indexes until fewer than len(ATG) characters are left. – Patrick Artner Nov 14 '17 at 15:53
  • @strawberry: for an example of that approach, see my answer – Patrick Artner Nov 14 '17 at 17:27
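
The do-it-yourself approach described in the comments can also be sketched without slicing the string, by passing a start offset to str.find. This is only a minimal sketch: the function name find_all is hypothetical, and it assumes that non-overlapping matches are wanted, as in the comment above.

```python
def find_all(sequence, term):
    """Return the start index of every non-overlapping occurrence of term."""
    if not term:
        # An empty term would never advance the search position.
        raise ValueError("term must be a non-empty string")
    positions = []
    start = sequence.find(term)  # str.find returns -1 when nothing is found
    while start != -1:
        positions.append(start)
        # Resume the search just past the end of the current match.
        start = sequence.find(term, start + len(term))
    return positions


print(find_all("AATGCATGC", "ATG"))  # [1, 5]
```

Because the offset is passed to str.find directly, the indexes already refer to the full sequence and no running total is needed.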

2 Answers


Using regular expressions, you can use re.finditer to find all occurrences. You can try this function:

import re

text = 'AATGCATGC'
pattern = 'ATG'

def getIndexes(text, pattern):
    # re.escape makes the pattern match as literal text, not as a regex
    return [match.start() for match in re.finditer(re.escape(pattern), text)]

print(getIndexes(text, pattern))
>>[1, 5]

This gives you the list you're looking for. Hope that helps!
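
One caveat: re.finditer reports non-overlapping matches only. That makes no difference for "ATG", which cannot overlap itself, but it would for a term such as "AA". If overlapping hits are needed, a zero-width lookahead is one possible way to get them. The sketch below is an illustration under that assumption, and the name find_all_overlapping is hypothetical.

```python
import re


def find_all_overlapping(sequence, term):
    """Return start indexes of all occurrences of term, overlaps included."""
    # (?=...) is a lookahead: it matches at a position without consuming
    # characters, so the next search starts one character further on.
    pattern = re.compile("(?=" + re.escape(term) + ")")
    return [match.start() for match in pattern.finditer(sequence)]


print(find_all_overlapping("AATGCATGC", "ATG"))  # [1, 5]
print(find_all_overlapping("AAAA", "AA"))        # [0, 1, 2]
```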

Saghe Achraf

If you want something to think about, analyse this:

def GetMultipleInString(dna, term):
    # computing end condition 0
    if (term not in dna):
        print (dna + " does not contain the term " + term)
        return []

    # start of list of lists of 2 elements: index, rest
    result = [[None,dna]]

    # we look for the index in the rest, need to keep track how much we
    # shortened the string in total so far to get index in complete string
    totalIdx = 0

    # we look at the last element of the list until its length is shorter
    # than the term we look for (end of computing condition 1)
    termLen = len(term)

    while len(result[-1][1]) >= termLen:
        # get the last element
        last = result[-1][1]
        try:
            # find our term, if not found -> exception
            idx = last.index(term) 
            # partition "abcdefg" with "c" -> ("ab","c", "defg")
            # we take only the remaining 
            rest = last.partition(term)[2] 
            # we compute the total index, and put it in our result
            result.append( [idx+totalIdx , rest] ) 
            totalIdx += idx+termLen 
        except ValueError:  # str.index raises ValueError when term is absent
            result.append([None,last])
            break

    # any results found that are not None?
    if any(x[0] is not None for x in result):

        # get only the indexes from our results, as integers
        rv = [x[0] for x in result if x[0] is not None]
        print(dna + " contains the term " + term + " at positions: "
              + ' '.join(str(i) for i in rv))

        return rv

    else:
        print (dna + " does not contain the term " + term)
        return []

print("_----------------------------------_")
myDna = "AATGCATGC"  
res1 = GetMultipleInString(myDna,"ATG")   
print(res1)

res2 = GetMultipleInString(myDna,"A")
print(res2)
Patrick Artner