1

I'm trying to write a programme that shifts through the elements, of defined length, of a DNA sequence, I can't understand the output I'm getting from the loop. It seems to frameshift fine for the first four iterations of the loop, then seems to revert to old sequences. I've tried really hard to understand the behaviour but I'm too new to programming to solve this, any help much appreciated.

Here is my code:

seq = "ACTGCATTTTGCATTTT"

search = "TGCATTTTG"

import regex as re

def kmers(text,n):
  for a in text:
    b = text[text.index(a):text.index(a)+n]
    c = len(re.findall(b, text, overlapped=True))
    print ("the count for " + b + " is " + str(c))

(kmers(seq,3))

and my output:

the count for ACT is 1
the count for CTG is 1
the count for TGC is 2
the count for GCA is 2
#I expected 'CAT' next, from here on I don't understand the behaviour

the count for CTG is 1 
the count for ACT is 1
the count for TGC is 2
the count for TGC is 2
the count for TGC is 2
the count for TGC is 2
the count for GCA is 2
the count for CTG is 1
the count for ACT is 1
the count for TGC is 2
the count for TGC is 2
the count for TGC is 2
the count for TGC is 2

Obviously eventually I want to remove duplicates, etc, but being stuck on why my for loop isn't working how I expected it to has stopped me in my tracks to make this better.

Thanks

Lydia Thomas
  • 93
  • 1
  • 5

1 Answers1

4

text.index always returns the first index found. Since you iterate your seq letter by letter, the first time you hit a previously found letter you get weird results.

The 5th letter is the first duplicate, a c, and so text.index('c') is returning the index of the first c, 1, and not 4 as you expect - and you duplicate the previous time you hit c.

This method is inefficient - you seem to be more interested in moving across indices than letters, so I would use:

for a in range(len(text)-(n-1)):
    b = text[a:a+n]
    c = len(re.findall(b, text, overlapped=True))
    print ("the count for " + b + " is " + str(c))

Instead of searching for the index each time, which is both inefficient and in your case produces wrong results. findall is also an inefficient way of counting here - a dictionary, specifically defaultdict might be constructed to count more efficiently.

Note there are already nice builtins you may use:

>>> from collections import Counter
>>> seq='ACTGCATTTTGCATTTT'
>>> Counter((seq[i:i+3] for i in range(len(seq)-2)))
Counter({'TTT': 4, 'TGC': 2, 'GCA': 2, 'CAT': 2, 'ATT': 2, 'ACT': 1, 'CTG': 1, 'TTG': 1})

The final hits are where the string ends, and you can disregard them.

kabanus
  • 24,623
  • 6
  • 41
  • 74
  • please don't suggest unpythonic pattern like `range(len(text))`, to access indexes use `enumerate(text)` instead. – buran Dec 11 '18 at 12:32
  • Maybe stop the loop `n` steps before the end, to avoid incompletely populated windows. The example looks like the OP only wants substrings of length 3, not shorter ones when you approach the end of the string. – tripleee Dec 11 '18 at 12:34
  • 1
    @buran Please do not suggest what is un-pythonic, or even use that phrase. If you do not need an enumeration (that is, an index and the letter) I do not see a reason to use it. – kabanus Dec 11 '18 at 12:35
  • @tripleee good point, I jsut hastily typed something - adding. – kabanus Dec 11 '18 at 12:35
  • @buran Here is an interesting debate by the way https://stackoverflow.com/questions/11901081/only-index-needed-enumerate-or-xrange. You can see your opinion is slightly in the lead, but I would say far from a clear consensus. – kabanus Dec 11 '18 at 12:38
  • @kabanus, sorry, my bad - I overlooked that you don't need/will not use the letter. thanks for the second link. Anyway, if one need the index and item from the iterable - enumerate is the pattern :-) – buran Dec 11 '18 at 12:41
  • @buran No problem In general the whole Pythonic things tends to get religious at times - there are some clear and cut things that are significantly worse in a lot of parameters (using `map` instead of a comprehension pops to mind), but in general I favor to do what fits you best, either readibility or optimality or whatever wise. – kabanus Dec 11 '18 at 12:42
  • Thank you all very much! I wouldn't have got there. – Lydia Thomas Dec 11 '18 at 13:46