Interesting problem. I've put up an implementation with a few tests on GitHub here. Read on for some explanation.
ben@nixbox:~/bin$ time python kmers.py ../E-coli.txt 9 500 3
(500, 3)-clumps of 9-mers found in that file: 1904
real 0m15.510s
user 0m14.241s
sys 0m0.956s
The issue here (as is common with big data) really comes down to choosing the right data structure and making some time/space tradeoffs. If you choose correctly, you can do this in time linear in the length of your genome and space linear in the length of your sliding window. But I'm getting ahead of myself. Let's explain the problem visually (mostly so I can understand it :-)).

There is a (20,3)-clump of 3-mers in this window: "CAT". There are some other ones too ("AAA" for one), but this example illustrates what k, L, and t are doing.
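If it helps to see those definitions in code, here is a naive check for a single window - which k-mers occur at least t times in it? (The window string below is made up for illustration; it isn't the one from the figure.)
from collections import Counter

def kmers_at_threshold(window, k, t):
    # Count every k-mer in this one window and keep those seen at least t times.
    counts = Counter(window[i:i + k] for i in range(len(window) - k + 1))
    return {kmer for kmer, n in counts.items() if n >= t}

print(kmers_at_threshold("CATCATACATAAATAAAAAT", 3, 3))   # contains both 'CAT' and 'AAA'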
Now, on to the algorithm. Let's reduce the problem even further so we can visualize how we'd parse and store it: let's look at a simple (5,3)-clump of 3-mers.

Brackets denote our sliding window of width 5 here. We can see that in our window our 3-mers break down to ATA, TAA, and AAA. When we slide the window one to the right, ATA drops out and we gain a second AAA. When we slide the window to the right another time, TAA drops out and we gain a third AAA - and we've found a (5,3)-clump of AAAs.
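In code, on a sequence consistent with that walkthrough (ATAAAAA - the exact string from the figure isn't reproduced here), the three window positions look like this:
seq, k, L = "ATAAAAA", 3, 5
for start in range(len(seq) - L + 1):
    window = seq[start:start + L]
    print(window, [window[i:i + k] for i in range(L - k + 1)])
# ATAAA ['ATA', 'TAA', 'AAA']
# TAAAA ['TAA', 'AAA', 'AAA']
# AAAAA ['AAA', 'AAA', 'AAA']   <- three AAAs: a (5,3)-clump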
Trivial, obviously, but useful for figuring out how we'd treat larger clumps - importantly, we don't discard the entire previous window's data when we shift our window; we just discard the first k-mer and append the new one at the end of our window. The next insight is that we can use a hash-backed structure (in Python, a dict) to count the k-mers within our window. This removes the need to linearly search through our data structure to figure out how many of a particular k-mer are in there.
So together those two requirements - remembering order of insertion, and a hash-backed data structure - mean that we should make a custom class that maintains a list - or better, a deque - of each k-mer in your window, and a dict - or better, a Counter - to keep track of the frequency of each k-mer in your deque. Note that an OrderedDict comes close to doing all of this for you, but not quite; popping off the oldest k-mer would be wrong if its count is greater than 1.
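Before wrapping this in a class, here is a minimal sketch of that bookkeeping for one window-slide, assuming a deque of the k-mers currently in the window and a Counter of their frequencies (the names window, counts, and slide are illustrative, not from the final class):
from collections import Counter, deque

window = deque(["ATA", "TAA", "AAA"])   # k-mers currently inside the window, oldest first
counts = Counter(window)                # frequency of each k-mer inside the window

def slide(new_kmer):
    counts[window.popleft()] -= 1   # the oldest k-mer falls out of the window
    window.append(new_kmer)         # the newest k-mer enters at the other end
    counts[new_kmer] += 1

slide("AAA")
print(counts["AAA"])   # 2 - no need to rescan the whole window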
Another thing that you really should be using to simplify your code here is a proper sliding window iterator.
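The get_clumps function below assumes such an iterator; a plausible version, adapted from the usual itertools sliding-window recipe and joining each window back into a string (the one on github may differ in detail), looks like this:
from collections import deque
from itertools import islice

def sliding_window(seq, n):
    # Yield every length-n substring of seq, advancing one character at a time.
    it = iter(seq)
    window = deque(islice(it, n), maxlen=n)
    if len(window) == n:
        yield ''.join(window)
    for ch in it:
        window.append(ch)
        yield ''.join(window)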
Putting it all together:
def get_clumps(genome, k, L, t):
    kmers = KmerSequence(L-k, t)
    for kmer in sliding_window(genome, k):
        kmers.add(kmer)
    return kmers.clumps
from collections import Counter, deque

class KmerSequence(object):
    __slots__ = ['order', 'counts', 'limit', 'clumps', 't']

    def __init__(self, limit, threshold):
        self.order = deque()     # k-mers currently in the window, oldest first
        self.counts = Counter()  # frequency of each k-mer currently in the window
        self.limit = limit       # L - k, so the window holds at most limit + 1 k-mers
        self.clumps = set()      # every k-mer that has ever reached the threshold
        self.t = threshold

    def add(self, kmer):
        if len(self.order) > self.limit:
            self._remove_oldest()
        self._add_one(kmer)

    def _add_one(self, kmer):
        self.order.append(kmer)
        new_count = self.counts[kmer] + 1
        self.counts[kmer] = new_count
        if new_count == self.t:
            self.clumps.add(kmer)

    def _remove_oldest(self):
        self.counts[self.order.popleft()] -= 1
Usage:
with open(genomefile) as f:
    genome = f.read()
k = 9
L = 500
t = 3
clumps = get_clumps(genome, k, L, t)
As mentioned at the top, the full code - which includes some accessory functions and handling for when the script is run directly as __main__ - is on GitHub here. Feel free to fork, etc.