Interesting problem. I've put up an implementation with a few tests on GitHub here. Read on for some explanation.
ben@nixbox:~/bin$ time python kmers.py ../E-coli.txt 9 500 3
(500, 3)-clumps of 9-mers found in that file: 1904
real 0m15.510s
user 0m14.241s
sys 0m0.956s
The issue here (as is common with big data) really comes down to choosing the right data structure and making some time/space tradeoffs. If you choose correctly, you can do this in time linear in the length of your genome and space linear in the length of your sliding window. But I'm getting ahead of myself. Let's explain the problem visually (mostly so I can understand it :-)).

There is a (20,3)-clump of 3-mers in this window: "CAT". There are some other ones too ("AAA" for one), but this example illustrates what k, L, and t are doing.
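If it helps to see those definitions in code, here is a naive check for a single window - which k-mers occur at least t times in it? (The window string below is made up for illustration; it isn't the one from the figure.)
from collections import Counter

def kmers_at_threshold(window, k, t):
    # Count every k-mer in this one window and keep those seen at least t times.
    counts = Counter(window[i:i + k] for i in range(len(window) - k + 1))
    return {kmer for kmer, n in counts.items() if n >= t}

print(kmers_at_threshold("CATCATACATAAATAAAAAT", 3, 3))   # contains both 'CAT' and 'AAA'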
Now, on to the algorithm. Let's reduce the problem even further so we can visualize how we'd parse and store it: let's look at a simple (5,3)-clump of 3-mers.

Brackets denote our sliding window of width 5 here. We can see that in our window our 3-mers break down to ATA, TAA, and AAA. When we slide the window one to the right, ATA drops out and we gain a second AAA. When we slide the window to the right another time, TAA drops out and we gain a third AAA - and we've found a (5,3)-clump of AAAs.
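In code, on a sequence consistent with that walkthrough (ATAAAAA - the exact string from the figure isn't reproduced here), the three window positions look like this:
seq, k, L = "ATAAAAA", 3, 5
for start in range(len(seq) - L + 1):
    window = seq[start:start + L]
    print(window, [window[i:i + k] for i in range(L - k + 1)])
# ATAAA ['ATA', 'TAA', 'AAA']
# TAAAA ['TAA', 'AAA', 'AAA']
# AAAAA ['AAA', 'AAA', 'AAA']   <- three AAAs: a (5,3)-clump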
Trivial, obviously, but useful for figuring out how we'd treat larger clumps - importantly, we don't discard the entire previous window's data when we shift our window; we just discard the first k-mer and append the new one at the end of our window. The next insight is that we can use a hash-backed structure (in Python, a dict) to count the k-mers within our window. This removes the need to linearly search through our data structure to figure out how many of a particular k-mer are in there.
So together those two requirements - remembering order of insertion, and a hash-backed data structure - mean that we should make a custom class that maintains a list - or better, a deque - of each k-mer in your window, and a dict - or better, a Counter - to keep track of the frequency of each k-mer in your deque. Note that an OrderedDict comes close to doing all of this for you, but not quite; popping off the oldest k-mer would be wrong if its count is greater than 1.
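Before wrapping this in a class, here is a minimal sketch of that bookkeeping for one window-slide, assuming a deque of the k-mers currently in the window and a Counter of their frequencies (the names window, counts, and slide are illustrative, not from the final class):
from collections import Counter, deque

window = deque(["ATA", "TAA", "AAA"])   # k-mers currently inside the window, oldest first
counts = Counter(window)                # frequency of each k-mer inside the window

def slide(new_kmer):
    counts[window.popleft()] -= 1   # the oldest k-mer falls out of the window
    window.append(new_kmer)         # the newest k-mer enters at the other end
    counts[new_kmer] += 1

slide("AAA")
print(counts["AAA"])   # 2 - no need to rescan the whole window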
Another thing that you really should be using to simplify your code here is a proper sliding window iterator.
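The get_clumps function below assumes such an iterator; a plausible version, adapted from the usual itertools sliding-window recipe and joining each window back into a string (the one on github may differ in detail), looks like this:
from collections import deque
from itertools import islice

def sliding_window(seq, n):
    # Yield every length-n substring of seq, advancing one character at a time.
    it = iter(seq)
    window = deque(islice(it, n), maxlen=n)
    if len(window) == n:
        yield ''.join(window)
    for ch in it:
        window.append(ch)
        yield ''.join(window)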
Putting it all together:
def get_clumps(genome, k, L, t):
    kmers = KmerSequence(L-k, t)
    for kmer in sliding_window(genome, k):
        kmers.add(kmer)
    return kmers.clumps
from collections import Counter, deque

class KmerSequence(object):
    __slots__ = ['order', 'counts', 'limit', 'clumps', 't']

    def __init__(self, limit, threshold):
        self.order = deque()     # k-mers currently in the window, oldest first
        self.counts = Counter()  # frequency of each k-mer currently in the window
        self.limit = limit       # L - k, so the window holds at most limit + 1 k-mers
        self.clumps = set()      # every k-mer that has ever reached the threshold
        self.t = threshold

    def add(self, kmer):
        if len(self.order) > self.limit:
            self._remove_oldest()
        self._add_one(kmer)

    def _add_one(self, kmer):
        self.order.append(kmer)
        new_count = self.counts[kmer] + 1
        self.counts[kmer] = new_count
        if new_count == self.t:
            self.clumps.add(kmer)

    def _remove_oldest(self):
        self.counts[self.order.popleft()] -= 1
Usage:
with open(genomefile) as f:
    genome = f.read()
k = 9
L = 500
t = 3
clumps = get_clumps(genome, k, L, t)
As mentioned at the top, the full code - which includes some accessory functions and handling for when the script is run directly as __main__ - is on GitHub here. Feel free to fork, etc.