
I think my code is too inefficient. I'm guessing it has something to do with using strings, though I'm unsure. Here is the code:

genome = FASTAdata[1]
genomeLength = len(genome)

# Hash table holding all the k-mers we will come across
kmers = dict()

# We go through all the possible k-mers by index
for outer in range(0, genomeLength-1):
    for inner in range(outer+2, outer+22):
        substring = genome[outer:inner]
        if substring in kmers:
            # if we already have this substring on record, increase its count of appearances by 1
            kmers[substring] += 1
        else:
            # otherwise record that it's here once
            kmers[substring] = 1

This is meant to search through all substrings of length at most 20. This code seems to run forever and never terminates, so something has to be wrong here. Is using [:] on strings causing the huge overhead? And if so, what can I replace it with?

And for clarity, the file in question is nearly 200 MB, so pretty big.

  • Could you please share a few samples of genome strings from the input file? – nickzam Nov 07 '13 at 12:39
  • It's all from the A,T,C,G nucleotide alphabet. So for instance ATCGTACGCGTTACGTACG etc etc. It's just 191477429 characters long. I've narrowed the code runtime problem down to this chunk for sure. Everything else works. – user2964780 Nov 07 '13 at 12:42
  • If `genomeLength` is ~200M, then you've got ~4,000,000,000 iterations. That's not a regime Python's great in. (PS: I think your code has a bug: `substring` can be less than `inner-outer` characters long at the end, so you can count the same slice of characters multiple times.) – DSM Nov 07 '13 at 12:45
  • How long is "forever"? If you use 1 microsecond per substring (which would be pretty fast, I think) it will take over an hour. – molbdnilo Nov 07 '13 at 12:48
  • Well yes there can be repeats at the end, but that's a negligible fraction of the total. Also this doesn't even finish running if I'm only looking for 2-mers, like AT, GC, etc. Should be only ~400M iterations there? – user2964780 Nov 07 '13 at 12:50
  • Forever as in I have been running it for over an hour. – user2964780 Nov 07 '13 at 12:51
  • It might be helpful to test it with a very short file to see if it yields the wanted results. – glglgl Nov 07 '13 at 12:57
  • Are you running Python 3 or 2? – Henrik Andersson Nov 07 '13 at 13:06
  • It does work on small files. I have tested it on those, it's just this big file that is killing it, and the problem is in this bit of code. – user2964780 Nov 07 '13 at 13:10
  • @user2964780: a negligible fraction of the total in the full problem, but a significant fraction of the total of the small test cases you'd write to check your code. I just ran your code for the 2-mer case (with a smaller string, 191*10**5 characters instead of 191*10**6) and it took 15s. @limelight's question is a good one -- if you're using Python 2, maybe you're wasting time and memory constructing the `range`. – DSM Nov 07 '13 at 13:10
  • I'm running Python 3 on LiClipse – user2964780 Nov 07 '13 at 13:11
  • k-mers are k characters long where k varies up to the maximum size I set. I'm thinking there has to be some online way of doing this so I don't end up going through the string so many times... – user2964780 Nov 07 '13 at 13:12
  • Consider using a `defaultdict(int)` instead of a plain `dict`; then you could do `kmers[substring] += 1` without the `if`s, and it would also avoid the double lookup. – Bakuriu Nov 07 '13 at 13:12
  • But I have to construct the dict in the first place by iterating over the 200M string, in order to create the substrings (k-mers) of length at most k. I see your point about defaultdict though, I didn't know that existed. – user2964780 Nov 07 '13 at 13:15
  • You should change `range` calls to `xrange` calls so you will construct iterators instead of full lists. – nickzam Nov 07 '13 at 13:15
  • Ah, just ran into a memory error after 30 minutes of running the k=200 case. – user2964780 Nov 07 '13 at 13:15
  • @user2964780 What? I mean just replace `kmers = dict()` with `from collections import defaultdict; kmers = defaultdict(int)` then in the inner `for` just do `kmers[genome[outer:inner]] += 1` and you have an equivalent code. – Bakuriu Nov 07 '13 at 13:17
  • Oh I know what you mean, I'm just saying that still leaves me with the iterator problem. It's a good change though, the dict to defaultdict. – user2964780 Nov 07 '13 at 13:18
  • By the way, I believe if you want an efficient implementation you *must* change approach. There is an [other question](http://stackoverflow.com/questions/6402109/counting-the-number-of-substrings) about counting the substrings in a text efficiently. The answers are to use either a suffix tree/array or Rabin-Karp algorithm. You could try to search some packages that implement this data structure/algorithm and try to see if the timings improve with big files (for small files they are probably slower than the naive approach). – Bakuriu Nov 07 '13 at 13:22
  • Also there's no xrange in python 3 so I can't do that – user2964780 Nov 07 '13 at 13:23
  • Ah, reading about Rabin-Karp – user2964780 Nov 07 '13 at 13:24
  • Rabin-Karp seems to be searching for specific substrings. The problem is, I'm basically searching over every possible substring, which is quickly exponential in size if I were to apply Rabin-Karp to that, no? – user2964780 Nov 07 '13 at 13:29
  • What is the general purpose of the code? To count all k-mers that can be generated from genome in file or return top N k-mers by number of occurrences? – nickzam Nov 07 '13 at 14:43
  • xrange in Python 2 is equivalent to range in Python 3, so no need to bother about that. If you really want to collect all possible substrings, I recommend doing it in batches to alleviate the memory problem: one value of `k` at a time, save the result into a file, then continue with the next value of `k` (a sketch of this appears after these comments). As for the running time, your file is large and Python is not good for this kind of CPU-intensive task. You probably want to change to C or C++. It will be *much* faster, believe me. =D – justhalf Nov 08 '13 at 02:45
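Pulling together the `defaultdict` suggestion and the one-`k`-at-a-time batching idea from the comments above, a minimal sketch might look like this; the `min_k`/`max_k` defaults and the `kmers_<k>.txt` output names are illustrative, not anything from the thread:

from collections import defaultdict

def count_kmers_of_length(genome, k):
    # One pass over the genome, counting only the k-mers of a single length.
    counts = defaultdict(int)
    for start in range(len(genome) - k + 1):
        counts[genome[start:start + k]] += 1
    return counts

def count_all_lengths(genome, min_k=2, max_k=20):
    # Process one value of k at a time and write each batch to its own file,
    # so no single dictionary has to hold the k-mers of every length at once.
    for k in range(min_k, max_k + 1):
        counts = count_kmers_of_length(genome, k)
        with open("kmers_%d.txt" % k, "w") as out:
            for kmer, n in counts.items():
                out.write("%s\t%d\n" % (kmer, n))

Each pass only has to hold one length's dictionary in memory at a time; the total slicing work is unchanged, which is the running-time half of the problem the comments point out.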

2 Answers


I would recommend using a dynamic programming algorithm. The problem is that for all inner strings that are not found, you are re-searching them with extra characters appended, and of course those longer strings will not be found either. I do not have a specific algorithm in mind, but this is certainly a case for dynamic programming, where you remember what you have already searched for. As a really crummy example, remember all substrings of length 1, 2, 3, ... that are not found, and never extend those in the next iteration, where the strings are only longer.
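A minimal sketch of one way to act on this pruning idea, assuming a hypothetical `min_count` threshold as the "remember and never extend" criterion (the answer does not pin down a specific rule):

from collections import Counter

def count_with_pruning(genome, max_k=20, min_count=2):
    # Grow substrings one character at a time, and only extend start positions
    # whose current prefix has already been seen at least min_count times.
    counts_by_length = {}
    candidates = range(len(genome))
    for k in range(1, max_k + 1):
        counts = Counter(genome[i:i + k] for i in candidates if i + k <= len(genome))
        counts_by_length[k] = counts
        # Positions whose length-k substring is too rare are never extended
        # to length k + 1.
        candidates = [i for i in candidates
                      if i + k <= len(genome) and counts[genome[i:i + k]] >= min_count]
    return counts_by_length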

Tommy

You should use a memoryview to avoid creating sub-strings: `[:]` on a memoryview returns a "view" instead of a copy. BUT you must use Python 3.3 or higher (before that, memoryviews are not hashable).

Also, a Counter will simplify your code.

from collections import Counter

# slicing a memoryview yields another view onto the same bytes, not a copy
genome = memoryview("abcdefghijkrhirejtvejtijvioecjtiovjitrejabababcd".encode('ascii'))
genomeLength = len(genome)

minlen, maxlen = 2, 22

def fragments():
    # yield every fragment of length minlen up to maxlen-1 as a memoryview slice
    for start in range(0, genomeLength-minlen):
        for finish in range(start+minlen, start+maxlen):
            if finish <= genomeLength:
                yield genome[start:finish]

count = Counter(fragments())
for (mv, n) in count.most_common(3):
    # tobytes() converts the view back to bytes for display
    print(n, mv.tobytes())

produces:

4 b'ab'
3 b'jt'
3 b'ej'

A 1,000,000 byte random array takes 45s on my laptop, but 2,000,000 causes swapping (over 8GB memory use). However, since your fragment size is small, you can easily break the problem up into million-long sub-sequences and then combine results at the end (just be careful about overlaps). That would give a total running time for a 200MB array of ~3 hours, with luck.

PS To be clear, by "combine results at the end" I assume that you only need to save the most popular for each 1M sub-sequence, by, for example, writing them to a file. You cannot keep the counter in memory - that is what is using the 8GB. That's fine if you have fragments that occur many thousands of times, but obviously won't work for smaller numbers (you might see a fragment just once in each of the 200 1M sub-sequences, and so never save it, for example). In other words, results will be lower bounds that are incomplete, particularly at lower frequencies (values are complete only if a fragment is found and recorded in every sub-sequence). If you need an exact result, this is not suitable.
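For example, the chunked counting might look roughly like this; `CHUNK` and `TOP_N` are illustrative values rather than figures from the text above. Each chunk contributes only the start positions that fall inside it, while the slices themselves are taken from the full sequence, so fragments straddling a chunk boundary are neither lost nor double-counted:

from collections import Counter

CHUNK = 1000000          # roughly the 1M sub-sequence size discussed above
TOP_N = 1000             # hypothetical cut-off for "most popular" per chunk
minlen, maxlen = 2, 22

def chunk_fragments(genome, chunk_start, chunk_end):
    # Start positions come only from this chunk; finishes may reach past
    # chunk_end, so fragments that straddle the boundary are still seen.
    for start in range(chunk_start, chunk_end):
        for finish in range(start + minlen, start + maxlen):
            if finish <= len(genome):
                yield genome[start:finish]

def top_fragments_per_chunk(genome):
    results = []
    for chunk_start in range(0, len(genome), CHUNK):
        chunk_end = min(chunk_start + CHUNK, len(genome))
        count = Counter(chunk_fragments(genome, chunk_start, chunk_end))
        # Keep only the most common fragments for this chunk; the full
        # counter is discarded before the next chunk starts.
        results.append(count.most_common(TOP_N))
    return results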

andrew cooke