
This is mostly about finding the fastest way to do it. I have file1, which contains about one million strings (length 6-40), one per line. I want to search for each of them in another file, file2, which contains about 80,000 strings, and count occurrences (if a small string is found multiple times within one big string, its occurrence in that string still counts as 1). If anyone is interested in comparing performance, here is a link to download file1 and file2: dropbox.com/sh/oj62918p83h8kus/sY2WejWmhu?m
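
As a minimal illustration of that counting rule (made-up strings, not the real data):

big_strings = ["AAAA", "BAAB", "CCCC"]  # hypothetical stand-ins for file2 entries
substring = "AA"                        # hypothetical stand-in for a file1 entry

# a substring counts at most once per big string, no matter how often it occurs there
count = sum(1 for s in big_strings if substring in s)
print count  # prints 2 -- "AA" appears twice inside "AAAA", but that string still counts once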

What I am doing now is constructing a dictionary for file2, using the string IDs as keys and the strings as values (because the strings in file2 have duplicate values, only the string ID is unique). My code is:

for line in file1:
   substring=line[:-1].split("\t")
   for ID in dictionary.keys():
       bigstring=dictionary[ID]
       IDlist=[]
       if bigstring.find(substring)!=-1:
           IDlist.append(ID)
   output.write("%s\t%s\n" % (substring,str(len(IDlist))))

My code will take hours to finish. Can anyone suggest a faster way to do it? Both file1 and file2 are only around 50 MB, and my PC has 8 GB of memory; you can use as much memory as you need to make it faster. Any method that can finish within one hour is acceptable :)

Here, after trying some of the suggestions from the comments below, is a performance comparison: first comes the code, then its run time.

Some improvements suggested by Mark Amery and other people:
import sys
from Bio import SeqIO

#First, load the strings in file2 into a dictionary called var_seq
var_seq={}
handle=SeqIO.parse(file2,'fasta')
for record in handle:
    var_seq[record.id]=str(record.seq)

print len(var_seq) #prints 76827, which is the right number; loading file2 into var_seq only takes about 1 second, so this is not the place to focus on for performance

output=open(outputfilename,'w')
icount=0
input1=open(file1,'r')
for line in input1:
    icount+=1
    row=line[:-1].split("\t")
    ensp=row[0]   #ensp is just the peptide ID
    peptide=row[1] #peptide is the substring I want to search for in file2
    num=0
    for ID,bigstring in var_seq.iteritems(): 
        if peptide in bigstring:
            num+=1

    newline="%s\t%s\t%s\n" % (ensp,peptide,str(num))
    output.write(newline)
    if icount%1000==0:  # only test the first 1000 lines for timing
        break

input1.close()
handle.close()
output.close()

It takes 1m4s to finish, about 20s faster than my old version.

The next method was suggested by entropy:

from collections import defaultdict
var_seq=defaultdict(int)
handle=SeqIO.parse(file2,'fasta')
for record in handle:
    var_seq[str(record.seq)]+=1

print len(var_seq) # prints 59502; duplicates are removed, but their occurrence counts are stored as the values
handle.close()

output=open(outputfilename,'w')
icount=0

with open(file1) as fd:
    for line in fd:
        icount+=1
        row=line[:-1].split("\t")
        ensp=row[0]
        peptide=row[1]
        num=0
        for varseq,num_occurrences in var_seq.items():
            if peptide in varseq:
                num+=num_occurrences

    newline="%s\t%s\t%s\n" % (ensp,peptide,str(num))
    output.write(newline)
    if icount%1000==0:
        break

output.close()

This one takes 1m10s, which is not faster as I expected even though it avoids searching duplicates. I don't understand why.

The haystack-and-needle method suggested by Mark Amery turned out to be the fastest. The problem with this method is that the count for every substring comes out as 0, which I don't understand yet.

Here is the code in which I implemented his method:

class Node(object):
    def __init__(self):
        self.words = set()
        self.links = {}

base = Node()

def search_haystack_tree(needle):
    current_node = base
    for char in needle:
        try:
            current_node = current_node.links[char]
        except KeyError:
            return 0
    return len(current_node.words)

input1=open(file1,'r')
needles={}
for line in input1:
    row=line[:-1].split("\t")
    needles[row[1]]=row[0]

print len(needles)

handle=SeqIO.parse(file2,'fasta')
haystacks={}
for record in handle:
    haystacks[record.id]=str(record.seq)

print len(haystacks)

for haystack_id, haystack in haystacks.iteritems(): #should be the same as enumerate(list)
    for i in xrange(len(haystack)):
        current_node = base
        for char in haystack[i:]:
            current_node = current_node.links.setdefault(char, Node())
            current_node.words.add(haystack_id)

icount=0
output=open(outputfilename,'w')
for needle in needles:
    icount+=1
    count = search_haystack_tree(needle)
    newline="%s\t%s\t%s\n" % (needles[needle],needle,str(count))
    output.write(newline)
    if icount%1000==0:
        break

input1.close()
handle.close()
output.close()

It takes only 0m11s to finish, which is much faster than the other methods. However, I don't know whether it is a mistake of mine that makes every count come out as 0, or whether there is a flaw in Mark's method.
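
One way to narrow down where the zero counts come from would be to compare the tree's answers against a brute-force scan for a handful of peptides. A minimal sketch, reusing the needles and haystacks dicts built above (checking 5 peptides is an arbitrary choice):

import itertools

# Compare the tree lookup against a plain substring scan for a few needles.
# If brute > 0 while tree == 0, the problem is in how the tree was built.
for needle in itertools.islice(needles, 5):
    brute = sum(1 for seq in haystacks.itervalues() if needle in seq)
    tree = search_haystack_tree(needle)
    print "%s\tbrute=%d\ttree=%d" % (needle, brute, tree)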

xiaozhu123
  • Without discussing how the algorithm could be enhanced, you might get a performance boost with PyPy http://pypy.org/ – Mikko Ohtamaa Mar 03 '13 at 20:42
  • 1
    Also use "in" operator instead of string.find() - this is not Javascript or PHP – Mikko Ohtamaa Mar 03 '13 at 20:43
  • Increase the buffer size parameter given to open() – Mikko Ohtamaa Mar 03 '13 at 20:45
  • 1
    I can see a few minor things I'd change right off the bat: 1) Do `for ID, bigstring in dictionary.iteritems()` instead of doing `bigstring=dictionary[ID]`, 2) Just keep a count of matches instead of an IDList, since you're only using its length anyway; i.e. replace `IDlist=[]` with `count=0` and `IDlist.append(ID)` with `count+=1`, 3) Use `substring in bigstring` instead of using `.find`. These aren't going to be huge performance boosters, though. The real opportunity for improvement here, I'd guess, will come from not iterating over every string in file2 a million times - but that's tricky. – Mark Amery Mar 03 '13 at 20:52
  • 3
    The smart solution is to build a [suffix tree](http://en.wikipedia.org/wiki/Suffix_tree) for the contents of file2... Try to search for some suffix tree implementation, for example [here](http://stackoverflow.com/questions/8996137/suffix-tree-implementation-in-python) there are some links. This reduces the complexity from O(m*n) to O(n+m) where `n` and `m` is the number of records in file1 and file2(assuming constant size for the items in the files). – Bakuriu Mar 03 '13 at 20:53
  • I wonder if using a regex to simultaneously look for any of the million strings from file 1 would yield a performance gain? – Mark Amery Mar 03 '13 at 20:54
  • Thanks for your answer! I tried "in" instead, which was 14s faster for a 1000-string search. But it still takes 1m21s; before, it was 1m35s for 1000 strings. I will try PyPy later. – xiaozhu123 Mar 03 '13 at 20:54
  • @xiaozhu123 I did not know you wanted duplicates in `list2`. I'll amend my code to handle duplicates without the performance hit of searching them multiple times – entropy Mar 04 '13 at 10:41
  • @entropy Thanks for your help. The reason I need to keep duplicates in list2 is that I want to know how many times the substrings show up in the big strings in list2. If duplicates are removed, a count of one may not mean it only shows up once. – xiaozhu123 Mar 04 '13 at 10:53
  • @xiaozhu123 I've amended my code to keep track of duplicates – entropy Mar 04 '13 at 10:56

2 Answers


Your code doesn't seem like it works (are you sure you didn't just quote it from memory instead of pasting the actual code?).

For example, this line:

substring=line[:-1].split("\t")

will cause substring to be a list. But later you do:

if bigstring.find(substring)!=-1:

That will raise an error, because you can't call str.find() with a list.

In any case, you are building lists uselessly in your innermost loop. This:

IDlist=[]
if bigstring.find(substring)!=-1:
     IDlist.append(ID)

 #later
 len(IDlist)

That needlessly allocates and frees lists, causing memory churn and bogging everything down.

This is code that should work and uses more efficient means to do the counting:

from collections import defaultdict

dictionary = defaultdict(int)
with open(file2) as fd:
    for line in fd:
        for s in line.split("\t"):
            dictionary[s.strip()] += 1

with open(file1) as fd:
    for line in fd:
        for substring in line.split('\t'):
            count = 0
            for bigstring,num_occurrences in dictionary.items():
                if substring in bigstring:
                    count += num_occurrences
            print substring, count

PS: I am assuming that you have multiple words per line that are tab-split because you do line.split("\t") at some point. If that is wrong, it should be easy to revise the code.
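
As a side note, if only the last tab-separated field of each file1 line is the substring of interest (as the asker clarifies in the comments below), the search loop could be adjusted as in this sketch, which assumes the defaultdict built above:

with open(file1) as fd:
    for line in fd:
        # take only the last tab-separated field as the substring to look up
        substring = line.rstrip("\n").split("\t")[-1]
        count = 0
        for bigstring, num_occurrences in dictionary.items():
            if substring in bigstring:
                count += num_occurrences
        print substring, count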

PPS: If this ends up being too slow for your use (you'd have to try it, but my guess is it should run in ~10 min given the number of strings you said you have), you'll have to use suffix trees, as one of the comments suggested.

Edit: Amended the code so that it handles multiple occurrences of the same string in file2 without negatively affecting performance

Edit 2: Trading maximum space for time.

Below is code that will consume quite a bit of memory and take a while to build the dictionary. However, once that's done, each of the million searches should complete in the time it takes for a single hashtable lookup, i.e. O(1).

Note that I have added some statements to log the time each step of the process takes. You should keep those so you know how much of the time goes into building versus searching. Since you are testing with only 1000 strings, this matters a lot: if 90% of the cost is build time rather than search time, then when you run over the full 1M strings you will still pay that cost only once, so it won't matter.

Also note that I have amended my code to parse file1 and file2 as you do, so you should be able to just plug this in and test it:

from Bio import SeqIO
from collections import defaultdict
from datetime import datetime

def all_substrings(s):
    result = set()
    for length in range(1,len(s)+1):
        for offset in range(len(s)-length+1):
            result.add(s[offset:offset+length])
    return result

print "Building dictionary...."
build_start = datetime.now()

dictionary = defaultdict(int)
handle = SeqIO.parse(file2, 'fasta')
for record in handle:
    for sub in all_substrings(str(record.seq).strip()):
        dictionary[sub] += 1
build_end = datetime.now()
print "Dictionary built in: %gs" % (build-end-build_start).total_seconds()

print "Searching...\n"
search_start = datetime.now()

with open(file1) as fd:
    for line in fd:
        substring = line.strip().split("\t")[1]
        count = dictionary[substring]
        print substring, count

search_end = datetime.now()
print "Search done in: %gs" % (search_end-search_start).total_seconds()
entropy
  • Your revised code mentions `bigstring` but does not define `bigstring`. Did you mean `if substring in dictionary` instead? – unutbu Mar 03 '13 at 21:13
  • entropy, thanks for your answer! You are right, I have multiple words per line; it should be something like substring=line[:-1].split("\t")[-1]. But I can't use bigstring as the key in the dictionary, because bigstring is not unique; only the ID of the bigstring is unique. – xiaozhu123 Mar 03 '13 at 21:20
  • @unutbu there's a line that goes `for bigstring in dictionary` so yes it is defined – entropy Mar 03 '13 at 22:27
  • @xiaozhu123 my `dictionary` is different from yours. Mine is simply a set of all words in `file1`. It is built in the first 5 lines of my sample code. It would appear that you have a different way of building it. Also `substring = line[:-1].split("\t")[-1]` would only select the last word on the line. Is that correct? Is only the last word significant? Or are they all? My code assumes that they are all significant; if that's not the case I can revise it. What would be really helpful is for you to include all your code, i.e. the code you state takes hours to run, not just a sample. – entropy Mar 03 '13 at 22:42
  • @entropy: I just posted file1 and file2 through a link. In each line of file1, after splitting by tab, the last element is the substring I want to search for in file2. From your code, it seems it is searching for strings from file2 in file1. Anyway, I will paste my code tomorrow; I already closed my laptop and went to bed. This is a reply from my cellphone :) – xiaozhu123 Mar 03 '13 at 23:19
  • @xiaozhu123 I don't see a link anywhere. You should edit your question to add it there. You are correct that I'm searching the wrong way around; I've edited the code to fix that. – entropy Mar 03 '13 at 23:50
  • @entropy I tested your code and put the results above, you can have a look if you are interested in the performance. – xiaozhu123 Mar 04 '13 at 09:22
  • @entropy I changed the code as you amended it, which I thought should run faster since it avoids searching duplicates, but it actually takes 1m10s to finish 1000 searches. Weird! – xiaozhu123 Mar 04 '13 at 11:11
  • @xiaozhu123 that's probably because of an increased dictionary build time not increased search time. In any case, I have posted some code which should be faster than my previous one by orders of magnitude. Note that on 1000 strings it will probably be slower, but on 1M it will be way way faster. I've added timing logging so you know what's a one time cost(dictionary build) and what's the variable cost as you have more strings(search time). My guess is that the search time will be tiny and that if build time runs in ~1m you'll be able to search your entire 1M strings in ~2m instead of hours – entropy Mar 04 '13 at 11:27
  • @entropy, I tried your new method, which couldn't finish building the dictionary within 40 minutes. I stopped the process because I think it needs more memory than my laptop has (it had already taken 5G, with the other 3G occupied by other programs). Looking at your new code again, if I understand correctly, you build a dictionary that stores all possible substrings sliced from the big strings in file2, while the big strings in file2 can be 5000 characters long or more. I guess the number of all possible substrings is too big and needs much more memory than my laptop has. – xiaozhu123 Mar 04 '13 at 13:11
  • @xiaozhu123 ah, you didn't mention that in your question. I assumed all strings were 3~40 but you said that was just for file1 and didn't specify for file2. I agree that in this case it would be way too expensive to build the dictionary and you should not do it. – entropy Mar 04 '13 at 13:16
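
A possible way to rein in the memory blow-up discussed in the last two comments (untested against these files, so treat it as a sketch): since the question says the file1 peptides are only 6-40 characters long, only substrings up to length 40 can ever be looked up, so the substring generation can be bounded, giving roughly 40 * len(seq) entries per sequence instead of len(seq)**2 / 2:

# Bounded-length variant of all_substrings; max_len=40 is an assumption taken
# from the stated 6-40 length range of the file1 peptides.
def bounded_substrings(s, max_len=40):
    result = set()
    n = len(s)
    for offset in xrange(n):
        # substrings longer than max_len can never match a needle, so skip them
        for length in xrange(1, min(max_len, n - offset) + 1):
            result.add(s[offset:offset + length])
    return result

Even with this bound, the dictionary over all 80,000 sequences may still be large, so this is a sketch of the idea rather than a guaranteed fit for 8 GB.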

I'm not an algorithms whiz, but I reckon this should give you a healthy performance boost. You need to set 'haystacks' to be a list of the big words you want to look in, and 'needles' to be a list of the substrings you're looking for (either can contain duplicates), which I'll let you implement on your end. It'd be great if you could post your list of needles and list of haystacks so that we can easily compare performance of proposed solutions.

haystacks = <some list of words>
needles = <some list of words>

class Node(object):
    def __init__(self):
        self.words = set()
        self.links = {}

base = Node()

for haystack_id, haystack in enumerate(haystacks):
    for i in xrange(len(haystack)):
        current_node = base
        for char in haystack[i:]:
            current_node = current_node.links.setdefault(char, Node())
            current_node.words.add(haystack_id)

def search_haystack_tree(needle):
    current_node = base
    for char in needle:
        try:
            current_node = current_node.links[char]
        except KeyError:
            return 0
    return len(current_node.words)

for needle in needles:
    count = search_haystack_tree(needle)
    print "Count for %s: %i" % (needle, count)

You can probably figure out what's going on by looking at the code, but just to put it in words: I construct a huge tree of substrings of the haystack words, such that given any needle, you can navigate the tree character by character and end up at a node which has attached to it the set of all haystack ids of haystacks containing that substring. Then for each needle we just go through the tree and count the size of the set at the end.
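
As a toy illustration (hypothetical data, assuming the code above with the two placeholder lists filled in):

haystacks = ["GATTACA", "TACAGA"]  # hypothetical big strings
needles = ["TACA", "GGG"]          # hypothetical substrings to look up

# With these lists plugged into the code above, the final loop prints:
#   Count for TACA: 2   ("TACA" occurs in both haystacks)
#   Count for GGG: 0    ("GGG" occurs in neither)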

Mark Amery
  • I should mention as an aside that the term 'suffix tree' used by Bakuriu gave me the idea to approach the problem this way, but I couldn't really understand the Wikipedia article he linked to so I haven't got the slightest idea whether this solution uses a suffix tree or not. :) – Mark Amery Mar 03 '13 at 22:16
  • Thanks! I really appreciate the efforts all of you have made. I don't know how to upload files to Stack Overflow, so I created a public folder in Dropbox; you should be able to download them through this link: https://www.dropbox.com/sh/oj62918p83h8kus/sY2WejWmhu?m – xiaozhu123 Mar 03 '13 at 22:40
  • It is midnight in my timezone now; I will try your "haystacks and needles" method tomorrow morning. Thanks again! – xiaozhu123 Mar 03 '13 at 22:44
  • I tested your code and it ran really fast; you can see the comparison above. But the problem is that the counting result is not right. – xiaozhu123 Mar 04 '13 at 09:57
  • @xiaozhu123 I don't have time to look thoroughly right now (will do after work), but at a glance it looks like you've made 'haystacks' and 'needles' be dictionaries, where my code expects them to be lists (which are free to contain duplicates). – Mark Amery Mar 04 '13 at 10:10