This is really a question about finding the fastest way to do this. I have a file1 which contains about one million strings (length 6-40), one per line. I want to search for each of them in another file2, which contains about 80,000 strings, and count occurrences (if a short string is found multiple times inside one long string, its occurrence still counts as 1). If anyone is interested in comparing performance, here is a link to download file1 and file2: dropbox.com/sh/oj62918p83h8kus/sY2WejWmhu?m
What I am doing now is building a dictionary for file2, using string IDs as keys and strings as values (strings in file2 have duplicate values, so only the string IDs are unique). My code is:
for line in file1:
    substring = line[:-1].split("\t")[1]  # the peptide is the second column
    IDlist = []
    for ID in dictionary.keys():
        bigstring = dictionary[ID]
        if bigstring.find(substring) != -1:
            IDlist.append(ID)
    output.write("%s\t%s\n" % (substring, str(len(IDlist))))
My code takes hours to finish. Can anyone suggest a faster way to do it? Note that the naive approach above does roughly 10^6 × 8×10^4 = 8×10^10 substring searches, which is why it is so slow. Both file1 and file2 are only around 50MB, and my PC has 8GB of memory; you can use as much memory as you need to make it faster. Any method that can finish within one hour is acceptable :)
After trying some of the suggestions from the comments below, here is a performance comparison: first comes the code, then its run time.
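For reference, the code below assumes file1 is tab-separated, with the peptide ID in the first column and the peptide itself in the second, and that file2 is FASTA. These lines are made-up examples just to illustrate the layout, not real data:

file1:
ENSP0000001	ACDEFG
ENSP0000002	HIKLMNPQ

file2:
>VAR000001
MACDEFGHIKLMNPQRSTVWY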
Some improvements suggested by Mark Amery and other people:
import sys
from Bio import SeqIO

# First, load the strings in file2 into a dictionary called var_seq
var_seq = {}
handle = SeqIO.parse(file2, 'fasta')
for record in handle:
    var_seq[record.id] = str(record.seq)
print len(var_seq)  # prints 76827, which is the right number; loading file2
                    # takes about 1 second, so don't focus here for performance

output = open(outputfilename, 'w')
icount = 0
input1 = open(file1, 'r')
for line in input1:
    icount += 1
    row = line[:-1].split("\t")
    ensp = row[0]     # ensp is just the peptide ID
    peptide = row[1]  # the substring I want to search for in file2
    num = 0
    for ID, bigstring in var_seq.iteritems():
        if peptide in bigstring:
            num += 1
    newline = "%s\t%s\t%s\n" % (ensp, peptide, str(num))
    output.write(newline)
    if icount % 1000 == 0:  # time only the first 1000 peptides
        break
input1.close()
handle.close()
output.close()
This takes 1m4s to finish, about 20s faster than my old version.
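Most of the gain here presumably comes from replacing .find() != -1 with the in operator and from iterating with iteritems(). Here is a quick timeit sketch on made-up strings, if anyone wants to measure the in-vs-find difference in isolation:

import timeit
setup = "hay = 'ACDEFGHIKLMNPQRSTVWY' * 500; needle = 'WXYZWXYZ'"
print timeit.timeit("needle in hay", setup=setup, number=10000)
print timeit.timeit("hay.find(needle) != -1", setup=setup, number=10000)

in avoids a method lookup and a comparison on every call, so it should come out slightly ahead.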
####### NEXT METHOD suggested by entropy
from collections import defaultdict
var_seq = defaultdict(int)
handle = SeqIO.parse(file2, 'fasta')
for record in handle:
    var_seq[str(record.seq)] += 1
print len(var_seq)  # prints 59502: duplicates are removed, and the number of
                    # occurrences of each duplicate is stored as the value
handle.close()

output = open(outputfilename, 'w')
icount = 0
with open(file1) as fd:
    for line in fd:
        icount += 1
        row = line[:-1].split("\t")
        ensp = row[0]
        peptide = row[1]
        num = 0
        # note: in Python 2, .items() builds a fresh list on every pass;
        # iteritems() would avoid that
        for varseq, num_occurrences in var_seq.items():
            if peptide in varseq:
                num += num_occurrences
        newline = "%s\t%s\t%s\n" % (ensp, peptide, str(num))
        output.write(newline)
        if icount % 1000 == 0:  # again, time only the first 1000 peptides
            break
output.close()
This one takes 1m10s. It is not faster, as I expected it to be since it avoids searching duplicates; I don't understand why.
Haystack-and-needle method suggested by Mark Amery, which turned out to be the fastest. The problem with this method is that the counting result for all substrings is 0, which I don't understand yet. Here is the code in which I implemented his method:
class Node(object):
    def __init__(self):
        self.words = set()
        self.links = {}

base = Node()

def search_haystack_tree(needle):
    current_node = base
    for char in needle:
        try:
            current_node = current_node.links[char]
        except KeyError:
            return 0
    return len(current_node.words)
input1 = open(file1, 'r')
needles = {}
for line in input1:
    row = line[:-1].split("\t")
    needles[row[1]] = row[0]
print len(needles)

handle = SeqIO.parse(file2, 'fasta')
haystacks = {}
for record in handle:
    haystacks[record.id] = str(record.seq)
print len(haystacks)

# Build a suffix trie: insert every suffix of every haystack, and record at
# each node the IDs of the haystacks whose suffixes pass through it
for haystack_id, haystack in haystacks.iteritems():
    for i in xrange(len(haystack)):
        current_node = base
        for char in haystack[i:]:
            current_node = current_node.links.setdefault(char, Node())
            current_node.words.add(haystack_id)

icount = 0
output = open(outputfilename, 'w')
for needle in needles:
    icount += 1
    count = search_haystack_tree(needle)
    newline = "%s\t%s\t%s\n" % (needles[needle], needle, str(count))
    output.write(newline)
    if icount % 1000 == 0:
        break
input1.close()
handle.close()
output.close()
It takes only 0m11s to finish, which is much faster than the other methods. However, I don't know whether it is my mistake that makes every count come out as 0, or whether there is a flaw in Mark's method.
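To help narrow that down, here is a minimal, self-contained sketch of the same suffix-trie idea on toy data (the two sequences and their IDs are made up). If it prints 2, 1, 0, the method itself works and the bug is somewhere in my input handling:

class Node(object):
    def __init__(self):
        self.words = set()
        self.links = {}

base = Node()
toy_haystacks = {'h1': 'ABCDE', 'h2': 'XBCDY'}  # made-up sequences

# insert every suffix of every toy haystack into the trie
for haystack_id, haystack in toy_haystacks.iteritems():
    for i in xrange(len(haystack)):
        current_node = base
        for char in haystack[i:]:
            current_node = current_node.links.setdefault(char, Node())
            current_node.words.add(haystack_id)

def count(needle):
    node = base
    for char in needle:
        try:
            node = node.links[char]
        except KeyError:
            return 0
    return len(node.words)

print count('BCD')  # 'BCD' occurs in both haystacks -> expect 2
print count('ABC')  # only in 'ABCDE'                -> expect 1
print count('ZZZ')  # in neither                     -> expect 0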