I am looking for suggestions on how to speed up the process described below, which involves a fuzzy regex search.
What I am trying to do
I am fuzzy searching for keywords, stored in a dictionary d (example just below; each value is always a list of two sequences, and I need to keep track of which of the two was found, if any), in a set of strings, stored in a file testFile (one string per line, ~150 characters each), allowing at most 4 mismatches.
d = {"kw1": ["AGCTCGATGTATGGGTATATGATCTTGAC", "GTCAAGATCATATACCCATACATCGAGCT"], "kw2": ["GGTCAGGTCAGTACGGTACGATCGATTTCGA", "TCGAAATCGATCGTACCGTACTGACCTGACC"]} #simplified to just two keywords
How I do it
For this, I first compile my regexes and store them in a dictionary compd. I then read the file line by line and search for each keyword in each line (string). I cannot stop the search once a keyword has been found, since multiple keywords may occur in one string/line, but I can skip the second element of a keyword's list if the first is found.
Here is how I am doing it:
#!/usr/bin/env python3
import argparse
import regex

parser = argparse.ArgumentParser()
parser.add_argument('file', help='file with strings')
args = parser.parse_args()

# dictionary with keywords
d = {"kw1": ["AGCTCGATGTATGGGTATATGATCTTGAC", "GTCAAGATCATATACCCATACATCGAGCT"],
     "kw2": ["GGTCAGGTCAGTACGGTACGATCGATTTCGA", "TCGAAATCGATCGTACCGTACTGACCTGACC"]}

# compile regexes (4 mismatches max)
compd = {"kw1": [], "kw2": []}  # to store compiled regexes
for k, v in d.items():  # for each keyword
    compd[k].append(regex.compile(r'(?b)(' + v[0] + r'){s<=4}'))  # compile 1st elt of list
    compd[k].append(regex.compile(r'(?b)(' + v[1] + r'){s<=4}'))  # compile 2nd elt

# search keywords
with open(args.file) as f:  # open file with strings
    line = f.readline()  # first line/string
    while line:  # go through each line
        for k, v in compd.items():  # for each keyword (ID, regexes)
            for val in (v[0], v[1]):  # for each elt of list
                found = val.search(line)  # fuzzy regex search
                if found is not None:  # if match
                    print("Keyword " + k + " found as " + found[0])  # print match
                    if val == v[0]:  # if 1st elt of list matched
                        break  # don't search the 2nd
        line = f.readline()  # next line
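For reference, here is a minimal, self-contained illustration of the fuzzy syntax used in the script ({s<=N} bounds the number of substitutions, (?b) asks for the best match); the short pattern and input string below are made up just for the demo:

import regex

# {s<=1} allows at most one substitution; (?b) requests the best match
m = regex.search(r'(?b)(AGCT){s<=1}', 'xxAGTTxx')
print(m[0])            # AGTT (one substitution away from AGCT)
print(m.fuzzy_counts)  # (substitutions, insertions, deletions) -> (1, 0, 0)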
I have tested the script on the following testFile:
AGCTCGATGTATGGGTATATGATCTTGACAGAGAGA
GTCGTAGCTCGTATTCGATGGCTATTCGCTATATGCTAGCTAT
and get the following expected result:
Keyword kw1 found as AGCTCGATGTATGGGTATATGATCTTGAC
Efficiency
With the current script, it takes about 3-4 minutes to process 500k strings with six keywords. There will be cases with 2 million strings, which should scale linearly to roughly 12-16 minutes, and I would like to reduce this if possible.
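One direction I have considered but not benchmarked: compiling a single pattern per keyword with both list elements in one alternation, so every line needs one fuzzy search per keyword instead of two. This is only a sketch reusing d and args from the script above; the group names fwd and rev are hypothetical labels to keep track of which element matched:

import regex

# untested sketch: one compiled pattern per keyword, both variants in one alternation
combined = {
    k: regex.compile(r'(?b)(?:(?P<fwd>' + v[0] + r')|(?P<rev>' + v[1] + r')){s<=4}')
    for k, v in d.items()
}

with open(args.file) as f:
    for line in f:
        for k, pat in combined.items():
            found = pat.search(line)
            if found is not None:  # report keyword, matched text, and which element
                which = 'fwd' if found.group('fwd') is not None else 'rev'
                print("Keyword " + k + " found as " + found[0] + " (" + which + ")")

Whether this actually helps would need measuring on real data.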