I am looking for suggestions on how to speed up the process described below, which involves a fuzzy regex search.

What I am trying to do

I am fuzzy searching (up to 4 mismatches) for keywords, stored in a dictionary d (example just below; each value is always a list of two sequences, and I need to keep track of which of the two was found, if any), in a set of strings stored in a file testFile (one string per line, ~150 characters each).

d = {"kw1": ["AGCTCGATGTATGGGTATATGATCTTGAC", "GTCAAGATCATATACCCATACATCGAGCT"], "kw2": ["GGTCAGGTCAGTACGGTACGATCGATTTCGA", "TCGAAATCGATCGTACCGTACTGACCTGACC"]} #simplified to just two keywords

How I do it

For this, I first compile my regexes and store them in a dictionary compd. I then read the file line by line and search for each keyword in each line (string). I cannot stop the search once a keyword has been found, as multiple keywords may occur in one string/line, but I can skip the second element of a keyword's list if the first is found.

Here is how I am doing it:

#!/usr/bin/env python3

import argparse
import regex

parser = argparse.ArgumentParser()
parser.add_argument('file', help='file with strings')
args = parser.parse_args()

#dictionary with keywords
d = {"kw1": ["AGCTCGATGTATGGGTATATGATCTTGAC", "GTCAAGATCATATACCCATACATCGAGCT"],"kw2": ["GGTCAGGTCAGTACGGTACGATCGATTTCGA", "TCGAAATCGATCGTACCGTACTGACCTGACC"]}

#Compile regex (4 mismatches max)
compd = {"kw1": [], "kw2": []}  #to store regex
for k, v in d.items():  #for each keyword
    compd[k].append(regex.compile(r'(?b)(' + v[0] + '){s<=4}')) #compile 1st elt of list 
    compd[k].append(regex.compile(r'(?b)(' + v[1] + '){s<=4}')) #compile second

#Search keywords
with open(args.file) as f:  #open file with strings
    line = f.readline()  #first line/string
    while line:  #go through each line
        for k, v in compd.items():  #for each keyword (ID, regex)
            for val in [v[0], v[1]]:  #for each elt of list
                found = val.search(line)  #regex search
                if found is not None:  #if match
                    print("Keyword " + k + " found as " + found[0]) #print match
                    if val == v[0]:   #if 1st elt of list
                        break   #don't search 2nd
        line = f.readline() #next line
    

I have tested the script using the testFile:

AGCTCGATGTATGGGTATATGATCTTGACAGAGAGA
GTCGTAGCTCGTATTCGATGGCTATTCGCTATATGCTAGCTAT

and get the following expected result:

Keyword kw1 found as AGCTCGATGTATGGGTATATGATCTTGAC

Efficiency

With the current script, it takes about 3-4 minutes to process 500k strings with six keywords. There will be cases where I have 2 million strings, which should take 12-16 minutes, and I would like to reduce this if possible.

Agathe
  • Maybe have a look at bitwise operations, I guess. Also, what happens in the case kw1: [a, b] and you find b? Do you skip a and still mark kw1 as found? Or is it always only the first in the list that will be found? – Jason Chia Sep 28 '22 at 14:50
  • @JasonChia a or b can be found but it cannot be both. I will always start by looking for a (have to start somewhere) and, if not found, will look for b. If b is found, kw1 should be marked as found. Will take a look at bitwise operations, thanks – Agathe Sep 28 '22 at 15:05
  • @JasonChia Bitwise operations, you mean for example grouping a and b for each keyword so that I do only one search per keyword? I could, but I am wondering how to keep track of whether it is a or b that was found. – Agathe Sep 28 '22 at 15:15

1 Answer


Having a separate regex for each keyword requires running a match against each regex separately. Instead, combine all the regexes into one using the keywords as names for named groups:

patterns = []
for k, v in d.items():  #for each keyword
    patterns.append(f'(?P<{k}>{v[0]}|{v[1]})')
pattern = '(?b)(?:' + '|'.join(patterns) + '){s<=4}'
reSeqs = regex.compile(pattern)
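
For the two example keywords from the question (and given Python's insertion-ordered dictionaries), the combined pattern string comes out as:

(?b)(?:(?P<kw1>AGCTCGATGTATGGGTATATGATCTTGAC|GTCAAGATCATATACCCATACATCGAGCT)|(?P<kw2>GGTCAGGTCAGTACGGTACGATCGATTTCGA|TCGAAATCGATCGTACCGTACTGACCTGACC)){s<=4}

Each keyword becomes one named alternative, so a single search per line covers all keywords and both sequence variants at once.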

With this, the program can check which named group matched in order to recover the keyword. You can replace the loop over all the regexes in compd with a loop over the matches in a line (in case there is more than one match) and, for each match, a pass over its group dictionary (which could be implemented as a comprehension):

for matched in reSeqs.finditer(line):
    try:
        keyword = [kw for kw, val in matched.groupdict().items() if val][0]
        # perform further processing of keyword
    except IndexError: # no named group captured
        pass

(Note that you don't need to call readline on a file object to loop over lines; instead, you can loop over the file object directly: for line in f:.)
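
Putting the pieces together, a minimal per-line driver could look like the sketch below (it reuses reSeqs and args from above; the print is only illustrative, mirroring the question's output):

#sketch: one search per line with the combined regex
with open(args.file) as f:
    for line in f:  #iterate the file object directly
        for matched in reSeqs.finditer(line):  #every fuzzy match in the line
            for kw, val in matched.groupdict().items():
                if val:  #the named group that captured is the keyword that matched
                    print("Keyword " + kw + " found as " + val)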

If you need further optimizations, have memory to burn and can sacrifice a little readability, also test whether replacing the loop over lines with a comprehension over matches is more performant:

with open(args.file) as f:
    contents = f.read() # slurp entire file
    matches = [{
          val:kw for kw, val in found.groupdict().items() if val
        } for found in reSeqs.finditer(contents)
      ]

This solution doesn't distinguish between repetitions of a given sequence; in particular, repetitions on a single line are lost. You could merge entries having the same keys into lists, or, if repetitions should be treated as a single instance, you can merge the dictionaries as-is. If you want to distinguish separate instances of a matched sequence, include file position information in the keys:

matches = [{
      (val, found.span()):kw for kw, val in found.groupdict().items() if val
    } for found in reSeqs.finditer(contents)
  ]

To merge:

results = {}
for match in matches:
    results.update(match)
# or:
results = {k:v for d in matches for k,v in d.items()}
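
Alternatively, for the "merge entries having the same keys into lists" option mentioned above, a defaultdict can accumulate every occurrence (a sketch assuming the value-keyed form of matches, i.e. keys are the matched sequences):

from collections import defaultdict

merged = defaultdict(list)  #matched sequence -> keywords, one entry per occurrence
for match in matches:
    for val, kw in match.items():
        merged[val].append(kw)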

If memory is an issue, another option would be to break up the file into chunks ending on line breaks (either line-based, or by reading blocks and separating partial lines at block ends) and use finditer on each chunk:

# implementation of `chunks` left as exercise
def file_chunks(path, size=2**12):
    with open(path) as file:
        yield from chunks(file, size=size)

results = {}
for block in file_chunks(args.file, size=2**20):
    for found in reSeqs.finditer(block):
        results.update({
          (val, found.span()):kw for kw, val in found.groupdict().items() if val
        })
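
The chunks helper is deliberately left as an exercise; one possible line-based implementation (my sketch, not part of the original answer) accumulates whole lines until roughly size characters, so a match never straddles a chunk boundary:

def chunks(file, size=2**12):
    #yield blocks of whole lines totalling roughly `size` characters each
    buffer, buffered = [], 0
    for line in file:
        buffer.append(line)
        buffered += len(line)
        if buffered >= size:
            yield ''.join(buffer)
            buffer, buffered = [], 0
    if buffer:  #final partial block
        yield ''.join(buffer)
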
outis
  • Thanks for the detailed answer! I prefer using a line-by-line approach but getting one regex instead of 12 reduced the time by a minute (from 50s to 35s per 100k lines). I had to change the name of my keywords because there were dashes in them (didn't say in post) and could not use walrus operator := because I am on Python 3.7 but all else was fine! – Agathe Sep 29 '22 at 15:06
  • @Agathe: I had missed the requirement to handle multiple matches in a single line; answer updated. Though it's not much of a change, it did remove the use of the walrus. As for the keyword format, it's probably better that's not in the question, as addressing it is a separate issue (and easily accomplished, which you did). I highly recommend [profiling](https://docs.python.org/3/library/profile.html) a few different implementations. – outis Sep 30 '22 at 18:18