0

I am trying to match many regexes against one long string, counting delimiters at every match. I am using multiprocessing to search for many regexes concurrently:

with open('many_regex', 'r') as f:
    sch = f.readlines()

with open('big_string', 'r') as f:
    text = f.read()

import re
def search_sch(sch, text=text):
    # Cumulative count of '##' delimiters seen before each match.
    delim_index = []
    last_found = 0
    for match in re.finditer(sch, text):
        # Delimiters between the end of the previous match and this one.
        count_delims = len(re.findall('##', text[last_found:match.start()]))
        if delim_index:
            count_delims += delim_index[-1]  # add the running total
        delim_index.append(count_delims)
        last_found = match.end()
    return delim_index

from multiprocessing.dummy import Pool

with Pool(8) as threadpool:
    matches = threadpool.map(search_sch, sch[:100])

The threadpool.map call takes about 25s to execute, with only a single CPU core being utilised. Any idea why more cores are not being used? Also, is there a Python library to do this faster?

bob
    Maybe related: https://stackoverflow.com/questions/26432411/multiprocessing-dummy-in-python-is-not-utilising-100-cpu/26432431 – Pedro von Hertwig Batista Dec 01 '17 at 12:20
  • This document suggests that a thread pool is only useful for IO-bound operations, since the GIL prevents two threads from actually processing simultaneously: http://lucasb.eyer.be/snips/python-thread-pool.html . Perhaps there is a way to restructure with multiprocessing pools? – Phil Dec 01 '17 at 12:37
  • What's the structure of `sch`? Is it just a single long regex with alternations of static strings like `alice|bob|charlie|...` or something more-complex? In the more-complex case, could that involve backtracking? – tripleee Dec 01 '17 at 12:51
  • The regexes in `sch` are complex and matching them would probably involve backtracking, but I am treating each regex independently, so I fail to see why it is relevant. – bob Dec 01 '17 at 14:51

1 Answer

0

The Pool class from multiprocessing.dummy uses threading instead of multiprocessing. This means that the global interpreter lock is an issue. You want to use actual multiprocessing; for that, replace

from multiprocessing.dummy import Pool

with

from multiprocessing import Pool
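
One important detail: on Windows, multiprocessing starts workers by spawning fresh interpreter processes that re-import the main module, so the worker function must live at module level and the Pool must be created under an `if __name__ == '__main__':` guard. Here is a minimal sketch of the process-based version, assuming the same 'many_regex' and 'big_string' files as in the question:

import re
from multiprocessing import Pool

# Module-level setup: each spawned worker re-imports this file, so
# every process gets its own copy of the patterns and the text.
with open('many_regex', 'r') as f:
    sch = f.readlines()

with open('big_string', 'r') as f:
    text = f.read()

def search_sch(pattern, text=text):
    # Same cumulative '##' delimiter count as in the question.
    delim_index = []
    last_found = 0
    for match in re.finditer(pattern, text):
        count_delims = len(re.findall('##', text[last_found:match.start()]))
        if delim_index:
            count_delims += delim_index[-1]
        delim_index.append(count_delims)
        last_found = match.end()
    return delim_index

if __name__ == '__main__':
    # The guard keeps the spawned children from re-executing the
    # Pool creation when they import this module.
    with Pool(8) as pool:
        matches = pool.map(search_sch, sch[:100])

If the per-pattern work is small, passing a chunksize to map (e.g. pool.map(search_sch, sch[:100], chunksize=10)) cuts down on inter-process communication overhead. Note also that this tends not to work from an interactive shell such as IPython, because the spawned workers have no importable main module from which to find search_sch; running the script from a command prompt avoids that.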

  • Now it keeps running for >10min on Windows 7, python 3 (anaconda 4.4), have to force stop. – bob Dec 02 '17 at 05:41
  • Running it from ipython console was the issue. From command prompt, works fine. – bob Dec 06 '17 at 04:34