6

I have a program where I need to compile several thousand large regexes, all of which will be used many times. Problem is, it takes too long (according to cProfile, 113 secs) to re.compile() them. (BTW, actually searching using all of these regexes takes < 1.3 secs once compiled.)

If I don't precompile, it just postpones the problem to when I actually search, since re.search(expr, text) implicitly compiles expr. Actually, it's worse, because re is going to recompile the entire list of regexes every time I use them.
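To make that concrete, here's a minimal sketch (the patterns are made up, not my real list): re keeps only a small internal cache of compiled patterns (re._MAXCACHE is 100 in Python 2.7), so with thousands of distinct expressions the cache keeps evicting and every pass over the list pays the full compile cost again.

import re

patterns = [r"\b%d\b" % i for i in range(5000)]  # stand-in for the real list
text = "nothing in here matches"

for expr in patterns:
    # re.search() compiles expr on first use and caches it, but with far
    # more than 100 distinct patterns the cache thrashes, so the next pass
    # over the list recompiles everything.
    re.search(expr, text)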

I tried using multiprocessing, but that actually slows things down. Here's a small test to demonstrate:

## rgxparallel.py ##
import re
import multiprocessing as mp

def serial_compile(strings):
    # Compile every pattern in the main process, one at a time.
    return [re.compile(s) for s in strings]

def parallel_compile(strings):
    # Compile in a pool of worker processes; each compiled pattern is
    # pickled in the worker and sent back to the parent.
    print("Using {} processors.".format(mp.cpu_count()))
    pool = mp.Pool()
    result = pool.map(re.compile, strings)
    pool.close()
    return result

# 100,000 trivial numeric "patterns" to compile (Python 2, hence xrange).
l = map(str, xrange(100000))

And my test script:

#!/bin/sh
python -m timeit -n 1 -s "import rgxparallel as r" "r.serial_compile(r.l)"
python -m timeit -n 1 -s "import rgxparallel as r" "r.parallel_compile(r.l)"
# Output:
#   1 loops, best of 3: 6.49 sec per loop
#   Using 4 processors.
#   Using 4 processors.
#   Using 4 processors.
#   1 loops, best of 3: 9.81 sec per loop

I'm guessing that the parallel version is:

  1. In parallel, compiling and pickling the regexes, ~2 secs
  2. In serial, un-pickling, and therefore recompiling them all, ~6.5 secs

Together with the overhead of starting and stopping the processes, multiprocessing on 4 processors ends up about 50% slower than serial (9.81 s vs. 6.49 s).
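One way to check the unpickling part of that guess (just a sketch): the re module registers a copy_reg/copyreg reducer that boils a compiled pattern down to its pattern string and flags, so pickle.loads() has to call re.compile() again in the receiving process.

import pickle
import re

p = re.compile(r"(foo|bar)\d+")
data = pickle.dumps(p)   # serializes only p.pattern and p.flags
q = pickle.loads(data)   # re.compile() runs again here, in this process
print(q.pattern)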

I also tried divvying up the list of regexes into 4 sub-lists, and pool.map-ing the sublists, rather than the individual expressions. This gave a small performance boost, but I still couldn't get better than ~25% slower than serial.
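The chunked version looked roughly like this (a sketch, not the exact code):

import re
import multiprocessing as mp

def compile_chunk(strings):
    # Compile a whole sub-list inside one worker, so only one pickled
    # result per chunk crosses the process boundary.
    return [re.compile(s) for s in strings]

def parallel_compile_chunked(strings, n_chunks=4):
    size = (len(strings) + n_chunks - 1) // n_chunks   # ceiling division
    chunks = [strings[i:i + size] for i in range(0, len(strings), size)]
    pool = mp.Pool(n_chunks)
    results = pool.map(compile_chunk, chunks)
    pool.close()
    # Flatten the per-chunk lists back into one list of compiled patterns.
    return [p for chunk in results for p in chunk]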

Is there any way to compile faster than serial?

EDIT: Corrected the running time of the regex compilation.

I also tried using threading, but due to the GIL, only one processor was used. It was slightly better than multiprocessing (130 secs vs. 136 secs), but still slower than serial (113 secs).

EDIT 2: I realized that some regexes were likely to be duplicated, so I added a dict to cache them. This shaved off ~30 secs. I'm still interested in parallelizing, though: the target machine has 8 processors, which could in principle cut compilation time to ~15 secs.
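The dedup cache is nothing fancy, just a dict keyed on the pattern string (sketch):

import re

_compiled = {}

def compile_cached(expr):
    # Each distinct pattern string is compiled exactly once; duplicates
    # come back out of the dict instead of being recompiled.
    try:
        return _compiled[expr]
    except KeyError:
        pattern = _compiled[expr] = re.compile(expr)
        return pattern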

Chaim Leib Halbert
  • How come you have so many large regexes and only do so little searching with them? Can you simplify them, perhaps replace them with plain old string manipulation, or avoid running some of them at all? –  Aug 15 '13 at 23:06
  • The time for searching is for a single use of the entire list. It is very important that the time for a single list search is small, because the user (and my employer) will be expecting near-instant response. I tried simplifying as much as I could, and this is the best I could get without cutting out major features. (The actual list of search terms is ~200,000 items; I have code that switches to simple string functions whenever possible, but that still leaves ~5,000 regexes.) – Chaim Leib Halbert Aug 15 '13 at 23:15
  • Have you tried using threads instead? 1 thread per CPU and the regexes divvied up among them? regex is implemented in C so you should get a decent level of parallelism despite the GIL. – tdelaney Aug 15 '13 at 23:23
  • I should link that http://xkcd.com/1171/ =) – kirilloid Aug 15 '13 at 23:26
  • I was going to try that, but I was put off by this warning in the threading docs (I'm using CPython): "In CPython, due to the Global Interpreter Lock, only one thread can execute Python code at once (even though certain performance-oriented libraries might overcome this limitation). If you want your application to make better use of the computational resources of multi-core machines, you are advised to use multiprocessing. However, threading is still an appropriate model if you want to run multiple I/O-bound tasks simultaneously." – Chaim Leib Halbert Aug 15 '13 at 23:26
  • Your question is interesting from a technical perspective, but my gut feeling, just from reading it, is that we're talking about premature optimization. It's out of scope for this question, but maybe there's a better way to solve your problem than a solution that requires compiling >5000 regular expressions in parallel. Have the expressions been manually written, or are they autogenerated and could maybe be simplified? – koks der drache Oct 07 '19 at 06:01
  • The question is from an old contract job, so I no longer have to solve it. But the regexes were autogenerated. There were various wildcard keys to indicate spelled-out numbers and things, so the original problem might have been better solved by an NLP library. – Chaim Leib Halbert Oct 07 '19 at 21:16
  • In other work, I wrote a parser for robots.txt patterns based on http://www.cs.princeton.edu/courses/archive/spr09/cos333/beautiful.html, which might be useful to devs only needing a subset of Regex functionality. – Chaim Leib Halbert Oct 07 '19 at 21:18

2 Answers

0

There are lighter-weight alternatives to multiprocessing for running tasks asynchronously, such as threads and coroutines. Although Python 2 is limited in its ability to run things simultaneously, Python 3 makes extensive use of such asynchronous implementations in its fundamental types. Just run your code with Python 3 and you will see the difference:

$ python2 --version
Python 2.7.17
$ python2 -m timeit -n 1 -s "import rgxparallel as r" "r.serial_compile(r.l)"
1 loops, best of 3: 3.65 sec per loop
$ python -m timeit -n 1 -s "import multire as r" "r.parallel_compile(r.l)"
1 loops, best of 3: 3.95 sec per loop

$ python3 --version
Python 3.6.9
$ python3 -m timeit -n 1 -s "import multire as r" "r.serial_compile(r.l)"
1 loops, best of 3: 0.72 usec per loop
$ python3 -m timeit -n 1 -s "import multire as r" "r.parallel_compile(r.l)"
...
1 loops, best of 3: 4.51 msec per loop

Do not forget to change xrange to range for the Python 3 version.
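That is, the only line that has to change in the test module is (sketch):

l = map(str, range(100000))  # in Python 3, map() returns a lazy iterator; wrap in list(...) if you need an actual list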

Tim
-2

As much as I love Python, I think the solution is to do it in Perl (see this speed comparison, for example), or C, etc.

If you want to keep the main program in Python, you could use subprocess to call a Perl script (just make sure to pass as many values as possible in as few subprocess calls as possible, to avoid overhead).
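A rough sketch of that approach (the Perl script name and its read-patterns-from-stdin protocol are made up):

import subprocess

patterns = [r"\b%d\b" % i for i in range(5000)]  # placeholder patterns

# Hypothetical helper: match_all.pl would read one pattern per line on stdin,
# run all the matching on the Perl side, and write results to stdout, so the
# subprocess overhead is paid once rather than per pattern.
proc = subprocess.Popen(["perl", "match_all.pl"],
                        stdin=subprocess.PIPE,
                        stdout=subprocess.PIPE)
out, _ = proc.communicate("\n".join(patterns).encode())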

Brionius