Speeding up Python regex compilation across many regexes with duplicated content

Question

I have thousands of Python regexes that look something like:

r'...(?:uuid)...'

Where ... isn't literally three regex wildcards, but rather represents long strings with slow compilation time.

The strings these regexes are matching may contain multiple uuids from the same set of uuids, but for each string there is only one specific uuid that is supposed to match; this is why there is currently a different regex for each string. For example, if the string is like:

'...uuid1...uuid2...uuid3...'

We might be looking for only uuid2, so it's not a simple matter of just writing a regex like:

r'...(?:uuid1|uuid2|uuid3)...'

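The obvious solution would be to start by compiling these regexes in separate processes and returning the compiled regexes as pickled objects, but it looks like Python doesn't really support that: as far as I can tell, pickling a compiled pattern stores only its source string and flags, so it gets recompiled when it is unpickled.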
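
To make the pickling limitation concrete, here is a minimal sketch of what CPython appears to store when a compiled pattern is pickled. The pattern itself is a made-up stand-in, and the `copyreg.dispatch_table` lookup relies on how the `re` module registers its reducer in CPython, which is an implementation detail rather than a documented guarantee.

```python
import copyreg
import pickle
import re

# Hypothetical stand-in for one of the slow-to-compile patterns.
pattern = re.compile(r'(?:a|b)+c')

# CPython's re module registers a reducer for compiled patterns.
# It returns a rebuild callable plus (source string, flags),
# not the compiled program itself.
rebuild, args = copyreg.dispatch_table[type(pattern)](pattern)

# Round-tripping works, but unpickling goes back through compilation
# (or the in-process re cache), so no compile time is saved.
restored = pickle.loads(pickle.dumps(pattern))
```

So a worker process can hand back a "compiled" pattern, but the receiving process still pays for compilation when it loads it.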

One solution that might work would be splitting this into 3 regexes:

r'...'
r'(?:uuid)'
r'...'

There would still be thousands of regexes to compile, but at least they would each compile somewhat faster. However, I'd have to write a bunch of difficult-to-understand functions to implement the matching.
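
A rough sketch of what that matching helper could look like, assuming the shared parts can be compiled once and reused. `PREFIX`, `SUFFIX` and `match_three_parts` are hypothetical names, and the short patterns are placeholders for the real `...` parts. Note that this is not fully equivalent to one combined regex: each part is matched on its own, so there is no backtracking across the boundaries.

```python
import re

# Hypothetical stand-ins for the long "..." parts, compiled only once.
PREFIX = re.compile(r'[a-z]+-')
SUFFIX = re.compile(r'-[0-9]+')


def match_three_parts(text, uuid_pattern, start=0):
    """Match PREFIX, then uuid_pattern, then SUFFIX back to back.

    Returns (start, end) of the overall match, or None.
    Each part is matched greedily on its own; a later part failing
    does not make an earlier part retry with a shorter match.
    """
    m1 = PREFIX.match(text, start)
    if m1 is None:
        return None
    m2 = uuid_pattern.match(text, m1.end())
    if m2 is None:
        return None
    m3 = SUFFIX.match(text, m2.end())
    if m3 is None:
        return None
    return (m1.start(), m3.end())


# The per-string regex is now tiny and cheap to compile.
wanted = re.compile(re.escape('uuid2'))
span = match_three_parts('abc-uuid2-42', wanted)
miss = match_three_parts('abc-uuid1-42', wanted)
```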

A second solution might be doing something like:

r'...(uuid1|uuid2|uuid3)...'

But then using finditer and for each match, checking to see if it contains the correct uuid, and if not just going to the next match. I don't really like this solution either, but it doesn't seem completely unreasonable to try it and measure the performance.
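
A minimal sketch of this second idea, assuming one shared pattern with the uuid in a capture group. `UUIDS`, `SHARED` and `find_wanted` are hypothetical names, and the angle brackets stand in for the real `...` context. One thing to measure or test for: `finditer` yields non-overlapping matches, so a match around the wrong uuid could consume text that the right match needs.

```python
import re

# Hypothetical placeholder uuids; real ones would come from the data.
UUIDS = ['uuid1', 'uuid2', 'uuid3']

# One shared pattern compiled once; '<' and '>' stand in for "...".
SHARED = re.compile(r'<(' + '|'.join(map(re.escape, UUIDS)) + r')>')


def find_wanted(text, wanted):
    """Return the first match whose captured uuid equals `wanted`."""
    for m in SHARED.finditer(text):
        if m.group(1) == wanted:
            return m
    return None


text = 'x <uuid1> y <uuid2> z <uuid3>'
hit = find_wanted(text, 'uuid2')
```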

Am I missing a better way of doing this? All of these solutions seem frankly not that good.

Check [Speed up millions of regex replacements in Python 3](https://stackoverflow.com/questions/42742810/speed-up-millions-of-regex-replacements-in-python-3) — Wiktor Stribiżew, Aug 25 '20 at 20:03
@WiktorStribiżew In my case the matching is already basically instantaneous, it's the compilation that's slow. — Alex3917, Aug 25 '20 at 20:29
Use `str.split(uuid)` to find the uuid if it exists, and its context. Have two compiled regexes: the first part anchored at the end, and the second part anchored at the beginning. Apply these to the two parts of the split. (If there's any possibility of the uuid matching multiple times, apply the two regexes to each overlapping pair of strings in the split results.) — jasonharper, Aug 25 '20 at 20:45
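
The `str.split` suggestion in the last comment could be sketched roughly as follows. `BEFORE`, `AFTER` and `contains_wanted` are hypothetical names, and the short patterns again stand in for the real `...` parts. This assumes the context on either side never needs to span another occurrence of the same uuid, since each piece of the split only reaches as far as the neighbouring occurrence.

```python
import re

# Hypothetical stand-ins for the two "..." parts, compiled once.
BEFORE = re.compile(r'[a-z]+-\Z')  # first part, anchored at the end
AFTER = re.compile(r'-[0-9]+')     # second part, anchored at the start via .match()


def contains_wanted(text, uuid):
    """Check whether `uuid` occurs in `text` with the expected context."""
    parts = text.split(uuid)
    # Each adjacent pair is (text before an occurrence, text after it).
    for before, after in zip(parts, parts[1:]):
        if BEFORE.search(before) and AFTER.match(after):
            return True
    return False
```

With this shape no per-uuid regex is compiled at all; the uuid is only ever used as a plain string.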
