I have documents like:

documents = [
    "I work on c programing.",
    "I work on c coding.",
]

I have a synonym file such as:

synonyms = {
    "c programing": "c programing",
    "c coding": "c programing"
}

I want to replace all synonyms for which I wrote this code:

import re

# pre-compile all regexes once to save compilation time. credits alec_djinn

compiled_dict = {}
for value in synonyms:
    compiled_dict[value] = re.compile(r'\b' + re.escape(value) + r'\b')

for doc in documents:
    document = doc
    for value, pattern in compiled_dict.items():
        document = pattern.sub(synonyms[value], document)
    print(document)

Output:

I work on c programing.
I work on c programing.

But since there are a few million documents and tens of thousands of synonym terms, this code is expected to take roughly 10 days to finish.

Is there a faster way to do this?

PS: I want to train a word2vec model with the output.

Any help is greatly appreciated. I was thinking of writing some CPython code and running it in parallel threads.

Charles Clayton
Vikash Singh
What is `skills_mapping[val]` referring to? – Amey Dahale May 25 '17 at 10:42
  • @AmeyDahale corrected the code. Thanks for pointing it out.. – Vikash Singh May 25 '17 at 10:44
  • `val` and `value` are not the same. – trincot May 25 '17 at 10:46
  • let's elaborate: should the crucial word like `c++` be placed after `on` word and at the end of the string? If the format is fixed, I can suggest some solution – RomanPerekhrest May 25 '17 at 10:48
  • The example would not work, since the final `\b` does not match between `++` and `.`. The `++` would need to be followed by a digit, letter or underscore if `c ++` would have to match with that pattern. – trincot May 25 '17 at 10:48
try making `for document in documents` the inner loop over the synonyms, and move the regexp construction outside the document loop. See [this](https://stackoverflow.com/q/23107563/5831538) post – Oleksandr Muliar May 25 '17 at 10:51
  • @OleksandrMuliar thanks thats one good approach. Currently I keep a map of all regex object and use that. Removed it from question for simplification. – Vikash Singh May 25 '17 at 11:13
  • Do not try this using threads - multiple threads on Python never run in parallel due to GIL and execute slower than a single thread due to context switches. Second, what are your 'documents'? Files? Third, why are you using regex for simple replacements (let alone compiling a new one for each value) - Python's native string replace would work significantly faster than running the regex engine. – zwer May 25 '17 at 11:23
  • 2
    Strongly related: https://stackoverflow.com/a/48600345/4909087 Scroll down to unutbu's answer detailing the exact same thing. – cs95 May 04 '18 at 15:23

4 Answers


I have done string replacement jobs like this before, also for training word2vec models on very large text corpora. When the number of terms to replace (your "synonym terms") is very large, it can make sense to do string replacement using the Aho-Corasick algorithm instead of looping over many single string replacements. You can take a look at my fsed utility (written in Python), which might be useful to you.
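A full Aho-Corasick automaton (with failure links) is more involved, but a simplified trie that scans each document in a single left-to-right pass, always taking the longest match that ends at a word boundary, captures the core idea: the cost per document no longer depends on the number of synonym terms. A minimal, stdlib-only sketch (the function names and the word-boundary rule are my own assumptions, not fsed's API):

```python
import string

WORD_CHARS = set(string.ascii_letters + string.digits + '_')

def build_trie(synonyms):
    """Build a nested-dict trie; the special '_end' key stores the replacement."""
    trie = {}
    for phrase, repl in synonyms.items():
        node = trie
        for ch in phrase:
            node = node.setdefault(ch, {})
        node['_end'] = repl
    return trie

def replace_all(text, trie):
    """Single left-to-right pass; at each word boundary, try the longest
    trie match that also ends at a word boundary, mimicking \\b...\\b."""
    out = []
    i, n = 0, len(text)
    while i < n:
        # only start a match at a word boundary
        if i == 0 or text[i - 1] not in WORD_CHARS:
            node, j, best = trie, i, None
            while j < n and text[j] in node:
                node = node[text[j]]
                j += 1
                if '_end' in node and (j == n or text[j] not in WORD_CHARS):
                    best = (j, node['_end'])  # remember the longest match so far
            if best:
                out.append(best[1])
                i = best[0]
                continue
        out.append(text[i])
        i += 1
    return ''.join(out)

synonyms = {"c programing": "c programing", "c coding": "c programing"}
trie = build_trie(synonyms)
print(replace_all("I work on c coding.", trie))  # -> I work on c programing.
```

Unlike the regex loop in the question, this visits each character of each document roughly once, regardless of how many tens of thousands of terms the trie holds.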

wildwilhelm
  • Hey, I tried Aho-Corasick Algo it worked wonders. Thanks. I had to write my own implementation as well https://github.com/vi3k6i5/synonym-extractor/ – Vikash Singh Jul 02 '17 at 18:23

Steps I'd take:

  1. Create a direct algorithm without regex. Maybe even generate code directly from the synonyms.
  2. Partition the work across the documents so that you can run this algorithm on N/x documents at a time, splitting to make the most of the available parallel resources (e.g. x = 4 if you have 4 cores), and run it using a parallel approach (note: avoid using threads).
  3. Maybe use a library to help run this in parallel and, if you have the resources, on multiple nodes (e.g. using Spark).
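Steps 1–2 above can be sketched with `multiprocessing.Pool` (processes, not threads, so the GIL is not an issue). The `replace_in_doc` worker below is a hypothetical stand-in for whatever per-document algorithm you settle on; plain `str.replace` is used only for illustration and has no word-boundary check:

```python
from multiprocessing import Pool

synonyms = {"c programing": "c programing", "c coding": "c programing"}

def replace_in_doc(doc):
    # hypothetical worker: apply your chosen replacement routine to one document
    # (str.replace stands in here; note it does not respect word boundaries)
    for phrase, repl in synonyms.items():
        doc = doc.replace(phrase, repl)
    return doc

if __name__ == "__main__":
    documents = ["I work on c programing.", "I work on c coding."] * 4
    with Pool(processes=4) as pool:
        # chunksize batches documents per worker to cut IPC overhead
        documents = pool.map(replace_in_doc, documents, chunksize=2)
    print(documents[1])  # -> I work on c programing.
```

With millions of documents you would stream them in batches rather than hold the whole corpus in a list, but the partitioning idea is the same.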
Ovidiu Dolha
  • To add the warning here as well: regarding #2 - do not use threads for this, it will work slower than a single thread no matter how many cores you have. – zwer May 25 '17 at 11:24
  • I wrote an implementation based on @wildwilhelm 's recommendation https://github.com/vi3k6i5/synonym-extractor/. Which is running much faster. Thanks for all your help though. – Vikash Singh Jul 02 '17 at 18:24

I would precompile all the regex patterns and put them in a dict. That way you avoid compiling the same pattern over and over, which saves a lot of time.

Your main loop would then become:

import re

compiled_dict = {}
for value in synonyms:
    compiled_dict[value] = re.compile(r'\b' + re.escape(value) + r'\b')

for document in documents:
    for value in synonyms:
        pattern = compiled_dict[value]
        document = pattern.sub(synonyms[value], document)
    print(document)
alec_djinn
  • I have done this. But to simplify the question I took it out of code. But valid point. It saves time. – Vikash Singh May 25 '17 at 12:38
  • I wrote an implementation based on @wildwilhelm 's recommendation https://github.com/vi3k6i5/synonym-extractor/. Which is running much faster. Thanks for all your help though. – Vikash Singh Jul 02 '17 at 18:24

First of all, the code you provided will not find `c++` in "c++.", as the terminating dot does not give a match with `\b`. Instead of `\b` you could use `(?!\w)`.
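A quick illustration of the difference, using a hypothetical `c++` synonym:

```python
import re

text = "I work on c++."
# \b requires a word character adjacent to the boundary, and neither
# '+' nor '.' is one, so the trailing \b never matches after '++':
print(re.sub(r'\bc\+\+\b', 'cpp', text))     # -> I work on c++.
# (?!\w) only asserts that no word character follows, so '.' is fine:
print(re.sub(r'\bc\+\+(?!\w)', 'cpp', text)) # -> I work on cpp.
```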

Assuming that most synonyms will be single words (without spaces and special characters), you could get some optimisation by looking at each actual word in the document and replacing it when it occurs as such in the synonyms list.

All the remaining synonyms keys (hopefully a minority) could then be treated like you did, but by first turning the keys into their regex equivalent.

Here is how that would look:

import re

# Get those synonyms that are not single words and turn them into regexes:
# Don't use \b to end a pattern; just require that no \w should follow 
complex_synonyms = [(r'\b' + re.escape(key) + r'(?!\w)', synonyms[key]) for key in synonyms if not re.match(r'[\w+]+$', key)]

for i, document in enumerate(documents):
    # Deal with the easy cases (words) in one go, by checking each word in the document
    document = re.sub(r'[\w+]+', lambda word: synonyms[word[0]] if word[0] in synonyms else word[0], document)
    # Replace the remaining synonyms by using regular expressions
    for find, repl in complex_synonyms:
        document = re.sub(find, repl, document)
    # Store the result back into the document
    documents[i] = document
trincot
Updated my example so that it works. I had written a custom example for it. The problem is not the regex, but rather scaling the matching and replacing of terms. – Vikash Singh May 25 '17 at 13:19
  • Of course, the scaling is what I got from your question. Did you try this approach? If most of your synonym keys are words (no spaces, no punctuation), I expect this to be faster. – trincot May 25 '17 at 19:30
  • So I understand that one word synonyms can be optimised like this. Then I guess 2 word synonyms can be optimised by 2 grams and 3 word synonyms with 3 grams. That part is cool. But won't be as fast as I need it to be. Still I will try that. With a good tokenizer. – Vikash Singh May 25 '17 at 19:32
  • I wrote an implementation based on @wildwilhelm 's recommendation https://github.com/vi3k6i5/synonym-extractor/. Which is running much faster. Thanks for all your help though. – Vikash Singh Jul 02 '17 at 18:25