I have documents like:
documents = [
    "I work on c programing.",
    "I work on c coding.",
]
I have a synonym file such as:
synonyms = {
    "c programing": "c programing",
    "c coding": "c programing"
}
I want to replace every synonym with its canonical term, so I wrote this code:
import re

# Pre-compile every regex once to save compilation time (credit: alec_djinn).
compiled_dict = {}
for value in synonyms:
    compiled_dict[value] = re.compile(r'\b' + re.escape(value) + r'\b')

for doc in documents:
    document = doc
    for value in compiled_dict:
        pattern = compiled_dict[value]  # compiled regex for this synonym
        document = pattern.sub(synonyms[value], document)
    print(document)
Output:
I work on c programing.
I work on c programing.
But since there are a few million documents and tens of thousands of synonym terms, I expect this code to take approximately 10 days to finish.
Is there a faster way to do this?
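For illustration, one candidate is to merge all terms into a single alternation so that each document is scanned once instead of once per synonym. This is only a minimal sketch, not benchmarked on the real data. The names `combined` and `replace_synonyms` are hypothetical, and it assumes that replacements never need to chain (the loop above applies substitutions sequentially, so the two can differ if one replacement produces text that another term matches).

```python
import re

documents = [
    "I work on c programing.",
    "I work on c coding.",
]
synonyms = {
    "c programing": "c programing",
    "c coding": "c programing",
}

# Longest terms first, so a longer term wins over a shorter one that is its prefix.
terms = sorted(synonyms, key=len, reverse=True)

# One pattern that matches any of the terms on word boundaries.
combined = re.compile(r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b")

def replace_synonyms(text):
    # The callback looks up whichever term matched, so the text is scanned once.
    return combined.sub(lambda m: synonyms[m.group(0)], text)

results = [replace_synonyms(doc) for doc in documents]
for line in results:
    print(line)
```

With tens of thousands of alternatives the regex engine still has to try the alternatives at each position, so whether this is fast enough would have to be measured.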
PS: I want to use the output to train a word2vec model.
Any help is greatly appreciated. I was thinking of writing some Cython code and running it in parallel threads.
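A rough sketch of what the parallel idea could look like, under a few assumptions: in CPython the `re` module holds the global interpreter lock while matching, so threads are unlikely to help with this workload, and the sketch uses worker processes from `multiprocessing` instead. The function names, `processes=2`, and `chunksize=1000` are placeholders chosen for illustration, not tuned values.

```python
import re
from multiprocessing import Pool

synonyms = {
    "c programing": "c programing",
    "c coding": "c programing",
}

# Compiled at import time, so each worker process has its own copy of the patterns.
compiled = {term: re.compile(r"\b" + re.escape(term) + r"\b") for term in synonyms}

def replace_synonyms(text):
    # Same per-term substitution loop as in the question, applied to one document.
    for term, pattern in compiled.items():
        text = pattern.sub(synonyms[term], text)
    return text

def replace_in_parallel(docs, processes=2, chunksize=1000):
    # Documents are independent, so they can be split across worker processes.
    with Pool(processes=processes) as pool:
        return pool.map(replace_synonyms, docs, chunksize=chunksize)

if __name__ == "__main__":
    documents = ["I work on c programing.", "I work on c coding."]
    for line in replace_in_parallel(documents):
        print(line)
```

This only divides the total time by roughly the number of cores at best; it does not reduce the amount of work done per document.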