Given a string and a list of substrings that should be replaced with placeholders, e.g.
import re
from copy import copy
phrases = ["'s morgen", "'s-Hertogenbosch", "depository financial institution"]
original_text = "Something, 's morgen, ik 's-Hertogenbosch im das depository financial institution gehen"
The first goal is to replace the substrings from phrases in original_text with indexed placeholders, e.g.
text = copy(original_text)
backplacement = {}
for i, phrase in enumerate(phrases):
    backplacement["MWEPHRASE{}".format(i)] = phrase.replace(' ', '_')
    text = re.sub(r"{}".format(phrase), "MWEPHRASE{}".format(i), text)
print(text)
[out]:
Something, MWEPHRASE0, ik MWEPHRASE1 im das MWEPHRASE2 gehen
Then there will be some functions that manipulate the text containing the placeholders, e.g.
cleaned_text = func('Something, MWEPHRASE0, ik MWEPHRASE1 im das MWEPHRASE2 gehen')
print(cleaned_text)
that outputs:
MWEPHRASE0 ik MWEPHRASE1 MWEPHRASE2
The last step is to reverse the replacement and put the original phrases back, i.e.
' '.join([backplacement[tok] if tok in backplacement else tok for tok in cleaned_text.split()])
[out]:
"'s_morgen ik 's-Hertogenbosch depository_financial_institution"
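For illustration, the backward step could also be written as one regex pass instead of a token-by-token lookup. This is only a sketch: the `restore` helper and `placeholder_re` are made-up names, the mapping is copied by hand from the example above, and it assumes every placeholder has the form MWEPHRASE followed by digits.

```python
import re

# Mapping built during the forward step (values copied from the example above).
backplacement = {
    "MWEPHRASE0": "'s_morgen",
    "MWEPHRASE1": "'s-Hertogenbosch",
    "MWEPHRASE2": "depository_financial_institution",
}

# One pattern matches any placeholder; \d+ is greedy, so MWEPHRASE10
# is matched as a whole and not as MWEPHRASE1 followed by "0".
placeholder_re = re.compile(r"MWEPHRASE\d+")


def restore(text):
    """Put the original phrases back in a single pass over the text."""
    # Placeholders missing from the mapping are left untouched
    # (an assumption made for this sketch).
    return placeholder_re.sub(
        lambda m: backplacement.get(m.group(0), m.group(0)), text
    )


print(restore("MWEPHRASE0 ik MWEPHRASE1 MWEPHRASE2"))
# 's_morgen ik 's-Hertogenbosch depository_financial_institution
```

Unlike the split/join version, this also restores a placeholder that has punctuation attached to it, e.g. "MWEPHRASE0,".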
The questions are:
- If the list of substrings in phrases is huge, the first replacement and the final backplacement take a very long time.
Is there a way to do the replacement/backplacement with a regex?
- Using the re.sub(r"{}".format(phrase), "MWEPHRASE{}".format(i), text) regex substitution isn't very helpful, especially if some of the phrases match only part of a word, e.g.
phrases = ["org", "'s-Hertogenbosch", "depository financial institution"]
original_text = "Something, 's morgen, ik 's-Hertogenbosch im das depository financial institution gehen"
backplacement = {}
text = copy(original_text)
for i, phrase in enumerate(phrases):
    backplacement["MWEPHRASE{}".format(i)] = phrase.replace(' ', '_')
    text = re.sub(r"{}".format(phrase), "MWEPHRASE{}".format(i), text)
print(text)
we get an awkward output:
Something, 's mMWEPHRASE0en, ik MWEPHRASE1 im das MWEPHRASE2 gehen
I've tried using r'\b{}\b'.format(phrase), but that didn't work for the phrases with punctuation, i.e.
phrases = ["'s morgen", "'s-Hertogenbosch", "depository financial institution"]
original_text = "Something, 's morgen, ik 's-Hertogenbosch im das depository financial institution gehen"
backplacement = {}
text = copy(original_text)
for i, phrase in enumerate(phrases):
    backplacement["MWEPHRASE{}".format(i)] = phrase.replace(' ', '_')
    text = re.sub(r"\b{}\b".format(phrase), "MWEPHRASE{}".format(i), text)
print(text)
[out]:
Something, 's morgen, ik 's-Hertogenbosch im das MWEPHRASE2 gehen
Is there some way to denote the word boundary for the phrases in the re.sub regex pattern?
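To make both questions concrete, here is a rough sketch of the kind of thing I have in mind: one compiled alternation over all phrases, with the lookarounds (?<!\w) and (?!\w) standing in for \b so that a phrase starting with an apostrophe can still match. The names `to_placeholders`, `index` and `ordered` are made up for this sketch, and I have not measured whether it is fast enough for a huge phrase list.

```python
import re

phrases = ["'s morgen", "'s-Hertogenbosch", "depository financial institution"]
original_text = ("Something, 's morgen, ik 's-Hertogenbosch im das "
                 "depository financial institution gehen")

# Longest phrases first, so a longer phrase wins over a shorter one
# that happens to be its prefix.
ordered = sorted(phrases, key=len, reverse=True)

# re.escape keeps punctuation inside a phrase from being read as regex syntax.
alternation = "|".join(re.escape(p) for p in ordered)

# (?<!\w) / (?!\w): no word character directly before / after the phrase.
# Unlike \b, this does not require the phrase itself to start or end
# with a word character.
pattern = re.compile(r"(?<!\w)(?:{})(?!\w)".format(alternation))

# Phrase -> index, and placeholder -> underscored phrase for the way back.
index = {p: i for i, p in enumerate(phrases)}
backplacement = {"MWEPHRASE{}".format(i): p.replace(" ", "_")
                 for p, i in index.items()}


def to_placeholders(text):
    """Replace every listed phrase with its placeholder in one pass."""
    return pattern.sub(
        lambda m: "MWEPHRASE{}".format(index[m.group(0)]), text
    )


print(to_placeholders(original_text))
# Something, MWEPHRASE0, ik MWEPHRASE1 im das MWEPHRASE2 gehen
```

With phrases = ["org", ...] the same pattern would leave "morgen" alone, because "org" is preceded by the word character "m".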