I have 10M texts (they fit in RAM) and a Python dictionary of the kind:
"old substring": "new substring"
The dictionary contains ~15k substrings.
I am looking for the FASTEST way to apply the dict to each text (i.e. to find every "old substring" in every text and replace it with its "new substring").
The source texts are in a pandas DataFrame. So far I have tried these approaches:
1) Replace in a loop with reduce and str.replace (~120 rows/sec):
from functools import reduce

replaced = []
for row in df.itertuples():
    replaced.append(reduce(lambda x, y: x.replace(y, mapping[y]), mapping, row[1]))
2) In a loop with a simple replace function ("mapping" is the 15k dict) (~160 rows/sec):
from tqdm import tqdm

def string_replace(text):
    for key in mapping:
        text = text.replace(key, mapping[key])
    return text

replaced = []
for row in tqdm(df.itertuples()):
    replaced.append(string_replace(row[1]))
Also, .iterrows() is about 20% slower than .itertuples().
3) Using apply on the Series (also ~160 rows/sec):
replaced = df['text'].apply(string_replace)
At these speeds it takes hours to process the whole dataset.
Does anyone have experience with this kind of mass substring replacement? Is it possible to speed it up? The solution can be tricky or ugly, but it has to be as fast as possible, and it does not have to use pandas.
Thanks.
UPDATED: Toy data to check the idea:
import pandas as pd

df = pd.DataFrame({"old": ["first text to replace",
                           "second text to replace"]})

mapping = {"first text": "FT",
           "replace": "rep",
           "second": "2nd"}
Expected result:
                      old         replaced
0   first text to replace        FT to rep
1  second text to replace  2nd text to rep
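For reference, a minimal sketch that reproduces the expected output on the toy data, reusing the toy df and mapping above together with the string_replace function from attempt 2:

# build the "replaced" column by applying the per-key replacement to the "old" column
df["replaced"] = df["old"].apply(string_replace)
print(df)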