
I have 10M texts (they fit in RAM) and a Python dictionary of the form:

"old substring":"new substring"

The dictionary contains ~15k substrings.

I am looking for the FASTEST way to apply the dict to every text (i.e. find every "old substring" in every text and replace it with "new substring").

The source texts are in a pandas DataFrame. So far I have tried these approaches:

1) Replace in a loop with reduce and str.replace (~120 rows/sec):

from functools import reduce

replaced = []
for row in df.itertuples():
    replaced.append(reduce(lambda x, y: x.replace(y, mapping[y]), mapping, row[1]))

2) In a loop with a simple replace function ("mapping" is the 15k dict) (~160 rows/sec):

def string_replace(text):
    for key in mapping:
        text = text.replace(key, mapping[key])
    return text

from tqdm import tqdm

replaced = []
for row in tqdm(df.itertuples()):
    replaced.append(string_replace(row[1]))

Also, .iterrows() is ~20% slower than .itertuples().

3) Using apply on Series (also ~160 rows/sec):

replaced = df['text'].apply(string_replace)

At these speeds it takes hours to process the whole dataset.

Does anyone have experience with this kind of mass substring replacement? Is it possible to speed it up? The solution can be tricky or ugly, but it has to be as fast as possible; it does not have to use pandas.

Thanks.

UPDATED: Toy data to check the idea:

df = pd.DataFrame({"old": ["first text to replace",
                           "second text to replace"]})

mapping = {"first text": "FT", 
           "replace": "rep",
           "second": '2nd'}

Expected result:

                      old         replaced
0   first text to replace        FT to rep
1  second text to replace  2nd text to rep
Alexey Trofimov
    Check [Replace values in pandas Series with dictionary](https://stackoverflow.com/questions/40075106/replace-values-in-pandas-series-with-dictionary). – Wiktor Stribiżew Sep 15 '17 at 11:24
  • Thank you Wiktor, I now see the regex=True idea, but it is much slower than the simple approaches in the original post. – Alexey Trofimov Sep 15 '17 at 11:55

3 Answers


I've come back to this problem and found a fantastic library called flashtext.

The speedup on 10M records with a 15k vocabulary is about 100x (really, one hundred times faster than regex or the other approaches from my original post)!

Very easy to use:

df = pd.DataFrame({"old": ["first text to replace",
                           "second text to replace"]})

mapping = {"first text": "FT", 
           "replace": "rep",
           "second": '2nd'}

import flashtext
processor = flashtext.KeywordProcessor()

for k, v in mapping.items():
    processor.add_keyword(k, v)

print(list(map(processor.replace_keywords, df["old"])))

Result:

['FT to rep', '2nd text to rep']
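If you want the replaced texts back in the DataFrame (as in the expected output above), one way is, for example:

df["replaced"] = df["old"].map(processor.replace_keywords)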

It can also be adapted flexibly to different languages if needed, using the processor.non_word_boundaries attribute.
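For example, for Cyrillic texts one could extend that set so keywords are not matched inside longer words (a rough sketch; adjust the characters to your own language):

# non_word_boundaries is the set of characters flashtext treats as "word"
# characters (letters, digits and "_" by default); extending it prevents
# keyword matches inside longer Cyrillic words
cyrillic = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя"
processor.non_word_boundaries |= set(cyrillic + cyrillic.upper())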

The trie-based search used here is what gives the amazing speedup.

Alexey Trofimov

One solution would be to convert the dictionary into a trie and write the code so that you pass through the text only once.

Basically, you advance through the text and the trie one character at a time, and as soon as a match is found, you replace it.

Of course, if you also need to apply the replacements to already-replaced text, this is harder.
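A rough sketch of how this could look (it replaces the longest match at each position and does no word-boundary handling; treat it as an illustration rather than tuned code):

def build_trie(mapping):
    # nested-dict trie; the replacement string is stored at the terminal node
    trie = {}
    for old, new in mapping.items():
        node = trie
        for ch in old:
            node = node.setdefault(ch, {})
        node["__end__"] = new
    return trie

def replace_with_trie(text, trie):
    out = []
    i = 0
    n = len(text)
    while i < n:
        node = trie
        j = i
        match_end = None
        match_new = None
        # walk the trie as far as the text allows, remembering the longest match
        while j < n and text[j] in node:
            node = node[text[j]]
            j += 1
            if "__end__" in node:
                match_end, match_new = j, node["__end__"]
        if match_new is not None:
            out.append(match_new)   # emit the replacement
            i = match_end           # and skip past the matched substring
        else:
            out.append(text[i])     # no match starting here, keep the character
            i += 1
    return "".join(out)

# toy data from the question:
# trie = build_trie(mapping)
# replace_with_trie("first text to replace", trie)  # -> 'FT to rep'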

Vasile Rotaru

I think you are looking for replace with regex=True on the DataFrame, i.e.

If you have a dictionary, pass it as a parameter.

d = {'old substring':'new substring','anohter':'another'}

For the entire DataFrame:

df.replace(d,regex=True)

For a Series:

df[columns].replace(d,regex=True)

Example

df = pd.DataFrame({"old": ["first text to replace",
                           "second text to replace"]})

mapping = {"first text": "FT",
           "replace": "rep",
           "second": "2nd"}

df['replaced'] = df['old'].replace(mapping,regex=True)
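Note that with regex=True the dictionary keys are treated as regular expressions, so if any of the 15k substrings contain characters like ., + or (, escaping them first may be safer, for example:

import re

escaped = {re.escape(k): v for k, v in mapping.items()}
df['replaced'] = df['old'].replace(escaped, regex=True)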
Bharath M Shetty