
I have 10M texts (they fit in RAM) and a Python dictionary of the form:

"old substring":"new substring"

The dictionary contains ~15k substrings.

I am looking for the FASTEST way to apply the dict to every text (i.e. find every "old substring" in every text and replace it with "new substring").

The source texts are in a pandas DataFrame. So far I have tried these approaches:

1) Replace in a loop with reduce and str.replace (~120 rows/sec):

from functools import reduce

replaced = []
for row in df.itertuples():
    replaced.append(reduce(lambda x, y: x.replace(y, mapping[y]), mapping, row[1]))

2) In a loop with a simple replace function ("mapping" is the 15k dict) (~160 rows/sec):

def string_replace(text):
    for key in mapping:
        text = text.replace(key, mapping[key])
    return text

from tqdm import tqdm

replaced = []
for row in tqdm(df.itertuples()):
    replaced.append(string_replace(row[1]))

Also, .iterrows() is ~20% slower than .itertuples().

3) Using apply on Series (also ~160 rows/sec):

replaced = df['text'].apply(string_replace)

At these speeds it takes hours to process the whole dataset.

Does anyone have experience with this kind of mass substring replacement? Is it possible to speed it up? The solution can be tricky or ugly, but it has to be as fast as possible; it does not have to use pandas.

Thanks.

UPDATED: Toy data to check the idea:

df = pd.DataFrame({"old": ["first text to replace",
                           "second text to replace"]})

mapping = {"first text": "FT", 
           "replace": "rep",
           "second": '2nd'}

Expected result:

                      old         replaced
0   first text to replace        FT to rep
1  second text to replace  2nd text to rep
Alexey Trofimov
    Check [Replace values in pandas Series with dictionary](https://stackoverflow.com/questions/40075106/replace-values-in-pandas-series-with-dictionary). – Wiktor Stribiżew Sep 15 '17 at 11:24
  • Thank you Wiktor, I now see the regex=True idea, but it is much slower than the simple approaches in the original post. – Alexey Trofimov Sep 15 '17 at 11:55

3 Answers


I've come back to this problem and found a fantastic library called flashtext.

The speedup on 10M records with a 15k vocabulary is about 100x (really, one hundred times faster than regex or the other approaches from my original post)!

Very easy to use:

df = pd.DataFrame({"old": ["first text to replace",
                           "second text to replace"]})

mapping = {"first text": "FT", 
           "replace": "rep",
           "second": '2nd'}

import flashtext
processor = flashtext.KeywordProcessor()

for k, v in mapping.items():
    processor.add_keyword(k, v)

print(list(map(processor.replace_keywords, df["old"])))

Result:

['FT to rep', '2nd text to rep']
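If you want the replaced texts back in the DataFrame (as in the expected output above), one way is, for example:

df["replaced"] = df["old"].map(processor.replace_keywords)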

It can also be adapted flexibly to different languages if needed, using the processor.non_word_boundaries attribute.
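For example, for Cyrillic texts one could extend that set so keywords are not matched inside longer words (a rough sketch; adjust the characters to your own language):

# non_word_boundaries is the set of characters flashtext treats as "word"
# characters (letters, digits and "_" by default); extending it prevents
# keyword matches inside longer Cyrillic words
cyrillic = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя"
processor.non_word_boundaries |= set(cyrillic + cyrillic.upper())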

The trie-based search used here is what gives the amazing speedup.

Alexey Trofimov

One solution would be to convert the dictionary into a trie and write the code so that you pass through the text only once.

Basically, you advance through the text and the trie one character at a time, and as soon as a match is found, you replace it.

Of course, if you also need to apply the replacements to already-replaced text, this is harder.
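A rough sketch of how this could look (it replaces the longest match at each position and does no word-boundary handling; treat it as an illustration rather than tuned code):

def build_trie(mapping):
    # nested-dict trie; the replacement string is stored at the terminal node
    trie = {}
    for old, new in mapping.items():
        node = trie
        for ch in old:
            node = node.setdefault(ch, {})
        node["__end__"] = new
    return trie

def replace_with_trie(text, trie):
    out = []
    i = 0
    n = len(text)
    while i < n:
        node = trie
        j = i
        match_end = None
        match_new = None
        # walk the trie as far as the text allows, remembering the longest match
        while j < n and text[j] in node:
            node = node[text[j]]
            j += 1
            if "__end__" in node:
                match_end, match_new = j, node["__end__"]
        if match_new is not None:
            out.append(match_new)   # emit the replacement
            i = match_end           # and skip past the matched substring
        else:
            out.append(text[i])     # no match starting here, keep the character
            i += 1
    return "".join(out)

# toy data from the question:
# trie = build_trie(mapping)
# replace_with_trie("first text to replace", trie)  # -> 'FT to rep'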

Vasile Rotaru

I think you are looking for replace with regex=True on the DataFrame, i.e.

If you have a dictionary, pass it as a parameter.

d = {'old substring':'new substring','anohter':'another'}

For the entire DataFrame:

df.replace(d,regex=True)

For a Series:

df[columns].replace(d,regex=True)

Example

df = pd.DataFrame({"old": ["first text to replace",
                           "second text to replace"]})

mapping = {"first text": "FT",
           "replace": "rep",
           "second": "2nd"}

df['replaced'] = df['old'].replace(mapping,regex=True)
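Note that with regex=True the dictionary keys are treated as regular expressions, so if any of the 15k substrings contain characters like ., + or (, escaping them first may be safer, for example:

import re

escaped = {re.escape(k): v for k, v in mapping.items()}
df['replaced'] = df['old'].replace(escaped, regex=True)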
Bharath M Shetty