
I have a df with 20 million records of string text in df['text'] and over 100 regexes to run against each record to perform replacements.

This is taking too long and unfortunately I cannot use flashtext with regex.

Any advice on how to speed this up?

Below is an example of what I am doing now:

import re

a = re.compile(r'\d{11}')
b = re.compile(r'[a-z]{1}\d{3}')
c = re.compile(r'\d{1}-[a-z]{5}-\d{1}')

for rows in df:
    df['text'] = df['text'].str.replace(a, '', regex=True)
    df['text'] = df['text'].str.replace(b, '', regex=True)
    df['text'] = df['text'].str.replace(c, '', regex=True)
   
1 Answer


It's taking forever because you're updating the entire dataframe once per row, 20 million times. There's no need for the loop; the assignment already operates on the whole column, not one row at a time.

Also, you can do all the replacements in a single pass by combining the regular expressions into one pattern using alternation (|).

df['text'] = df['text'].str.replace(r'\d{11}|[a-z]\d{3}|\d-[a-z]{5}-\d', '', regex=True)

There's no need for {1} in the regular expressions. A pattern matches exactly one time unless you quantify it otherwise.
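With 100+ patterns you don't have to write the alternation by hand. A minimal sketch using plain `re` (the three sample patterns stand in for your real list) that joins them into one compiled regex, wrapping each in a non-capturing group so one pattern's alternation can't bleed into the next:

import re

# Stand-in for your ~100 patterns.
patterns = [r'\d{11}', r'[a-z]\d{3}', r'\d-[a-z]{5}-\d']

# (?:...) keeps each alternative self-contained inside the big alternation.
combined = re.compile('|'.join(f'(?:{p})' for p in patterns))

text = 'id 12345678901 code a123 ref 1-abcde-2 done'
print(combined.sub('', text))  # prints 'id  code  ref  done'

The compiled pattern can then be passed straight to pandas: df['text'] = df['text'].str.replace(combined, '', regex=True).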

Barmar