
I have a df with 20 million records of string text in df['text'] and over 100 regexes to run against each record to perform replacements.

This is taking too long and unfortunately I cannot use flashtext with regex.

Any advice on how to speed this up?

Below is an example of what I am doing now:

import re

a = re.compile(r'\d{11}')
b = re.compile(r'[a-z]{1}\d{3}')
c = re.compile(r'\d{1}-[a-z]{5}-\d{1}')

for rows in df:
    df['text'] = df['text'].str.replace(a, '', regex=True)
    df['text'] = df['text'].str.replace(b, '', regex=True)
    df['text'] = df['text'].str.replace(c, '', regex=True)
   
1 Answer


It's taking forever because you're updating the entire dataframe once per row, 20 million times. There's no need for the loop; the assignment already operates on the whole column, not one row at a time.

Also, you can do all the replacements in a single pass by combining the regular expressions into one pattern using alternation (|).

df['text'] = df['text'].str.replace(r'\d{11}|[a-z]\d{3}|\d-[a-z]{5}-\d', '', regex=True)

There's no need for {1} in the regular expressions. A pattern matches exactly one time unless you quantify it otherwise.
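With 100+ patterns you don't have to write the alternation by hand. A minimal sketch using plain `re` (the three sample patterns stand in for your real list) that joins them into one compiled regex, wrapping each in a non-capturing group so one pattern's alternation can't bleed into the next:

import re

# Stand-in for your ~100 patterns.
patterns = [r'\d{11}', r'[a-z]\d{3}', r'\d-[a-z]{5}-\d']

# (?:...) keeps each alternative self-contained inside the big alternation.
combined = re.compile('|'.join(f'(?:{p})' for p in patterns))

text = 'id 12345678901 code a123 ref 1-abcde-2 done'
print(combined.sub('', text))  # prints 'id  code  ref  done'

The compiled pattern can then be passed straight to pandas: df['text'] = df['text'].str.replace(combined, '', regex=True).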

Barmar