
I have a list of 4,000 strings that I need to remove from a pandas dataframe column. The code below works fine on the small sample, but when I run it on my real dataframe of 20k+ rows it takes forever. Any ideas on speeding this up?

import pandas as pd
import re

df = pd.DataFrame(
    {
        "ID": [1, 2, 3, 4, 5],
        "name": [
            "Hello Sam how is it going today? oh yeah",
            "Hello Jane how is it going today? oh yeah",
            "It is an Hello example how are you doing today?",
            "how is it going today?n[soldjgf   ",
            "how is it going today Hello World",
        ],
    }
)


my_list = ['how is it going today?n[soldjgf', 'how are you doing today?']
p = re.compile('|'.join(map(re.escape, my_list)))
df['cleaned_text'] = [p.sub(' ', text) for text in df['name']] 
– codingInMyBasement
  • You could try this: https://stackoverflow.com/a/42747503/4001592 – Dani Mesejo Oct 19 '19 at 02:13
  • 1
    `str.replace()` might be faster in this case. Also, `.apply` will be much faster than list comprehension (although it is probably not the main source of overhead). – Marat Oct 19 '19 at 02:16
  • Are the column entries long strings that have to be searched for multiple banned values? Or will they just consist of a banned value by itself, when banning is needed? – Karl Knechtel Oct 19 '19 at 02:41
  • @KarlKnechtel - multiple banned values. There is other text within the long strings in the column entries that I need to keep; I just need to eliminate the parts that are contained within my list. – codingInMyBasement Oct 19 '19 at 02:59
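
The answer linked in the first comment gets its speedup by building the alternation as a trie, so common prefixes are shared and the regex engine no longer tries all 4,000 alternatives at every position. A minimal sketch of that idea (my own condensed version, not the code from the linked answer; the helper name `trie_regex` is mine):

import re

def trie_regex(words):
    # Build a character trie over all banned strings.
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[''] = True  # end-of-word marker

    # Serialize the trie into one pattern with shared prefixes.
    def to_pattern(node):
        if '' in node and len(node) == 1:
            return ''  # leaf: a word ends here, nothing more to match
        optional = '' in node  # a shorter word also ends at this node
        alts = [re.escape(ch) + to_pattern(child)
                for ch, child in sorted(node.items()) if ch != '']
        if len(alts) == 1 and not optional:
            return alts[0]
        pattern = '(?:' + '|'.join(alts) + ')'
        return pattern + '?' if optional else pattern

    return re.compile(to_pattern(trie))

p = trie_regex(my_list)  # drop-in replacement for the '|'.join pattern

For the two sample strings the difference is invisible, but a flat `'|'.join` of thousands of alternatives makes the engine retry each one at every character, which is exactly the work the trie form avoids.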

1 Answer


Use `df['name'].str.replace()` instead of the list comprehension:

p = re.compile('|'.join(map(re.escape, my_list)))

# regex=True is required on pandas >= 2.0, where str.replace()
# treats the pattern as a literal string by default.
df['cleaned_text'] = df['name'].str.replace(p, ' ', regex=True)
– RootTwo
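
If you want to see which variant actually wins on your real data, a quick `timeit` harness (a sketch that assumes the `df`, `my_list`, and compiled pattern from above; results depend heavily on the data, so measure locally):

import re
import timeit

pattern = re.compile('|'.join(map(re.escape, my_list)))

list_comp = lambda: [pattern.sub(' ', t) for t in df['name']]
vectorized = lambda: df['name'].str.replace(pattern, ' ', regex=True)

# Each call runs the full cleaning pass 100 times.
print('list comprehension:', timeit.timeit(list_comp, number=100))
print('str.replace:       ', timeit.timeit(vectorized, number=100))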